Splitting text in Python

Reposted

mob6454cc6cee7e 2023-05-26 18:26:51


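The naive Bayes chapter of Machine Learning in Action discusses how to split text into tokens. The usual approach is the string method split(), but it splits only on whitespace, so it cannot strip punctuation from tokens such as M.L.

For example: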

mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'

mySent.split()
Out[23]: 
['This','book','is','the','best','book','on','Python','or','M.L.','I','have','ever','laid','eyes','upon.']

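Clearly, 'M.L.' and 'upon.' still carry their punctuation, so they are not the tokens we want to extract.

A better option is to split the sentence with a regular expression from the re module, using any run of characters other than letters, digits, and underscores as the delimiter.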

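import re
# \W+ matches one or more non-word characters. \W* would also match the empty
# string: older Python 3 releases emit "FutureWarning: split() requires a
# non-empty pattern match" for it, and Python 3.7+ splits at every empty match
# as well, which breaks words into single characters.
regEx=re.compile(r'\W+')
lists=regEx.split(mySent)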

lists
Out[28]: 
['This',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'Python',
 'or',
 'M',
 'L',
 'I',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon',
 '']
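
The trailing '' in the list above appears because the sentence ends with a delimiter: the final '.' is matched, and split() returns the (empty) piece that follows it. The sketch below illustrates this, assuming the pattern r'\W+'; the sample strings are made up for illustration and are not from the original article.

```python
import re

# Delimiter: one or more non-word characters (assumed pattern).
pattern = re.compile(r'\W+')

# Hypothetical sample strings, chosen only to show where empty pieces appear.
trailing = pattern.split('eyes upon.')    # delimiter at the end
leading = pattern.split('...eyes upon')   # delimiter at the start
clean = pattern.split('eyes upon')        # no delimiter at either end

print(trailing)  # ['eyes', 'upon', '']
print(leading)   # ['', 'eyes', 'upon']
print(clean)     # ['eyes', 'upon']
```

An empty string can therefore show up at either end of the result, which is why a filtering step is still needed after the split.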

Finally, filter on string length to drop the empty string '' and convert each token to lowercase:

[tok.lower() for tok in lists if len(tok)>0]
Out[29]: 
['this',
 'book',
 'is',
 'the',
 'best',
 'book',
 'on',
 'python',
 'or',
 'm',
 'l',
 'i',
 'have',
 'ever',
 'laid',
 'eyes',
 'upon']
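
The steps above can be gathered into one small helper. This is a minimal sketch rather than the book's own code: the names tokenize, tokenize_findall, and my_sent are hypothetical, and the pattern r'\W+' is assumed. The second function is an alternative technique that matches the words directly with re.findall instead of splitting on the delimiters, so no empty strings are produced in the first place.

```python
import re

# Assumed delimiter pattern: one or more non-word characters.
_NON_WORD = re.compile(r'\W+')


def tokenize(text):
    """Split on runs of non-word characters, drop empty pieces, lowercase."""
    return [tok.lower() for tok in _NON_WORD.split(text) if tok]


def tokenize_findall(text):
    """Alternative: collect runs of word characters directly."""
    return [tok.lower() for tok in re.findall(r'\w+', text)]


my_sent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
tokens = tokenize(my_sent)
print(tokens)
print(tokenize_findall(my_sent) == tokens)  # True
```

Note that in Python 3 the \w class also covers underscores and non-ASCII letters, so both helpers keep such characters inside tokens.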
