I. Text Tokenization
1. Sentence tokenization: the process of breaking a text corpus down into sentences.
For sentence tokenization we use the NLTK framework, which provides several interfaces for the task: sent_tokenize, PunktSentenceTokenizer, RegexpTokenizer, and pretrained sentence tokenization models.
import nltk
from pprint import pprint#pprint works much like print, but pretty-prints data structures more completely, one element per line
sample_text='We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'
#Method 1
sample_sentences=nltk.sent_tokenize(text=sample_text)
#Method 2
punkt_st=nltk.tokenize.PunktSentenceTokenizer()
sample_sentences=punkt_st.tokenize(sample_text)
pprint(sample_sentences)
>>>>
['We will discuss briefly about the basic syntax, structure and design '
'philosophies.',
'There is a defined hierarchical syntax for Python code which you should '
'remember when writing code!',
'Python is a really powerful programming language!']
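RegexpTokenizer, listed above, can also split sentences by treating end-of-sentence punctuation as a gap. A minimal sketch (the pattern here is an illustrative assumption, not a pretrained model):

```python
from nltk.tokenize import RegexpTokenizer

# split wherever whitespace follows ., ! or ? (illustrative pattern)
SENTENCE_PATTERN = r'(?<=[.!?])\s+'
regex_st = RegexpTokenizer(pattern=SENTENCE_PATTERN, gaps=True)

text = 'This is one sentence. Here is another! And a third?'
print(regex_st.tokenize(text))
```

With gaps=True the pattern marks the separators rather than the tokens, so the text between matches is returned.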
2. Word tokenization: the process of splitting a sentence into its component words.
Still within the NLTK framework, the main interfaces are word_tokenize, TreebankWordTokenizer, RegexpTokenizer, and tokenizers derived from RegexpTokenizer.
import nltk
sentence="The brown fox wasn't that quick and he couldn't win the race"
words=nltk.word_tokenize(sentence)
print(words)
treebank_wk=nltk.TreebankWordTokenizer()
words=treebank_wk.tokenize(sentence)
print(words)
>>>>
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
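For comparison, the RegexpTokenizer mentioned above tokenizes words with an explicit pattern you supply. A minimal sketch:

```python
from nltk.tokenize import RegexpTokenizer

# keep runs of word characters; punctuation is silently dropped
regex_wt = RegexpTokenizer(pattern=r'\w+', gaps=False)
words = regex_wt.tokenize("The brown fox wasn't that quick")
print(words)
```

Note that this pattern breaks "wasn't" into 'wasn' and 't', unlike word_tokenize, which yields 'was' and "n't".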
II. Text Normalization
The following code loads the basic dependencies and the corpus we will use:
import nltk
import re
import string
from pprint import pprint
corpus=["The brown fox wasn't that quick and couldn't win the race","Hey that's a great deal! I just bought a phone for $199", "@@You'll(learn) a **lot** in the book . Python is amazing language!@@"]
1. Text cleaning: removing irrelevant or unnecessary tokens and characters.
2. Text tokenization:
#text tokenization
def tokenize_text(text):
    sentences=nltk.sent_tokenize(text)
    word_tokens=[nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens
token_list=[tokenize_text(text) for text in corpus]
pprint(token_list)
>>>>
[[['The',
'brown',
'fox',
'was',
"n't",
'that',
'quick',
'and',
'could',
"n't",
'win',
'the',
'race']],
[['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
[['@',
'@',
'You',
"'ll",
'(',
'learn',
')',
'a',
'*',
'*',
'lot',
'*',
'*',
'in',
'the',
'book',
'.'],
['Python', 'is', 'amazing', 'language', '!'],
['@', '@']]]
3. Removing special characters
def remove_characters_after_tokenization(tokens):
    pattern=re.compile('[{}]'.format(re.escape(string.punctuation)))#strip punctuation characters
    filtered_tokens=list(filter(None,[pattern.sub('',token) for token in tokens]))
    return filtered_tokens
#The book's version below didn't work for me (in Python 3, filter returns a lazy iterator, so this prints filter objects instead of lists)
filtered_list_1=[filter(None,[remove_characters_after_tokenization(tokens) for tokens in sentence_tokens]) for sentence_tokens in token_list]
print(filtered_list_1)
#Here is my revised version:
sentence_list=[]
for sentence_tokens in token_list:
    for tokens in sentence_tokens:
        print(tokens)
        sentence_list.append(remove_characters_after_tokenization(tokens))
>>>>>
#the results no longer contain any special characters
[['The',
'brown',
'fox',
'was',
'nt',
'that',
'quick',
'and',
'could',
'nt',
'win',
'the',
'race'],
['Hey', 'that', 's', 'a', 'great', 'deal'],
['I', 'just', 'bought', 'a', 'phone', 'for', '199'],
['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book'],
['Python', 'is', 'amazing', 'language'],
[]]
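A complementary approach is to clean the raw string before tokenizing, so contractions like "wasn't" keep their apostrophes. A minimal sketch (the character sets in the patterns are illustrative assumptions):

```python
import re

def remove_characters_before_tokenization(sentence, keep_apostrophes=True):
    if keep_apostrophes:
        # drop only selected special characters, preserving apostrophes
        pattern = r'[?$&*%@()~]'
    else:
        # keep only letters, digits and spaces
        pattern = r'[^a-zA-Z0-9 ]'
    return re.sub(pattern, '', sentence)

print(remove_characters_before_tokenization(
    "Hey that's a great deal! I bought a phone for $199"))
```

Cleaning before tokenization is what makes the contraction-expansion step below possible, since "wasn't" survives intact.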
4. Expanding contractions
import contractions#contractions here is the book's companion module providing the CONTRACTION_MAP dict, not the PyPI package
from contractions import CONTRACTION_MAP
def expand_contractions(sentence,contraction_mapping):
    contractions_pattern=re.compile('({})'.format('|'.join(contraction_mapping.keys())),flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match=contraction.group(0)
        first_char=match[0]
        expanded_contraction=contraction_mapping.get(match)\
            if contraction_mapping.get(match)\
            else contraction_mapping.get(match.lower())
        expanded_contraction=first_char+expanded_contraction[1:]
        return expanded_contraction
    expanded_sentence=contractions_pattern.sub(expand_match,sentence)
    return expanded_sentence
#the function expects raw sentence strings, not token lists, so apply it to the corpus
expanded_corpus=[expand_contractions(sentence,CONTRACTION_MAP) for sentence in corpus]
print(expanded_corpus)
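Since CONTRACTION_MAP comes from a module shipped with the book, here is a self-contained version of the same technique with a tiny inline map (the two entries are illustrative assumptions; the real map has many more):

```python
import re

# tiny illustrative map; the book's CONTRACTION_MAP is far larger
SMALL_MAP = {"wasn't": "was not", "couldn't": "could not"}

def expand(sentence, mapping):
    pattern = re.compile('({})'.format('|'.join(mapping.keys())),
                         flags=re.IGNORECASE)
    def expand_match(m):
        match = m.group(0)
        expanded = mapping.get(match) or mapping.get(match.lower())
        return match[0] + expanded[1:]  # preserve the original first character's case
    return pattern.sub(expand_match, sentence)

print(expand("The fox wasn't quick and couldn't win", SMALL_MAP))
```

Keeping the matched first character means a sentence-initial "Wasn't" expands to "Was not" rather than "was not".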
5. Case conversion
print(corpus[0].lower())
print(corpus[0].upper())
>>>>>
the brown fox wasn't that quick and couldn't win the race
THE BROWN FOX WASN'T THAT QUICK AND COULDN'T WIN THE RACE
6. Removing stopwords (words that carry little or no meaning)
def remove_stopwords(tokens):
    stopword_list=nltk.corpus.stopwords.words('english')
    filtered_tokens=[token.lower() for token in tokens if token.lower() not in stopword_list]
    return filtered_tokens
corpus_tokens=[tokenize_text(text) for text in corpus]#first split the corpus with the tokenize_text function defined earlier
filtered_list_3=[[remove_stopwords(tokens) for tokens in sentence_tokens] for sentence_tokens in corpus_tokens]
>>>>>> compare the following results
stopword_list#all entries are lowercase
Out[69]:
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
...]
corpus_tokens
Out[68]:
[[['The',
'brown',
'fox',
'was',
"n't",
'that',
'quick',
'and',
'could',
"n't",
'win',
'the',
'race']],
[['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
[['@',
'@',
'You',
"'ll",
'(',
'learn',
')',
'a',
'*',
'*',
'lot',
'*',
'*',
'in',
'the',
'book',
'.'],
['Python', 'is', 'amazing', 'language', '!'],
['@', '@']]]
filtered_list_3
Out[67]:
[[['brown', 'fox', "n't", 'quick', 'could', "n't", 'win', 'race']],
[['hey', "'s", 'great', 'deal', '!'], ['bought', 'phone', '$', '199']],
[['@', '@', "'ll", '(', 'learn', ')', '*', '*', 'lot', '*', '*', 'book', '.'],
['python', 'amazing', 'language', '!'],
['@', '@']]]
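Note that nltk.corpus.stopwords requires downloading the corpus first (nltk.download('stopwords')). The filtering logic itself can be sketched with a tiny inline list (an illustrative subset, not NLTK's full English list):

```python
# illustrative subset of NLTK's English stopword list
STOPWORDS = {'the', 'a', 'and', 'was', 'that', 'he', 'is', 'for', 'i', 'just'}

def remove_stopwords_small(tokens, stopwords=STOPWORDS):
    # lowercase each token before the membership test, as the real list is all lowercase
    return [t.lower() for t in tokens if t.lower() not in stopwords]

tokens = ['The', 'brown', 'fox', 'was', "n't", 'that', 'quick']
print(remove_stopwords_small(tokens))
```

As in the output above, "n't" survives because it is not in the stopword list, which is worth remembering: dropping 'was' but keeping "n't" can distort negations.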
7. Correcting repeated characters
Informal English often repeats characters for emphasis.
#correct repeated characters
import nltk
import re
from nltk.corpus import wordnet
def remove_repeated_characters(tokens):
    repeat_pattern=re.compile(r'(\w*)(\w)\2(\w*)')#matches a character immediately followed by a duplicate of itself
    match_substitution=r'\1\2\3'#the substitution drops one duplicated character per pass
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word#if the word already exists in WordNet, keep it as-is
        new_word=repeat_pattern.sub(match_substitution,old_word)
        return replace(new_word) if new_word!=old_word else new_word
    correct_tokens=[replace(word) for word in tokens]
    return correct_tokens
sample_sentences="My school is reallllly amaaazningggg"
sample_sentence=tokenize_text(sample_sentences)
print(remove_repeated_characters(sample_sentence[0]))
>>>>>>>>
sample_sentence
Out[24]: [['My', 'school', 'is', 'reallllly', 'amaaazningggg']]
print(remove_repeated_characters(sample_sentence[0]))
['My', 'school', 'is', 'really', 'amazning']
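The substitution removes only one duplicated character per pass, which is why the recursion and the WordNet check both matter. A standalone trace without the corpus check (the sample word is an illustrative assumption) shows the over-correction risk:

```python
import re

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
word = 'finalllyyy'
steps = [word]
while True:
    new_word = repeat_pattern.sub(r'\1\2\3', word)  # drop one duplicate per pass
    if new_word == word:
        break
    word = new_word
    steps.append(word)
print(steps)
```

The trace passes through 'finally', a valid word; with the WordNet check the loop would stop there, but without it the word is over-corrected to 'finaly'.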
8. Stemming
A stem is the base form of a word; new words are formed by attaching affixes to a stem, and a stem is not necessarily a valid dictionary word itself. NLTK includes several implementations: PorterStemmer, LancasterStemmer, RegexpStemmer, and SnowballStemmer, each based on a different algorithm.
#stemming with PorterStemmer
from nltk.stem import PorterStemmer
ps=PorterStemmer()
print(ps.stem('jumping'),ps.stem('jumps'),ps.stem('jumped'),ps.stem('lying'),ps.stem('strange'))
>>>>>>
jump jump jump lie strang
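The other stemmers listed above can behave differently on the same inputs; a quick comparison sketch (the regex pattern handed to RegexpStemmer is an illustrative assumption):

```python
from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer

ls = LancasterStemmer()
# RegexpStemmer strips whatever the user-supplied pattern matches
rs = RegexpStemmer('ing$|s$|ed$', min=4)
ss = SnowballStemmer('english')

for word in ('jumping', 'jumps', 'lying'):
    print(word, '->', ls.stem(word), rs.stem(word), ss.stem(word))
```

RegexpStemmer is entirely rule-driven by your pattern, while Porter, Lancaster, and Snowball apply built-in suffix-stripping rules of increasing aggressiveness.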
9. Lemmatization
#lemmatization: a lemma is always a dictionary word
from nltk.stem import WordNetLemmatizer
wnl=WordNetLemmatizer()
print(wnl.lemmatize('cars','n'))
print(wnl.lemmatize('running','v'))
>>>>>>
car
run
That covers the core techniques for processing, normalizing, and standardizing text.