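Contents
- Preface
- Language Processing and Python
- I. Computing with Language: Texts and Words
- 1. Getting Started with NLTK
- (1) Installation (nltk, nltk.book)
- (2) Searching Text
- (3) Counting Vocabulary
- 2. Lists and Strings
- (1) List Operations
- (2) Indexing Lists
- (3) Variables
- (4) Strings
- II. Computing with Language: Simple Statistics
- 1. Frequency Distributions
- 2. Fine-Grained Selection of Words
- (1) Words Longer Than 15 Characters
- (2) Frequently Occurring Long Words
- (3) Extracting Word Pairs (Bigrams)
- (4) Extracting Frequent Bigrams (Collocations) from a Text
- 3. Counting Other Things
- (1) Distribution of Word Lengths in a Text
- (2) [w for w in text if condition]
- (3) Conditionals and Loops
- III. Understanding Natural Language
- IV. Exercises
Original book: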
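Preface
Broadly speaking, "Natural Language Processing" (NLP) covers any kind of computer manipulation of natural language.
NLTK defines an infrastructure for building NLP programs in Python. It provides basic classes for representing data relevant to natural language processing; standard interfaces for tasks such as part-of-speech tagging, syntactic parsing, and text classification; and standard implementations of each task, which can be combined to solve complex problems.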
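Language processing tasks, the corresponding NLTK modules, and their functionality
Language processing task | NLTK modules | Functionality |
Accessing corpora | corpus | Standardized interfaces to corpora and lexicons |
String processing | tokenize, stem | Tokenizers, sentence tokenizers, stemmers |
Collocation discovery | collocations | t-test, chi-squared, pointwise mutual information (PMI) |
Part-of-speech tagging | tag | n-gram, backoff, Brill, HMM, TnT |
Machine learning | classify, cluster, tbl | Decision tree, maximum entropy, naive Bayes, EM, k-means |
Chunking | chunk | Regular expression, n-gram, named entity |
Parsing | parse, ccg | Chart, feature-based, unification, probabilistic, dependency |
Semantic interpretation | sem, inference | Lambda calculus, first-order logic, model checking |
Evaluation metrics | metrics | Precision, recall, agreement coefficients |
Probability and estimation | probability | Frequency distributions, smoothed probability distributions |
Applications | app, chat | Graphical concordancer, parsers, WordNet browser, chatbots |
Linguistic fieldwork | toolbox | Manipulate data in SIL Toolbox format |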
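Language Processing and Python
I. Computing with Language: Texts and Words
1. Getting Started with NLTK
(1) Installation (nltk, nltk.book)
Install the data for nltk.book:
import nltk
nltk.download()
Use nltk.download() to browse the available packages. The Collections tab of the downloader shows how the packages are grouped into sets; select the line labeled book to obtain all the data needed for the book's examples and exercises.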
from nltk.book import *
print("text1 : ", text1)
print("text2 : ", text2)
Output
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1 : <Text: Moby Dick by Herman Melville 1851>
text2 : <Text: Sense and Sensibility by Jane Austen 1811>
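(2) Searching Text
# Concordance: show every occurrence of "monstrous" in text1 together with its surrounding context.
# concordance(), similar(), and common_contexts() print their results themselves and return None,
# which is why each print() wrapper below also emits a line reading None.
print(text1.concordance("monstrous"))
# Find words in text1 that appear in contexts similar to those of "monstrous"
print(text1.similar("monstrous"))
# Find the contexts shared by two words in text2
print(text2.common_contexts(["monstrous", "very"]))
# Draw a lexical dispersion plot showing where each word occurs across text4
print(text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]))
Output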
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
None
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
None
a_pretty am_glad a_lucky is_pretty be_glad
None
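To make the idea of a concordance concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation: the helper name `concordance_lines`, the `window` parameter, and the toy sentence are all assumptions made for illustration. It matches case-insensitively, as the output above suggests NLTK does ("Monstrous Pictures" is listed among the matches).

```python
def concordance_lines(tokens, word, window=3):
    """Return every occurrence of `word` with `window` tokens of context on each side.

    Hypothetical helper for illustration only; NLTK's Text.concordance()
    works on character widths and prints its result instead of returning it.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(' '.join(left + [token] + right))
    return lines


# Toy token list (an assumption, not taken from text1)
tokens = "It was a most monstrous size , and a Monstrous bulk of whale".split()
hits = concordance_lines(tokens, "monstrous", window=2)
for line in hits:
    print(line)
```

Because the helper returns a list rather than printing, wrapping it in print() would not produce a stray None.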
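(3) Counting Vocabulary
# Total number of tokens in text3
print(len(text3))
# Sorted distinct tokens; note that uppercase letters sort before lowercase letters
print(sorted(set(text3)))
# Number of distinct tokens
print(len(set(text3)))
# Lexical diversity: distinct tokens make up about 6% of all tokens, so each word is used about 16 times on average
print(len(set(text3)) / len(text3))
# Number of times "smote" occurs in the text
print(text3.count("smote"))
# Percentage of the tokens in text4 that are the word 'a'
print(100 * text4.count('a') / len(text4))
print('--------------------'*2)
# Compute lexical diversity
def lexical_diversity(text):
    return len(set(text)) / len(text)
# Compute the percentage of the tokens in text that are word
def percentage(word, text):
    return 100 * text.count(word) / len(text)
print(lexical_diversity(text3))
print(lexical_diversity(text5))
print(percentage('a', text4))
Output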
44764
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', ..., 'With', 'Woman', 'Ye', 'Yea', 'Yet', 'Zaavan', 'Zaphnathpaaneah', 'Zar', 'Zarah', 'Zeboiim', 'Zeboim', 'Zebul', 'Zebulun', 'Zemarite', 'Zepho', 'Zerah', 'Zibeon', 'Zidon', 'Zillah', 'Zilpah', 'Zimran', 'Ziphion', 'Zo', 'Zoar', 'Zohar', 'Zuzims', 'a', 'abated', 'abide', 'able', 'abode', 'abomination', 'about', 'above', 'abroad', 'absent', 'abundantly', 'accept', 'accepted', 'according', 'acknowledged', 'activity', 'add', ..., 'yielded', 'yielding', 'yoke', 'yonder', 'you', 'young', 'younge', 'younger', 'youngest', 'your', 'yourselves', 'youth']
2789
0.06230453042623537
5
1.4643016433938312
----------------------------------------
0.06230453042623537
0.13477005109975562
1.4643016433938312
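The link between lexical diversity and average word use is just a reciprocal, which a short calculation makes explicit. The sketch below only reuses the two counts printed above for text3 (44764 tokens, 2789 distinct tokens); the variable names are illustrative.

```python
# Counts copied from the output above for text3
total_tokens = 44764
distinct_tokens = 2789

# Lexical diversity: the share of tokens that are distinct
diversity = distinct_tokens / total_tokens

# Average number of times each distinct token is used: the reciprocal
avg_uses = total_tokens / distinct_tokens

print(round(diversity, 4))  # 0.0623, i.e. about 6%
print(round(avg_uses, 2))   # 16.05, i.e. about 16 uses per word
```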
2. Lists and Strings
(1) List Operations
print('sent2 : ', sent2)
# Concatenation: combine several lists into one
print('List : ', ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail'])
# Appending: add a single element to the end of a list
print('sent1 : ', sent1)
sent1.append("Some")
print('sent1 : ', sent1)
Output
sent2 : ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
List : ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
sent1 : ['Call', 'me', 'Ishmael', '.']
sent1 : ['Call', 'me', 'Ishmael', '.', 'Some']
(2) Indexing Lists
# Look up a token by its index
print(text4[173])
# Find the index of a token's first occurrence
print(text4.index('awaken'))
# Slicing: extract an arbitrary piece of a large text, i.e. a sublist
print(text5[16715:16735])
print(text6[1600:1625])
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']
print(sent[5:8]) # sent[5], sent[6], sent[7]
print(sent[0])
print(sent[9])
sent[0] = 'First'
sent[9] = 'Last'
# Replace an entire slice with new content
sent[1:9] = ['Second', 'Third']
print(sent)
# The list now has only four elements, so indexing past them raises an IndexError
# print(sent[9])
Output
awaken
173
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
['word6', 'word7', 'word8']
word1
word10
['First', 'Second', 'Third', 'Last']
# Traceback (most recent call last):
# File "/home/jie/Jie/codes/nlp/1_nltk.py", line 60, in <module>
# print(sent[9])
# IndexError: list index out of range
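The slicing rules used above can be checked on a small list without loading any corpus. This is a minimal sketch with made-up data; it shows that a slice excludes its end index, that negative indexes count from the end, and that slice assignment can change the length of a list.

```python
words = ['word%d' % n for n in range(1, 11)]  # 'word1' ... 'word10'

middle = words[5:8]      # items at indexes 5, 6, and 7; index 8 is excluded
last_two = words[-2:]    # negative indexes count from the end

words[1:9] = ['Second', 'Third']  # eight items replaced by two
length_after = len(words)

# Index 9 no longer exists, so the lookup raises IndexError
try:
    words[9]
    outcome = 'ok'
except IndexError:
    outcome = 'IndexError'

print(middle, last_two, words, length_after, outcome)
```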
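(3) Variables
Form: variable = expression
(4) Strings
Some of the methods used to access list elements also work on individual words, or strings.
name = 'Monty'
# Indexing and slicing
print(name[0])
print(name[:4])
# Multiplication and addition
print(name * 2)
print(name + '!')
Output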
M
Mont
MontyMonty
Monty!
Converting between strings and lists
print(' '.join(['Monty', 'Python']))
print('Monty Python'.split())
Output
Monty Python
['Monty', 'Python']
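II. Computing with Language: Simple Statistics
1. Frequency Distributions
Frequency distribution: how often each vocabulary item occurs in a text.
# Frequency distribution: how the total number of word tokens is distributed across the vocabulary items
fdist1 = FreqDist(text1)
print(fdist1)
print(fdist1.most_common(50))
print(fdist1['whale']) # number of occurrences of "whale"
# Cumulative frequency plot
# Cumulative frequency plot of the 50 most common words in Moby Dick: together they account for nearly half of all tokens.
fdist1.plot(50, cumulative=True)
# Words that occur only once (hapaxes)
print(fdist1.hapaxes())
Output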
<FreqDist with 19317 samples and 260819 outcomes>
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
906
['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly', 'flags', 'mortality', 'signification', 'HACKLUYT', 'Sw',...'suction', 'closing', 'Ixion', 'Till', 'liberated', 'Buoyed', 'dirgelike', 'padlocks', 'sheathed', 'retracing', 'orphan']
Figure: cumulative frequency plot of the 50 most common words in Moby Dick; these words account for nearly half of all tokens.
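What a frequency distribution computes can be sketched with the standard library alone. The code below uses collections.Counter as a stand-in for FreqDist (an assumption made so the example runs without NLTK data) on a made-up token list, and reproduces three of the ideas above: the most common items, the hapaxes, and the share of all tokens covered by the top items, which is what the cumulative plot shows.

```python
from collections import Counter

# Toy token list (an assumption, not taken from text1)
tokens = "the whale , the sea and the white whale".split()

counts = Counter(tokens)              # token -> number of occurrences
top2 = counts.most_common(2)          # the two most frequent tokens
hapaxes = [w for w, c in counts.items() if c == 1]  # tokens seen exactly once

# Share of all tokens covered by the top two items
total = sum(counts.values())
top2_share = sum(c for _, c in top2) / total

print(top2)
print(hapaxes)
print(top2_share)
```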
2. Fine-Grained Selection of Words
(1) Words Longer Than 15 Characters
V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words), '\n')
Output
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
(2) Frequently Occurring Long Words
# All words longer than 7 characters that occur more than 7 times
fdist5 = FreqDist(text5)
long_words1 = [w for w in set(text5) if len(w) > 7 and fdist5[w] > 7]
print(long_words1, '\n')
Output
['remember', '((((((((((', 'listening', '#talkcity_adults', 'actually', 'football', 'seriously', 'something', 'innocent', 'everyone', 'Question', 'watching', '#14-19teens', 'anything', 'computer', 'tomorrow', 'together', '........', 'cute.-ass']
(3) Extracting Word Pairs (Bigrams)
bigrams_words = bigrams(['more', 'is', 'said', 'than', 'done'])
print(list(bigrams_words), '\n')
Output
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
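A bigram list is simply each token paired with its successor. As a rough illustration of the idea (not NLTK's own code), the same pairs can be built with zip; the function name `word_pairs` is hypothetical.

```python
def word_pairs(tokens):
    """Pair each token with the one that follows it (hypothetical helper)."""
    return list(zip(tokens, tokens[1:]))


pairs = word_pairs(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

A list of n tokens yields n - 1 pairs, so a single-token list yields none.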
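(4) Extracting Frequent Bigrams (Collocations) from a Text
collocations(): extracts frequently occurring bigrams. It prints its results itself and returns None, so the print() wrapper below also emits None.
print(text4.collocations(), '\n')
Output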
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
None
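The sketch below is a deliberately simplified stand-in for collocation finding: it only counts adjacent pairs and drops those involving a stopword, and it does not reproduce whatever scoring NLTK applies. The function name `frequent_pairs`, its parameters, the stopword set, and the sample sentence are all assumptions for illustration.

```python
from collections import Counter


def frequent_pairs(tokens, min_count=2, stopwords=frozenset()):
    """Count adjacent token pairs, skipping any pair that contains a stopword.

    Simplified illustration only: plain frequency counting, not NLTK's method.
    """
    counts = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a.lower() not in stopwords and b.lower() not in stopwords
    )
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Made-up sample text
tokens = ("fellow citizens of the United States , "
          "the United States thanks its fellow citizens").split()
result = frequent_pairs(tokens, stopwords={'of', 'the', ',', 'its'})
print(result)
```

Filtering stopwords matters here: without it, pairs such as ('the', 'United') would be counted alongside the more meaningful ('United', 'States').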
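3. Counting Other Things
(1) Distribution of Word Lengths in a Text
# Count how many tokens there are of each length
fdist = FreqDist(len(w) for w in text1)
print(fdist)
print(fdist.most_common())
# The most frequent token length
print(fdist.max())
# Number of tokens of length 3
print(fdist[3])
# Relative frequency of tokens of length 3
print(fdist.freq(3))
Output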
<FreqDist with 19 samples and 260819 outcomes>
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
3
50223
0.19255882431878046
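Analysis: the most frequent word length is 3; there are more than 50,000 tokens of length 3, roughly 20% of all the tokens that make up the book.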
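The 20% figure is the count for length 3 divided by the total number of tokens. The following sketch, which uses collections.Counter on a made-up token list as a stand-in for FreqDist, shows the same calculation in miniature and then repeats it with the two numbers printed above.

```python
from collections import Counter

# Toy token list (an assumption, not taken from text1)
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago', '--', 'never', 'mind']

length_counts = Counter(len(w) for w in tokens)  # length -> number of tokens
total = sum(length_counts.values())

most_common_length, count = length_counts.most_common(1)[0]
freq = count / total  # relative frequency of the most common length

print(most_common_length, count, freq)

# Same arithmetic with the counts reported above for text1
print(round(50223 / 260819, 4))  # 0.1926, i.e. roughly 20%
```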
(2) [w for w in text if condition]
# Words ending in "ableness"
print(sorted(w for w in set(text1) if w.endswith('ableness')))
# Words containing "gnt"
print(sorted(term for term in set(text4) if 'gnt' in term))
# Titlecase words (an initial capital followed by lowercase letters)
print(sorted(item for item in set(text6) if item.istitle()))
# Tokens made up entirely of digits
print(sorted(item for item in set(sent7) if item.isdigit()))
# Tokens that are not entirely lowercase
print(sorted(w for w in set(sent7) if not w.islower()))
# Convert every token to uppercase
print([w.upper() for w in text1])
# Keep only the alphabetic tokens in text1, lowercase them, remove duplicates, and count what is left
print(len(set(word.lower() for word in text1 if word.isalpha())))
Output
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
['Sovereignty', 'sovereignties', 'sovereignty']
['A', 'Aaaaaaaaah', ... , 'Woa', 'Wood', 'Would', 'Y', 'Yapping', 'Yay', 'Yeaaah', 'Yeaah', 'Yeah', 'Yes', 'You', 'Your', 'Yup', 'Zoot']
['29', '61']
[',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', '(', 'SUPPLIED', 'BY', 'SHARP', 'BLEAK', 'CORNER', ',', 'WHERE', ... , 'WILD', 'OATS', 'IN', 'ALL', 'FOUR', 'OCEANS', '.', 'THEY', 'HAD', 'MADE', 'A', 'HARP
(3) Conditionals and Loops
Example 1:
for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')
Output
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
Example 2:
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
for word in tricky:
    # end=' ' keeps the words on one line instead of printing one per line
    print(word, end=' ')
Output
ancient ceiling conceit conceited conceive conscience conscientious conscientiously deceitful deceive deceived deceiving deficiencies deficiency deficient delicacies excellencies fancied insufficiency insufficient legacies perceive perceived perceiving prescience prophecies receipt receive received receiving society species sufficient
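III. Understanding Natural Language
Key points: information extraction, inference, and summarization
IV. Exercises
1. What is the difference between the following two lines? Which one gives the larger value? Is the same true of other texts?
sorted(set([w.lower() for w in text1]))
sorted([w.lower() for w in set(text1)])
The second is larger. The first lowercases every token and then applies set(), so identical items are kept only once. The second applies set() first, which keeps forms of the same word that differ only in case as separate items; lowercasing them afterwards produces duplicate lowercase entries. The same holds for any text that contains a word in more than one capitalization; otherwise the two results are equal.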
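The difference is easy to see on a tiny example. This sketch uses a made-up token list in which two words appear in two capitalizations each; the variable names are illustrative.

```python
# Toy token list: 'the' and 'whale' each occur in two capitalizations
tokens = ['The', 'the', 'Whale', 'whale', 'sea']

# Lowercase first, then deduplicate: each word appears once
lower_then_set = sorted(set(w.lower() for w in tokens))

# Deduplicate first, then lowercase: case variants become duplicates
set_then_lower = sorted(w.lower() for w in set(tokens))

print(lower_then_set)   # ['sea', 'the', 'whale']
print(set_then_lower)   # ['sea', 'the', 'the', 'whale', 'whale']
```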
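2. What is the difference between the tests w.isupper() and not w.islower()?
w.isupper(): true only when w contains at least one cased letter and all of its cased letters are uppercase.
not w.islower(): true whenever w is not entirely lowercase, which includes mixed-case words as well as tokens with no letters at all, such as numbers and punctuation.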
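Running both tests over a handful of sample tokens makes the distinction visible. The token list below is made up for illustration, loosely echoing the kinds of tokens seen in sent7.

```python
samples = ['WHALE', 'Pierre', 'whale', '29', ',', 'Nov.']

# Tokens whose cased letters are all uppercase
all_upper = [w for w in samples if w.isupper()]

# Tokens that are not entirely lowercase: mixed case, digits, punctuation, and uppercase
not_lower = [w for w in samples if not w.islower()]

print(all_upper)   # ['WHALE']
print(not_lower)   # ['WHALE', 'Pierre', '29', ',', 'Nov.']
```

Every token that passes isupper() also passes not islower(), but not the other way round.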
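3. Write a slice expression that extracts the last two words of text2.
text2[-2:]
# ['THE', 'END']
4. Find all the four-letter words in the Chat Corpus (text5). Using a frequency distribution (FreqDist), show these words in decreasing order of frequency.
fdist = FreqDist([w for w in text5 if len(w)==4])
print(fdist.most_common())
Output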
[('JOIN', 1021), ('PART', 1016), ('that', 274), ('what', 183), ('here', 181), ('....', 170), ('have', 164), ('like', 156), ('with', 152), ('chat', 142), ('your', 137), ('good', 130), ('just', 125), ('lmao', 107), ..., ('brwn', 1), ('hurr', 1), ('Were', 1)]
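5. Write expressions that find all the words in text6 meeting each of the conditions below. The result should be in the form of a list of words: ['word1', 'word2', ...].
- Ending in ize
- Containing the letter z
- Containing the sequence of letters pt
- Having all lowercase letters except for an initial capital (i.e., titlecase)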
print([w for w in text6 if w.endswith('ize')])
print([w for w in text6 if 'z' in w])
print([w for w in text6 if 'pt' in w])
print([w for w in text6 if w.istitle()])