A sentence is made up of various kinds of words: nouns, verbs, adjectives, adverbs, and so on. To understand a sentence, we first need to identify these word classes. Classifying words by their parts of speech (POS) and labeling them accordingly is called POS tagging.
POS tagging is done with a part-of-speech tagger. The code is as follows:
import nltk

text = nltk.word_tokenize("customer found there are abnormal issue")
print(nltk.pos_tag(text))
This raises an error, because the POS tagger resource cannot be found:
LookupError:
**********************************************************************
Resource averaged_perceptron_tagger not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
Searched in:
- '/home/zhf/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'
**********************************************************************
Run nltk.download to fetch the resource, then copy the files into one of the search paths listed in the error message above:
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
True
Also download the corresponding help documentation for the tag sets:
>>> nltk.download('tagsets')
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data] Unzipping help/tagsets.zip.
True
Output:
[('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'), ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]
Here we get each word together with its part of speech. The table below is a simplified POS tag set:
Tag | Meaning | Examples |
ADJ | adjective | new, good, high, special, big, local |
ADV | adverb | really, already, still, early, now |
CNJ | conjunction | and, or, but, if, while, although |
DET | determiner | the, a, some, most, every, no |
EX | existential there | there, there’s |
FW | foreign word | dolce, ersatz, esprit, quo, maitre |
MOD | modal verb | will, can, would, may, must, should |
N | noun | year, home, costs, time, education |
NP | proper noun | Alison, Africa, April, Washington |
NUM | number | twenty-four, fourth, 1991, 14:24 |
PRO | pronoun | he, their, her, its, my, I, us |
P | preposition | on, of, at, with, by, into, under |
TO | the word to | to |
UH | interjection | ah, bang, ha, whee, hmpf, oops |
V | verb | is, has, get, do, make, see, run |
VD | past tense | said, took, told, made, asked |
VG | present participle | making, going, playing, working |
VN | past participle | given, taken, begun, sung |
WH | wh determiner | who, which, when, what, where, how |
If the input consists of individual word/tag strings, the str2tuple method can parse each token into a (word, tag) tuple. Usage:
[nltk.tag.str2tuple(t) for t in "customer/NN found/VBD there/EX are/VBP abnormal/JJ issue/NN".split()]
Output:
[('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'), ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]
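NLTK also provides the inverse helper, tuple2str, which joins a (word, tag) pair back into a word/TAG string. A minimal round-trip sketch, using a shortened sample sentence:

```python
import nltk

# str2tuple splits a "word/TAG" string into a (word, tag) tuple.
tagged = [nltk.tag.str2tuple(t) for t in "customer/NN found/VBD issue/NN".split()]
print(tagged)  # [('customer', 'NN'), ('found', 'VBD'), ('issue', 'NN')]

# tuple2str is the inverse: it joins each pair back into "word/TAG".
print(" ".join(nltk.tag.tuple2str(t) for t in tagged))  # customer/NN found/VBD issue/NN
```

Neither helper needs any downloaded resources, so this runs on a bare NLTK install.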
The corpora that ship with NLTK also come with their words already tagged:
nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
With FreqDist and the plotting utilities, we can draw a frequency distribution of the tags, which makes the sentence structure easier to observe:
brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.plot(50, cumulative=True)
The result is a cumulative plot of the 50 most frequent tags.
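If no plotting backend is available, the same distribution can be inspected as plain text with most_common or tabulate. A sketch using a small hand-made tagged list as a stand-in for the Brown corpus, so it runs without any downloads:

```python
import nltk

# A hand-made tagged sample standing in for brown_news_tagged,
# so this sketch runs without downloading the Brown corpus.
tagged = [('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'),
          ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]

tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
print(tag_fd.most_common())  # tags sorted by frequency, most frequent first
tag_fd.tabulate()            # the same counts as an aligned text table
```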
Suppose we are learning a word and want to see how it is used in text, for example which words follow it. Say we want to see what words come after 'often':
brown_learned_text = nltk.corpus.brown.words(categories='learned')
ret = sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
print(ret)
Here we use the bigrams method, which generates pairs of consecutive words.
For example, the following text produces these bigrams:
for word in nltk.bigrams("customer found there are abnormal issue".split()):
    print(word)
Output:
('customer', 'found')
('found', 'there')
('there', 'are')
('are', 'abnormal')
('abnormal', 'issue')
Looking only at which words follow is not enough; we also want to know what parts of speech those words are:
1. First POS-tag the text, forming (word, tag) pairs.
2. Then use bigrams on the tagged tokens, keep the pairs whose first word is 'often', and collect the tag of the second token.
brown_learned_text = nltk.corpus.brown.tagged_words(categories='learned')
tags = [b[1] for (a, b) in nltk.bigrams(brown_learned_text) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
Output:
VBN VB VBD JJ IN QL , CS RB AP VBG RP VBZ QLP BEN WRB . TO HV
15 10 8 5 4 3 3 3 3 1 1 1 1 1 1 1 1 1 1
Similarly, if we want three-word sequences, we can use the trigrams method.
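The usage mirrors bigrams; a minimal sketch on the same sample sentence:

```python
import nltk

# trigrams yields every run of three consecutive tokens.
tokens = "customer found there are abnormal issue".split()
for tri in nltk.trigrams(tokens):
    print(tri)  # ('customer', 'found', 'there'), then ('found', 'there', 'are'), ...
```

A six-word sentence yields four trigrams, one per starting position.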