Installing and Using NLTK: Printing the Part of Speech of Each Word

guog算法笔记 · 2024-09-09 16:19:25 · Category: Python_问题解决办法

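© Copyright belongs to the author. This is an original work by guog算法笔记, published on the 51CTO blog. Contact the author for permission before reposting; unauthorized reposting may be pursued legally.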

1. Installation
2. Examples
3. POS tags and their meanings

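NLTK stands for "Natural Language Toolkit". It is a Python library for natural language processing (NLP) that provides a wide range of tools and resources for processing, analyzing, and manipulating human language data.

NLTK is an open-source project intended to support NLP research and development. Its features include text processing, tokenization, part-of-speech (POS) tagging, syntactic parsing, semantic analysis, corpus management, and machine-learning utilities. It also gives access to a large collection of corpora, lexicons, and language datasets that can be used to train and evaluate NLP models.

With NLTK, developers can process and analyze text to build NLP applications such as text classification, information extraction, machine translation, and question answering. It is also widely used in academia and education for teaching and researching NLP.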

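1. Installation

Recommended tutorial 1:
NLTK库安装教程（详细版） (NLTK library installation tutorial, detailed version)

Recommended tutorial 2:
NLTK安装方法 (How to install NLTK)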

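In short, install the package with pip:

pip install nltk

Then download the tagger model from within a script:

import nltk
nltk.download('averaged_perceptron_tagger')

Note: recent NLTK releases (3.9 and later) look up this model under the name averaged_perceptron_tagger_eng. If pos_tag raises a LookupError, download the resource named in the error message.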
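Before downloading model data, it can help to confirm that the package itself is importable. The following is a minimal sketch of one way to do that. The helper names `is_installed` and `ensure_tagger` are hypothetical and chosen for this illustration; they are not part of NLTK. The sketch also assumes that `nltk.download` returns a truthy value on success.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


def ensure_tagger(resource: str = "averaged_perceptron_tagger") -> bool:
    """Download the tagger model if NLTK is available and report success."""
    if not is_installed("nltk"):
        print("nltk is not installed; run: pip install nltk")
        return False
    import nltk  # imported lazily so this file loads even without NLTK

    # Assumption: nltk.download returns a truthy value when the download works.
    return bool(nltk.download(resource, quiet=True))
```

Calling ensure_tagger() once at the top of a script needs network access the first time; pass a different resource name if your NLTK version asks for one.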

2. Examples

demo1: Extract all nouns

To identify the part of speech of every word in a list and keep only the nouns, you can use the POS tagger that ships with NLTK.

The following example tags a word list and then filters out the nouns:

# Nouns

import nltk
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

def filter_nouns(word_list):
    # Tag each word in the list with its part of speech
    tagged_words = pos_tag(word_list)

    # Keep the words whose tag starts with 'N' (NN, NNS, NNP, NNPS)
    nouns = [word for word, pos in tagged_words if pos.startswith('N')]

    return nouns

word_list = ['As', 'a', 'history', 'of', 'Custer', ',', 'this', "insn't", 'even', 'close', '(', 'Custer', 'dies', 'to', 'help', 'the', 'indians', '?', 'I']

nouns = filter_nouns(word_list)
print(nouns)

Output:

['history', 'Custer', "insn't", 'Custer', 'indians']

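POS tagging is not perfect and sometimes assigns the wrong tag. In the output above, for example, the misspelled token "insn't" (a typo for "isn't") was tagged as a noun. Check the results and post-process them if you need an accurate list of nouns.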
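One simple form of post-processing is to drop noun-tagged tokens that contain anything other than letters. The sketch below illustrates the idea; `filter_clean_nouns` is a hypothetical helper, and the (word, tag) pairs are hard-coded to match the tags the tagger reported for this sentence, so the snippet runs without NLTK.

```python
# (word, tag) pairs hard-coded from the tagger output for the example sentence.
TAGGED = [
    ("As", "IN"), ("a", "DT"), ("history", "NN"), ("of", "IN"),
    ("Custer", "NNP"), (",", ","), ("this", "DT"), ("insn't", "NN"),
    ("even", "RB"), ("close", "RB"), ("(", "("), ("Custer", "NNP"),
    ("dies", "VBZ"), ("to", "TO"), ("help", "VB"), ("the", "DT"),
    ("indians", "NNS"), ("?", "."), ("I", "PRP"),
]


def filter_clean_nouns(tagged_words):
    """Keep noun-tagged tokens that consist of letters only."""
    return [
        word
        for word, pos in tagged_words
        if pos.startswith("N") and word.isalpha()
    ]


print(filter_clean_nouns(TAGGED))
```

The letters-only rule removes "insn't", but it would also remove legitimate hyphenated nouns, so treat it as a starting point rather than a general fix.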

demo2: Get the part of speech of every word

# Part of speech of each word
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag

def tag_pos(words):
    tagged_words = pos_tag(words)
    return tagged_words

word_list = ['As', 'a', 'history', 'of', 'Custer', ',', 'this', "insn't", 'even', 'close', '(', 'Custer', 'dies', 'to', 'help', 'the', 'indians', '?', 'I']

tagged_words = tag_pos(word_list)
print(tagged_words)

Output:

[('As', 'IN'), ('a', 'DT'), ('history', 'NN'), ('of', 'IN'), ('Custer', 'NNP'), (',', ','), ('this', 'DT'), ("insn't", 'NN'), ('even', 'RB'), ('close', 'RB'), ('(', '('), ('Custer', 'NNP'), ('dies', 'VBZ'), ('to', 'TO'), ('help', 'VB'), ('the', 'DT'), ('indians', 'NNS'), ('?', '.'), ('I', 'PRP')]

The tags returned by NLTK's pos_tag function follow the Penn Treebank tag set, in which every noun tag begins with a capital "N" (NN, NNS, NNP, NNPS).
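To get a quick overview of which tags appear in a tagged sentence, the pairs can be tallied with collections.Counter. This is a minimal sketch; `count_tags` is a hypothetical helper, and the pairs are hard-coded from the demo2 output so that the snippet runs without NLTK.

```python
from collections import Counter

# (word, tag) pairs copied from the demo2 output.
TAGGED = [
    ("As", "IN"), ("a", "DT"), ("history", "NN"), ("of", "IN"),
    ("Custer", "NNP"), (",", ","), ("this", "DT"), ("insn't", "NN"),
    ("even", "RB"), ("close", "RB"), ("(", "("), ("Custer", "NNP"),
    ("dies", "VBZ"), ("to", "TO"), ("help", "VB"), ("the", "DT"),
    ("indians", "NNS"), ("?", "."), ("I", "PRP"),
]


def count_tags(tagged_words):
    """Count how many times each POS tag occurs."""
    return Counter(pos for _, pos in tagged_words)


counts = count_tags(TAGGED)

# Collect the distinct noun tags, i.e. those starting with "N".
noun_tags = sorted(tag for tag in counts if tag.startswith("N"))

print(counts)
print(noun_tags)
```

Note that punctuation receives tags too (for example "," and "."), so they show up in the counts alongside the word classes.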

demo3: Extract proper nouns

To keep only the proper nouns, check whether the tag returned by pos_tag is "NNP" or "NNPS".

# Proper nouns
import nltk
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

def filter_proper_nouns(words):
    tagged_words = pos_tag(words)
    # NNP = singular proper noun, NNPS = plural proper noun
    proper_nouns = [word for word, pos in tagged_words if pos in ('NNP', 'NNPS')]
    return proper_nouns

word_list = ['As', 'a', 'history', 'of', 'Custer', ',', 'this', "isn't", 'even', 'close', '(', 'Custer', 'dies', 'to', 'help', 'the', 'indians', '?', 'I']

proper_nouns = filter_proper_nouns(word_list)
print(proper_nouns)

Output:

['Custer', 'Custer']

In the code above, pos in ('NNP', 'NNPS') keeps the words tagged NNP, a singular proper noun such as the name of a person or place, or NNPS, a plural proper noun. A prefix test such as pos.startswith('NNP') would give the same result, because "NNPS" itself starts with "NNP"; a separate startswith('NNPS') check is therefore unnecessary.
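The same tag test can separate common nouns from proper nouns in a single pass. The sketch below is an illustration only; `split_nouns` and the two tag-set constants are hypothetical names, and the input is a subset of the (word, tag) pairs from the demo2 output, hard-coded so the snippet runs without NLTK.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
COMMON_NOUN_TAGS = {"NN", "NNS"}

# Subset of the (word, tag) pairs from the demo2 output.
TAGGED = [
    ("history", "NN"), ("Custer", "NNP"), ("Custer", "NNP"),
    ("dies", "VBZ"), ("the", "DT"), ("indians", "NNS"),
]


def split_nouns(tagged_words):
    """Return (common_nouns, proper_nouns) from a list of (word, tag) pairs."""
    common, proper = [], []
    for word, pos in tagged_words:
        if pos in PROPER_NOUN_TAGS:
            proper.append(word)
        elif pos in COMMON_NOUN_TAGS:
            common.append(word)
    return common, proper


common_nouns, proper_nouns = split_nouns(TAGGED)
print(common_nouns)
print(proper_nouns)
```

Testing membership in an explicit set keeps the two groups from overlapping, which a prefix test on "N" alone would not do.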

3. POS tags and their meanings

In all three demos above, the tags come from pos_tag(words).

pos_tag(words) returns a list of (word, tag) tuples, one tuple per input word.
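Because the result is an ordinary list of two-element tuples, standard Python tools are enough to reshape it. The following sketch shows two common moves: splitting the pairs into parallel sequences with zip, and grouping words by tag. `group_by_tag` is a hypothetical helper, and the pairs are the first five from the demo2 output, hard-coded so the snippet runs without NLTK.

```python
from collections import defaultdict

# First five (word, tag) pairs from the demo2 output.
TAGGED = [
    ("As", "IN"), ("a", "DT"), ("history", "NN"),
    ("of", "IN"), ("Custer", "NNP"),
]


def group_by_tag(tagged_words):
    """Map each tag to the list of words that received it, in input order."""
    groups = defaultdict(list)
    for word, pos in tagged_words:
        groups[pos].append(word)
    return dict(groups)


# zip(*pairs) transposes the list into one tuple of words and one of tags.
words, tags = zip(*TAGGED)
groups = group_by_tag(TAGGED)

print(words)
print(tags)
print(groups)
```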

For its tags, NLTK uses the Penn Treebank tag set. Common tags and their meanings are listed below:

CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
SYM: Symbol
TO: to
UH: Interjection
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WDT: Wh-determiner
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb

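These tags tell you the part of speech of each word. Tag sets differ between corpora and tasks, so the tags you see may differ when you use another tagger or corpus.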
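The tag list above can be kept in a small dictionary to make tagged output easier to read. The sketch below covers only a handful of tags for brevity; `TAG_MEANINGS` and `describe` are hypothetical names, and tags missing from the dictionary (including punctuation tags) fall back to a placeholder string.

```python
# A few entries from the tag list above; extend as needed.
TAG_MEANINGS = {
    "DT": "Determiner",
    "IN": "Preposition or subordinating conjunction",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "PRP": "Personal pronoun",
    "RB": "Adverb",
    "VB": "Verb, base form",
    "VBZ": "Verb, 3rd person singular present",
}


def describe(tagged_words, meanings=TAG_MEANINGS):
    """Attach a human-readable meaning to each (word, tag) pair."""
    return [
        (word, pos, meanings.get(pos, "unlisted tag"))
        for word, pos in tagged_words
    ]


for word, pos, meaning in describe([("Custer", "NNP"), ("dies", "VBZ"), ("?", ".")]):
    print(f"{word}\t{pos}\t{meaning}")
```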