How to generate paragraph summaries with NLP: sentence segmentation and text preprocessing with NLTK

Reposted

冷月星 2024-01-30 21:27:27


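NLTK (www.nltk.org) is one of the packages you will meet most often when working with corpora, classifying text, or analyzing linguistic structure. It provides comprehensive, easy-to-use interfaces to a large collection of public datasets and models, and covers tokenization, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP functionality.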

1. Tokenization

(1) Sentence segmentation

(2) Word tokenization

2. Processing tokens

(1) Removing punctuation

(2) Removing stop words

3. Lexicon normalization

(1) Lemmatization

(2) Stemming

4. Part-of-speech tagging

5. Finding synonyms

Overview of NLTK modules and their functions:

[Figure: NLTK modules and their functions]

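1. Tokenization

Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step of text analysis: it breaks a passage into smaller units such as sentences or words. Each unit is called a token, so a token can be a word within a sentence or a sentence within a paragraph. NLTK supports both sentence segmentation and word tokenization.

(1) Sentence segmentation

Sentence segmentation splits a paragraph into sentences: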

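from nltk.tokenize import sent_tokenize
# Requires the Punkt sentence tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead).

# Adjacent string literals are joined, so the text contains no line break.
text = ("Hello Mr. Smith, how are you doing today? The weather is great, and "
        "city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard")

tokenized_text = sent_tokenize(text)

print(tokenized_text)


'''
Result:
  ['Hello Mr. Smith, how are you doing today?', 
   'The weather is great, and city is awesome.The sky is pinkish-blue.', 
   "You shouldn't eat cardboard"]
Note: "awesome.The" is not split because no whitespace follows the period.
'''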
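
To see why a trained sentence tokenizer helps, it can be compared with a naive rule. The sketch below is not how `sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper that simply splits after `.`, `?`, or `!` when whitespace follows, and it wrongly breaks the first sentence after the abbreviation `Mr.`.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' when followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

text = ("Hello Mr. Smith, how are you doing today? The weather is great, and "
        "city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard")

sentences = naive_sent_split(text)
for s in sentences:
    print(s)
# Hello Mr.
# Smith, how are you doing today?
# The weather is great, and city is awesome.The sky is pinkish-blue.
# You shouldn't eat cardboard
```

The naive rule produces four pieces instead of three, because it has no notion of abbreviations.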

(2) Word tokenization

Word tokenization splits a sentence into words:

import nltk

sent = "I am almost dead this time"

token = nltk.word_tokenize(sent)

print(token)

# Result: ['I', 'am', 'almost', 'dead', 'this', 'time']

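2. Processing tokens

Processing the tokens involves removing punctuation, removing stop words, and normalizing the vocabulary.

(1) Removing punctuation

Apply one of the approaches below to strip punctuation. string.punctuation contains all ASCII punctuation characters. The first approach builds a translation table that replaces each of them with a space; the second filters out tokens that are themselves punctuation marks.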

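# Approach 1: replace every punctuation character in a string with a space
import string

s = 'abc.'
s = s.translate(str.maketrans(string.punctuation, " "*len(string.punctuation)))  # 'abc ' (the period becomes a space)


# Approach 2: drop tokens that are punctuation marks
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
text_list = ['Hello', ',', 'world', '!']  # example token list
text_list = [word for word in text_list if word not in english_punctuations]  # ['Hello', 'world']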
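
If the goal is to delete punctuation rather than replace it with spaces, the three-argument form of `str.maketrans` can be used. The following is a minimal sketch; `strip_punctuation` is a hypothetical helper name, and the token list is made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens that become empty."""
    cleaned = (tok.translate(_DELETE_PUNCT) for tok in tokens)
    return [tok for tok in cleaned if tok]

tokens = ["Hello", "Mr.", "Smith", ",", "how", "are", "you", "?"]
result = strip_punctuation(tokens)
print(result)
# ['Hello', 'Mr', 'Smith', 'how', 'are', 'you']
```

Note that this also removes apostrophes, so a token such as "shouldn't" would become "shouldnt"; whether that is acceptable depends on the task.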

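(2) Removing stop words

Stop words are the noise words in a text: they are very frequent but carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpora include a stop word list, which you can use to filter stop words out of the token list.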

import nltk

nltk.download('stopwords')
# Downloading package stopwords to C:\Users\Administrator\AppData\Roaming\nltk_data...Unzipping corpora\stopwords.zip.

from nltk.corpus import stopwords
stop_words = stopwords.words("english")

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome."""

word_tokens = nltk.tokenize.word_tokenize(text.strip())
# The comparison is case-sensitive, so the capitalized 'The' is kept.
filtered_word = [w for w in word_tokens if w not in stop_words]


'''
word_tokens: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']
filtered_word: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.']
'''
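
In the result above, 'The' survives because the stop word list is lowercase and the membership test is case-sensitive. One possible fix is to lowercase each token before the lookup. The sketch below uses a tiny hand-picked stop word set so that it runs without any downloads; with NLTK you would pass `set(stopwords.words("english"))` instead. `remove_stop_words` is a hypothetical helper name.

```python
# A tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"how", "are", "you", "doing", "the", "is", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

word_tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
               'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']

filtered = remove_stop_words(word_tokens)
print(filtered)
# ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'weather', 'great', ',', 'city', 'awesome', '.']
```

Using a set rather than a list also makes each lookup constant-time, which matters on large token lists.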

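3. Lexicon normalization

Lexicon normalization reduces the various derived forms of a word to a common base form. NLTK offers two approaches for this: the WordNet lemmatizer and the Porter stemmer.

(1) Lemmatization

Lemmatization maps a word to its dictionary form (lemma), so the result is a real word.

(2) Stemming

Stemming strips affixes by rule, so the result may not be a real word.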

# The lemmatizer needs the WordNet corpus: nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer  # or: from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()  # lemmatization

from nltk.stem.porter import PorterStemmer   # or: from nltk.stem import PorterStemmer
stem = PorterStemmer()   # stemming

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, "v"))  # "v" tells the lemmatizer the word is a verb
print("Stemmed Word:", stem.stem(word))

'''
Lemmatized Word: fly
Stemmed Word: fli
'''
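
To illustrate why a stem may not be a real word, here is a toy suffix stripper. It is not the Porter algorithm, which has many more rules and conditions; the rule list and the name `toy_stem` are made up for this sketch. The point is only that rule-based stripping works on characters, with no dictionary lookup.

```python
# Toy suffix rules, checked in order; NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least 2 characters before the suffix."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["flying", "studies", "walked", "cats", "is"]:
    print(w, "->", toy_stem(w))
# flying -> fly
# studies -> studi
# walked -> walk
# cats -> cat
# is -> is
```

"studi" is not an English word, yet it still groups "studies" with other forms reduced by the same rule, which is often all a search or counting task needs.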

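4. Part-of-speech tagging

The main goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word, such as noun, verb, or adjective. The tagger looks at the word's context within the sentence and assigns the corresponding tag.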

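import nltk
# pos_tag needs the averaged perceptron tagger data; the exact resource name
# to pass to nltk.download() depends on the NLTK version.

sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)

tags = nltk.pos_tag(tokens)
print(tags)

'''
[('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), 
('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'), ('in', 'IN'), ('1879', 'CD'), ('.', '.')]
'''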
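
The tags above are Penn Treebank tags, while the WordNet lemmatizer in section 3 expects a WordNet part-of-speech letter ("n", "v", "a", or "r"). A common way to connect the two is a small mapping function. The sketch below is one possible version; `penn_to_wordnet` is a hypothetical helper, and it works on the tagged list shown above so that it runs without NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("NN"):
        return "n"   # noun
    if tag.startswith("RB"):
        return "r"   # adverb
    return None      # prepositions, numbers, punctuation, ...

tags = [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'),
        ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'),
        ('in', 'IN'), ('1879', 'CD'), ('.', '.')]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tags]
print(mapped)
```

Tokens that map to None can be passed to the lemmatizer without a POS argument, or left unchanged.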

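5. Finding synonyms

Use synsets() to look up the synonym sets (synsets) of a word; its pos parameter restricts the lookup to a given part of speech. WordNet is a semantically oriented English dictionary, similar to a traditional dictionary, and is available as part of the NLTK corpora.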

import nltk
nltk.download('wordnet')  # Downloading package wordnet to C:\Users\Administrator\AppData\Roaming\nltk_data...Unzipping corpora\wordnet.zip.

from nltk.corpus import wordnet

word = wordnet.synsets('spectacular')
print(word)
# [Synset('spectacular.n.01'), Synset('dramatic.s.02'), Synset('spectacular.s.02'), Synset('outstanding.s.02')]

# Print the dictionary definition of each synset
for synset in word:
    print(synset.definition())

'''
a lavishly produced performance
sensational in appearance or thrilling in effect
characteristic of spectacles or drama
having a quality that thrusts itself into attention
'''
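
Synsets group word senses, so getting a flat list of synonyms takes one more step: collecting the lemma names of each synset. The sketch below is one way to do that; `collect_lemma_names` and `synonyms_for` are hypothetical helper names, and the NLTK import is placed inside the function so that the first helper can be used on its own.

```python
def collect_lemma_names(synsets):
    """Flatten synsets into a de-duplicated, order-preserving list of lemma names."""
    seen = []
    for syn in synsets:
        for name in syn.lemma_names():
            readable = name.replace("_", " ")  # WordNet joins multi-word lemmas with '_'
            if readable not in seen:
                seen.append(readable)
    return seen

def synonyms_for(word, pos=None):
    """Look up `word` in WordNet (requires nltk and the 'wordnet' corpus)."""
    from nltk.corpus import wordnet  # lazy import: only needed for the lookup
    return collect_lemma_names(wordnet.synsets(word, pos=pos))

# Example call (the output depends on the installed WordNet version):
# print(synonyms_for("spectacular"))
```

The returned list includes the query word itself whenever it is a lemma of one of its synsets; filter it out if only alternatives are wanted.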

