Extracting web page body text with Python

Reposted

mob6454cc6caa80 2023-05-18 11:28:14

Tags: python, crawler, nltk, html, html5. Categories: Python, backend development

Scraping website content with Python and preprocessing the text (English)

Note: long output is truncated with an ellipsis (...).
Fetching the page

'''
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)
'''

Output:

'''
b'\n\n\n\n \n \n\n

PHP: Hypertext Preprocessor\n\n \n \n <link rel="alternate" type="application/atom+xml" href="http://php.net/releases/feed.php" ...

'''
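The leading b in the output above shows that response.read() returns bytes, not str. A minimal sketch of decoding the body follows. decode_body is a hypothetical helper, the UTF-8 fallback is an assumption, and the sample bytes stand in for a real response so the sketch runs without network access.

```python
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode an HTTP response body to str (hypothetical helper)."""
    # Assumption: fall back to UTF-8 when no charset was declared.
    charset = declared_charset or "utf-8"
    # errors="replace" substitutes invalid bytes instead of raising.
    return raw.decode(charset, errors="replace")


# Stand-in for response.read(); no network access is needed here.
raw = b"<title>PHP: Hypertext Preprocessor</title>\n"
page = decode_body(raw)
print(page)
```

With a real response, response.headers.get_content_charset() returns the charset declared in the Content-Type header, or None when there is none, and can be passed as declared_charset.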

Converting to clean text

'''
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib package
text = soup.get_text(strip=True)

# text now holds the clean text

print(text)
'''
Output:
'''
PHP: Hypertext PreprocessorDownloadsDocumentationGet InvolvedHelpGetting StartedIntroductionA simple tutorialLanguage ReferenceBasic ......
'''
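The output above runs words together ("PreprocessorDownloadsDocumentationGet") because the text of adjacent elements is joined with nothing in between. BeautifulSoup's get_text takes a separator as its first argument, so soup.get_text(" ", strip=True) is one way to keep the words apart. The sketch below reproduces the effect with the standard library only, so it runs without extra packages. TextCollector and html_to_text are hypothetical names and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping <script> and <style> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and not self._skip_depth:
            self.chunks.append(stripped)


def html_to_text(html, separator=" "):
    """Join the stripped text nodes of an HTML string with `separator`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


sample = "<ul><li>Downloads</li><li>Documentation</li><li>Get Involved</li></ul>"
print(html_to_text(sample, separator=""))   # text nodes fused together
print(html_to_text(sample, separator=" "))  # one space between text nodes
```

Joining with a space matters for the next step, since splitting on whitespace can only separate words that have whitespace between them.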

Converting to tokens
'''
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib package
text = soup.get_text(strip=True)

# text now holds the clean text

# Convert the text to tokens

tokens = text.split()
print(tokens)
'''
Output:
'''
['PHP:', 'Hypertext', 'PreprocessorDownloadsDocumentationGet', 'InvolvedHelpGetting', 'StartedIntroductionA', 'simple', 'tutorialLanguage', 'ReferenceBasic',...
'''
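str.split() only splits on whitespace, so punctuation stays attached to words (for example 'PHP:') and differently cased forms count as different tokens. A minimal sketch of a stricter tokenizer using only the standard library follows. tokenize is a hypothetical helper, its regular expression is one possible choice, and the sample sentence is made up. NLTK's word_tokenize is another option.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "PHP: Hypertext Preprocessor. A simple tutorial, isn't it?"
tokens = tokenize(sample)
print(tokens)
```

Lowercasing first also makes later comparisons against a lowercase stopword list behave as expected.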

Full version: scraping text with Python plus tokenization and preprocessing (English)

'''
import urllib.request

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Fetch the stopword corpus that stopwords.words() needs
# (nltk.download() with no arguments opens the interactive downloader instead)
nltk.download('stopwords')

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib package
text = soup.get_text(strip=True)

# text now holds the clean text

# Convert the text to tokens
tokens = text.split()

# Count token frequencies
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

# Plot the 20 most frequent tokens (requires matplotlib)
freq.plot(20, cumulative=False)

# Remove stopwords
clean_tokens = list()
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)

# Count frequencies again on the cleaned tokens
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

# Plot the cleaned frequencies
freq.plot(20, cumulative=False)

'''
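NLTK's English stopword list is all lowercase, so a capitalized token such as "The" passes the `token not in sr` check in the script above. The sketch below lowercases tokens before comparing and counts with collections.Counter in place of nltk.FreqDist, so it runs without NLTK. top_words is a hypothetical helper, STOPWORDS is a tiny stand-in for stopwords.words('english'), and the token list is made up.

```python
from collections import Counter

# Tiny stand-in for stopwords.words('english'); the real list is much longer.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}


def top_words(tokens, stopwords, n=3):
    """Return the n most common lowercased tokens that are not stopwords."""
    stop = set(stopwords)  # set lookup avoids scanning a list for every token
    kept = [t.lower() for t in tokens if t.lower() not in stop]
    return Counter(kept).most_common(n)


tokens = ["The", "PHP", "manual", "is", "the", "home", "of",
          "PHP", "and", "the", "PHP", "manual"]
print(top_words(tokens, STOPWORDS))
```

The same lowercasing can be applied in the full script by comparing token.lower() against sr.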

