Python crawler, clicking a located <a> tag: scraping a specific <a> tag from a web page with Python

Reposted

mob64ca14031c97 2024-08-08 16:27:43

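Keywords: requests, BeautifulSoup, jieba, wordcloud

Overall approach: fetch the HTML with requests, parse it with BeautifulSoup to extract the key data, segment the text with jieba and remove stop words, and finally draw a word cloud with wordcloud.

1. Requesting the Hupu ACG board

The page URLs show that requesting multiple pages is simple: take the first page's URL as the base URL and append a suffix such as -2 or -3 for each later page. We use the requests library to send the requests.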

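import requests

base_url = r'https://bbs.hupu.com/acg'
add_url = ''
content_str = ''
# Try to request pages 1 through 14 (range(1, 15) stops before 15)
for i in range(1, 15):
    if i != 1:
        # Pages after the first use a "-<page number>" suffix
        add_url = r'-{}'.format(i)
    else:
        add_url = ''
    url = base_url + add_url
    response = requests.get(url)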
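
The URL pattern above can be isolated into a small helper, which makes it easy to check the generated addresses before sending any requests. This is only a minimal sketch: the function name build_page_urls and its last_page parameter are illustrative assumptions, not part of the original code.

```python
def build_page_urls(base_url, last_page):
    """Return the URLs for pages 1..last_page (inclusive).

    Hypothetical helper: page 1 is the base URL itself, and every later
    page appends "-<page number>", as in the loop above.
    """
    urls = []
    for page in range(1, last_page + 1):
        suffix = '' if page == 1 else '-{}'.format(page)
        urls.append(base_url + suffix)
    return urls


urls = build_page_urls('https://bbs.hupu.com/acg', 3)
for url in urls:
    print(url)
```

Each URL can then be passed to requests.get(url). Supplying a timeout, for example requests.get(url, timeout=10), is a reasonable precaution so that a stalled connection does not hang the loop.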

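2. Parsing with BeautifulSoup

Open the browser's developer console, inspect the page source, and locate the tags that hold the data we need. We want the title of each post. The page source shows that a post's title sits inside an <a> tag with class="truetit". With these two pieces of information, BeautifulSoup can extract the titles.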

[Figure: page source showing a post title inside an <a> tag with class="truetit"]

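import requests
from bs4 import BeautifulSoup

base_url = r'https://bbs.hupu.com/acg'
add_url = ''
content_str = ''
for i in range(1, 15):
    if i != 1:
        add_url = r'-{}'.format(i)
    else:
        add_url = ''
    url = base_url + add_url
    response = requests.get(url)
    # Parse the HTML with BeautifulSoup (the "lxml" parser must be installed)
    soup = BeautifulSoup(response.text, "lxml")
    # Find every <a> tag with class "truetit"
    all_title = soup.find_all("a", class_="truetit")
    for title in all_title:
        # Append the text of each title to one long string
        content_str += title.text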
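
Note that

all_title = soup.find_all("a", class_="truetit")

reads every title on the current page and returns them as a list whose elements are <a> tags. Loop over this list with for and read title.text on each element to get the post's title.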
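
To see what find_all returns without touching the network, the same call can be run on a small inline HTML fragment. This is a minimal sketch under stated assumptions: the fragment is made up for illustration (the real page markup may differ), and Python's built-in "html.parser" is used instead of lxml so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating a list of forum posts
html = '''
<ul>
  <li><a class="truetit" href="/1.html">First post</a></li>
  <li><a class="other" href="/2.html">Not a title</a></li>
  <li><a class="truetit" href="/3.html">Second post</a></li>
</ul>
'''

soup = BeautifulSoup(html, "html.parser")

# Only <a> tags whose class is "truetit" are returned
all_title = soup.find_all("a", class_="truetit")

# Each element is a tag object: .text gives its text, ["href"] its link
titles = [title.text for title in all_title]
links = [title["href"] for title in all_title]

print(titles)
print(links)
```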

Print the result to check whether we got what we wanted:

[Figure: printed output of the scraped post titles]

The output shows that we now have the titles we wanted. The next step is data processing (jieba segmentation plus stop-word removal).

3. jieba segmentation and stop-word removal

First, write a function that builds the stop-word list.

# Load the stop-word list, one word per line
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

Next, segment the text with jieba. The stop words come from the HIT (Harbin Institute of Technology) stop-word list.
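
import jieba

# lcut returns the segmented words as a list
s_list = jieba.lcut(content_str)
out_list = []
# Load the stop-word list
stopwords = stopwordslist(r'E:\stopwords-master\哈工大停用词表.txt')
for word in s_list:
    if word not in stopwords:
        if word != '\t':
            out_list.append(word)
# Join the remaining words with spaces for wordcloud
out_str = " ".join(out_list)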
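
The filtering step can also be written with a set, since word not in stopwords on a list scans the whole list for every token, while a set lookup takes constant time on average. The following is a minimal sketch, not the original method: remove_stopwords is a hypothetical helper name, the sample tokens are made up, and unlike the loop above it drops every whitespace-only token rather than only '\t'.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are neither stop words nor pure whitespace."""
    stopword_set = set(stopwords)  # set membership is O(1) on average
    return [t for t in tokens if t.strip() and t not in stopword_set]


# Made-up tokens standing in for the output of jieba.lcut(...)
tokens = ['我', '喜欢', '的', '动漫', '\t', ' ', '是', '火影']
stopwords = ['我', '的', '是']

out_list = remove_stopwords(tokens, stopwords)
out_str = ' '.join(out_list)
print(out_str)
```

With a stop-word list of a few thousand entries and many scraped titles, the set version avoids rescanning the list for each word.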

At this point we have the keywords produced by segmentation. The next step is to draw the word cloud.
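
Before drawing, it can help to check which keywords dominate, since the word cloud sizes words by how often they occur. A minimal sketch using collections.Counter from the standard library; the out_list contents here are made-up sample data, not real scraping results.

```python
from collections import Counter

# Made-up keyword list standing in for the filtered jieba output
out_list = ['火影', '动漫', '火影', '海贼王', '动漫', '火影']

counts = Counter(out_list)

# The two most frequent keywords with their counts
top = counts.most_common(2)
print(top)
```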
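
4. Drawing the word cloud

Import wordcloud. font_path is the path to a font file; without a font that supports Chinese, the image may show only empty boxes (suitable font files can be downloaded online). mask is the background image that gives the cloud its shape. generate() takes data of type string.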

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Background image used as the mask
alice_mask = plt.imread(r'D:\壁纸\huge.jpg')
# generate() takes a string
word_cloud = WordCloud(font_path='msyh.ttc', mask=alice_mask, background_color='white',
                       max_words=400, max_font_size=80).generate(out_str)
plt.figure(figsize=(15, 9))
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis('off')
plt.show()
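
In wordcloud, pure white mask pixels (value 255) are treated as masked out, and words are drawn only on the remaining area. If no suitable picture is at hand, a mask can therefore be built directly as an array. A minimal sketch, assuming NumPy is installed; circle_mask is a hypothetical helper name and the size of 200 is arbitrary.

```python
import numpy as np


def circle_mask(size):
    """Build a size x size mask: 0 inside a centered circle, 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2
    radius = size / 2
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[outside] = 255  # white pixels are excluded from drawing
    return mask


mask = circle_mask(200)
print(mask.shape, mask[0, 0], mask[100, 100])
```

The resulting array could be passed as mask=mask in place of alice_mask to get a round word cloud.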

5. Results

Without the mask parameter:

[Figure: word cloud generated without the mask parameter]