Python Movie Rating Prediction, Python Movie Rating Recommendation

Reposted

mob64ca13f772f3 2023-11-23 12:32:31

Tags: python movie rating prediction, python, json, crawler, request. Category: Python, backend development

Basic crawler approach

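1. First, send a request and get back a Response object from requests. It is best to imitate the request headers of Google Chrome (the headers in the code below) and to wait for an interval between requests, so that the site's anti-crawling mechanism is less likely to be triggered (put simply, imitate how a human browses).
2. Once you have the Response object, parse it with BeautifulSoup ("beautiful soup"?? no idea why it got that name). Note that you pass in response.text, that is, only the text. Printing the parsed soup shows the entire HTML as text, but its type is not str; it is a bs4.BeautifulSoup object. That is the charm of this beautiful soup: in Python you can drill down layer by layer to the div you want, much like using CSS selectors.
3. Search the soup object for the content you need: go down div by div until you reach the element with the right attributes, then take the content you want. (Inspect the page's HTML, or print the soup object from the previous step.)
4. Print the result or save it to a file.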
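
Steps 2 and 3 can be sketched offline on a small HTML string. This is only an illustration under assumptions: the class names `comment-info` and `short` are the ones used in the full code further down, while the wrapping `div.comment-item` and the sample users and comments are hypothetical placeholders, not real page content.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; only 'comment-info' and 'short' come from the article.
SAMPLE_HTML = """
<div class="comment-item">
  <span class="comment-info"><a href="#">alice</a></span>
  <span class="short">Great film.</span>
</div>
<div class="comment-item">
  <span class="comment-info"><a href="#">bob</a></span>
  <span class="short">A bit slow.</span>
</div>
"""

def parse_comments(html):
    """Return {user name: comment text} found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    comments = {}
    # select() takes a CSS selector and returns every matching element
    for item in soup.select("div.comment-item"):
        user = item.select_one("span.comment-info a")
        text = item.select_one("span.short")
        if user is not None and text is not None:
            comments[user.get_text(strip=True)] = text.get_text(strip=True)
    return comments

soup_type = type(BeautifulSoup(SAMPLE_HTML, "html.parser")).__name__
result = parse_comments(SAMPLE_HTML)
print(soup_type)
print(result)
```

Pairing the user and the comment inside the same parent element keeps them aligned even if one of the two is missing for some entry.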

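After analyzing the page, I found that the traditional way of pulling the top three movies out of the HTML is not very convenient. Since the data is available as JSON, it is better to read it from the Preview tab of the XHR request in the browser's developer tools, where the structure is clear at a glance.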


The content shown in the XHR Preview can be fetched as follows:

First, look up the request URL under Headers for that XHR request.
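
To see what such a URL carries, its query string can be decoded with the standard library. The sketch below uses the XHR URL from the full code further down; reading `page_limit` and `page_start` as paging controls is an assumption based on their names, not something the site documents here.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

xhr_url = ('https://movie.douban.com/j/search_subjects?type=movie'
           '&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86'
           '&sort=rank&page_limit=20&page_start=0')

parts = urlsplit(xhr_url)
# parse_qs returns a list per key; each key appears once here, so keep the first value
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print(params)

# Assumption: page_limit sets how many entries come back per request.
smaller = dict(params, page_limit='5')
rebuilt = f'{parts.scheme}://{parts.netloc}{parts.path}?{urlencode(smaller)}'
print(rebuilt)
```

With requests, the same dictionary could also be passed as `params=` to `requests.get` instead of building the query string by hand.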

res = requests.get(url, headers=headers, timeout=20)  # assume this has given us the Response object
Next, parse the body of res as JSON:
js = res.json()  # only now can the titles and URLs we want be accessed by key
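
What key access looks like after parsing can be tried without a network call. In this sketch the key names `subjects`, `title` and `url` are the ones the full code relies on; the payload itself is a hypothetical placeholder, and `json.loads` stands in for what `res.json()` roughly does with the response body.

```python
import json

# Hypothetical response body; only the key names come from the article's code.
raw_body = '''
{"subjects": [
    {"title": "Movie A", "url": "https://example.com/a/"},
    {"title": "Movie B", "url": "https://example.com/b/"}
]}
'''

data = json.loads(raw_body)      # a plain dict, unlike the Response object
first = data['subjects'][0]      # list of dicts, so index first, then use keys
titles = [item['title'] for item in data['subjects']]

print(type(data).__name__)
print(first['url'])
print(titles)
```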

Full code:

import requests
from bs4 import BeautifulSoup
import json
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
# URL taken from the XHR request
url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=rank&page_limit=20&page_start=0'
res = requests.get(url, headers=headers, timeout=20)
# print(res.status_code)
js = res.json()  # parse the body as JSON: a dict supports key access, the Response object does not

def topCinema(num):  # return {title: url} for the top `num` movies by rating
    top_cinema = {}
    # iterate over the slice itself, so a short result list cannot raise IndexError
    for item in js['subjects'][:num]:
        top_cinema[item['title']] = item['url']
    return top_cinema
# print(topCinema(4))


def getComment(movieUrl, pageNum):  # scrape page `pageNum` (1-based) of a movie's short comments
    start = (pageNum - 1) * 20
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
    # query parameters must be joined with '&', including between start and limit
    url = movieUrl + 'comments?' + 'start=' + str(start) + '&limit=20&status=P&sort=new_score'
    res = requests.get(url, headers=headers, timeout=20)
    soup = BeautifulSoup(res.text, 'html.parser')
    comment_list = soup.find_all('span', class_='short')
    user = soup.find_all('span', class_='comment-info')
    cinema_comment = {}
    # zip stops at the shorter list, so unequal lengths cannot raise IndexError
    for info, comment in zip(user, comment_list):
        # get_text() also works when the span contains nested tags
        cinema_comment[info.a.get_text(strip=True)] = comment.get_text(strip=True)
    return cinema_comment
# print(getComment('https://movie.douban.com/subject/1292052/', 1))

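# Scrape the first two pages of comments for the top 3 movies (change the range to scrape more pages)
top3 = topCinema(3)
top3_comment = {}
for name in top3:
    top3_comment[name] = {}
    for i in range(1, 3):
        # merge each page into the movie's dict so page 2 does not overwrite page 1
        top3_comment[name].update(getComment(top3[name], i))
# print(top3_comment)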

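# Save locally (the ./comment directory must already exist)
with open('./comment/top3_comment.txt', 'w', encoding='utf-8') as f:
    f.write(str(top3_comment))
    print('Saved successfully')
# the with block closes the file on exit, so no explicit close() is needed
with open('./comment/top3_comment.txt', 'r', encoding='utf-8') as r:
    print(r.read())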
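
The paging logic above can also be exercised without touching the network. The following is a minimal sketch, not the author's code: `build_comment_url`, `collect_pages` and `fake_fetch` are hypothetical helper names, the one-second default delay is an assumed value for the "interval between requests" recommended at the start, and `fake_fetch` stands in for a real fetch-and-parse function such as getComment.

```python
import time
from urllib.parse import urlencode

def build_comment_url(movie_url, page_num, page_size=20):
    """Build the comments URL for a 1-based page number."""
    query = urlencode({
        'start': (page_num - 1) * page_size,
        'limit': page_size,
        'status': 'P',
        'sort': 'new_score',
    })
    return f'{movie_url}comments?{query}'

def collect_pages(fetch_page, movie_url, pages, delay=1.0):
    """Merge the {user: comment} dicts of several pages, pausing between requests."""
    merged = {}
    for index, page_num in enumerate(pages):
        if index:                      # no pause needed before the first request
            time.sleep(delay)
        merged.update(fetch_page(build_comment_url(movie_url, page_num)))
    return merged

def fake_fetch(url):
    """Offline stand-in: returns one fake comment labelled with the page offset."""
    start = url.split('start=')[1].split('&')[0]
    return {f'user_{start}': f'comment from offset {start}'}

movie = 'https://movie.douban.com/subject/1292052/'
page2_url = build_comment_url(movie, 2)
merged = collect_pages(fake_fetch, movie, [1, 2], delay=0)
print(page2_url)
print(merged)
```

Letting `urlencode` join the parameters avoids hand-written `&` separators, and passing the fetch function in as an argument makes the merge logic testable on its own.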