Python Crawler: Scraping Maoyan Movies

Reposted

人类新新 2024-06-03 09:05:55

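This post scrapes Maoyan Movies in several different ways: regular expressions, XPath, BeautifulSoup, CSS selectors, and pyquery. It is only a record of the process, so it is a bit messy.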

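Maoyan Movies has added some anti-crawler measures, so a bare requests call may come back with a 403. It is therefore best to send headers and cookies with the request.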

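The easiest way to add them is to generate the request automatically from the one the browser itself sends.

Open a browser and find the Maoyan site; a Baidu search is enough.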

Click Charts (榜单).

Click TOP100.

Once the page has loaded, click Inspect to open the developer tools.

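Select the Network tab, then right-click on the page and choose "Reload". Some browsers call it "Refresh"; the effect is the same.

After the reload, an entry named 4 appears. It is listed as a document and its status code is 200.

Select 4 and right-click to bring up the Copy menu.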

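Choose Copy as cURL (bash). The request is now on your clipboard as a curl command.

Then open the site below and paste the clipboard content into the left pane. It automatically generates the equivalent Python requests code.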

https://curl.trillworks.com/

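Copy the code from the right pane straight into PyCharm and it will produce the response. This saves you from writing the request yourself and copy-pasting cookies by hand, which is less work and less error-prone.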

As you can see, the pasted code already includes the headers and cookies.

Run it to get the response.
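For reference, the generated code usually has roughly the following shape. This is only a minimal sketch, not the converter's actual output: the URL https://maoyan.com/board/4 is assumed from the entry named 4 above, the header values are illustrative, the cookie dictionary is a placeholder to be filled from your own browser session, and the helper name fetch_top100 is hypothetical.

```python
# Sketch of a converter-style request; every value below is an illustrative assumption.
URL = "https://maoyan.com/board/4"  # assumed address of the TOP100 board page

HEADERS = {
    # A browser-like User-Agent; copy the real one from your own request.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://maoyan.com/board",
}

# Placeholder: paste the cookie names and values copied from your browser.
COOKIES = {}


def fetch_top100(url=URL):
    """Send the request with browser-like headers and return the response."""
    import requests  # third-party; imported here so the constants load without it

    response = requests.get(url, headers=HEADERS, cookies=COOKIES, timeout=10)
    response.raise_for_status()  # raises an exception on 403 and other error codes
    return response
```

Calling response = fetch_top100() then gives the same kind of response object that the rest of this post works with.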

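You can print response.text. The output is too long to screenshot here.

For convenience, save the page to a local file so the parsing libraries can be practiced against it.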

import os

wk_dir = 'your/path/here'  # replace with your own directory

# Write the page source to disk as UTF-8 text
with open(os.path.join(wk_dir, 'maoyan.txt'), 'w', encoding='utf-8') as f:
    f.write(response.text)
print("saved")

The page is now saved.

If you keep the Python session open, you can also just keep using response.text.
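In a later session the saved file can be read back instead of sending the request again. A minimal sketch, assuming the file name maoyan.txt used above; the helper name load_html is hypothetical.

```python
import os


def load_html(wk_dir, filename="maoyan.txt"):
    """Read the saved page source back into a string."""
    path = os.path.join(wk_dir, filename)
    # Use the same encoding that was used when the file was written.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

With this, html = load_html(wk_dir) can stand in for html = response.text.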

html = response.text    
html

First, regular expressions.

Pick a piece of the page source to study.

The movie title comes right after the title attribute.

Inspecting the element confirms this.

The title sits between "title" and "data-act".

Open this online regex tester and try it:

https://c.runoob.com/front-end/854/

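The movie title is whatever the parentheses in the regular expression match, that is, the capture group. This is the information we want to extract.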
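The same idea can be tried in Python. The sketch below is an assumption on two counts: the sample string is made up to resemble the markup described above, and the pattern is one plausible way to capture the text between "title" and "data-act", not necessarily the exact one used in the online tester.

```python
import re

# Made-up one-line sample shaped like the markup described above.
sample = '<a href="/films/0" title="Example Movie" data-act="boarditem-click">Example Movie</a>'

# The parentheses form a capture group: only the text between the quotes is kept.
match = re.search(r'title="(.*?)"\s+data-act', sample)
title = match.group(1) if match else None
print(title)
```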

To find the movie score, use:

class="score".*?integer">(.*?)</i
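A quick way to check this pattern in Python is shown below. The sample line is hypothetical; it assumes the score is split into an "integer" element and a "fraction" element, as the later code in this post suggests.

```python
import re

# Hypothetical sample line; the real page markup may differ.
sample = '<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>'

pattern = r'class="score".*?integer">(.*?)</i'
match = re.search(pattern, sample)

# group(1) holds whatever the parentheses matched inside the "integer" element.
integer_part = match.group(1) if match else None
print(integer_part)
```

Note that "." does not match a newline by default, so if the score markup spans several lines in the real page, the re.S flag is needed.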

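That gives us the information we need. The next step is to write the regular expressions in Python.

First we isolate the source snippet for each movie. Looking at the page, each snippet starts with a dd tag.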

So we run the regex searches inside each dd block.

That way the fields of different movies do not get mixed up.
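Splitting the page into dd blocks could look like the sketch below. The sample markup is made up for illustration, and the pattern assumes that each movie is wrapped in a plain dd tag without attributes.

```python
import re

# Made-up miniature page with two movie entries.
html_sample = """
<dl class="board-wrapper">
<dd>
  <i class="board-index board-index-1">1</i>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
</dd>
</dl>
"""

# re.S lets "." match newlines; the lazy ".*?" stops at the first closing tag,
# so each match covers exactly one movie.
item_list = re.findall(r"<dd>.*?</dd>", html_sample, re.S)
print(len(item_list))
```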

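import re

# Assumption: each movie sits in its own <dd>...</dd> block, so split the page on those.
item_list = re.findall(r'<dd>.*?</dd>', html, re.S)

for item in item_list:
    # Rank on the board
    index = re.search(r'board-index-.*?>(\d+)</i', item)
    print(index.group(1))
    # Poster image URL
    image = re.search(r'data-src.*?"(.*?)"', item)
    print(image.group(1))
    # Movie title
    title = re.search(r'class="name".*?title="(.*?)"', item)
    print(title.group(1))
    # Cast; re.S is needed because the text spans several lines
    actor = re.search(r'class="star">(.*?)</p', item, re.S)
    print(actor.group(1).strip())
    # Release date (renamed so it does not shadow the time module)
    release_time = re.search(r'class="releasetime">(.*?)</p>', item, re.S)
    print(release_time.group(1).strip())
    # The score is split into an integer digit and a fraction digit
    score1 = re.search(r'class="integer">(\d)', item)
    score2 = re.search(r'class="fraction">(\d)', item)
    score = score1.group(1) + "." + score2.group(1)
    print(score)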

This gives the results.
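One weakness of the loop above is that re.search returns None when a field is missing, so calling .group(1) would raise an error. The sketch below is one possible variation, not part of the original approach: it collects each movie into a dictionary and tolerates missing fields. The helper names first_group and parse_item and the sample block are hypothetical.

```python
import re


def first_group(pattern, text, flags=0):
    """Return the stripped first capture group, or None when nothing matches."""
    match = re.search(pattern, text, flags)
    return match.group(1).strip() if match else None


def parse_item(item):
    """Turn one <dd> block into a dictionary of fields."""
    integer = first_group(r'class="integer">(\d)', item)
    fraction = first_group(r'class="fraction">(\d)', item)
    return {
        "index": first_group(r"board-index-.*?>(\d+)</i", item),
        "image": first_group(r'data-src.*?"(.*?)"', item),
        "title": first_group(r'class="name".*?title="(.*?)"', item),
        "actor": first_group(r'class="star">(.*?)</p', item, re.S),
        "time": first_group(r'class="releasetime">(.*?)</p>', item, re.S),
        "score": f"{integer}.{fraction}" if integer and fraction else None,
    }


# Made-up block shaped like the markup the patterns above expect.
sample_item = """<dd>
<i class="board-index board-index-1">1</i>
<img data-src="https://example.com/poster.jpg" class="board-img" />
<p class="name"><a href="/films/0" title="Example Movie" data-act="boarditem-click">Example Movie</a></p>
<p class="star">
    Starring: Actor A, Actor B
</p>
<p class="releasetime">Release date: 2000-01-01</p>
<p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>"""

record = parse_item(sample_item)
print(record)
```

With the real page, [parse_item(item) for item in item_list] would give one dictionary per movie, which is easier to save or inspect than printed lines.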