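1. Crawl workflow
1.1 Requesting the API
We use demo.py as the starting point for the crawl.
The site we want to crawl is https://spa1.scrape.center/. However, the site loads its content through an API: the data is not in the page HTML but in the API response at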
https://spa1.scrape.center/api/movie?limit=10&offset=0
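The query string suggests offset-based pagination: limit looks like the page size and offset the number of records to skip. As a minimal sketch, assuming the endpoint pages this way throughout and that there are 100 movies in total (the count value reported in the API response), the page URLs can be generated with the standard library. The names API_BASE and build_page_urls are illustrative and not part of the original project.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above; the helper below is a hypothetical name.
API_BASE = 'https://spa1.scrape.center/api/movie'


def build_page_urls(total, limit=10):
    """Return one API URL per page, stepping the offset by `limit`."""
    return [
        '{}?{}'.format(API_BASE, urlencode({'limit': limit, 'offset': offset}))
        for offset in range(0, total, limit)
    ]


if __name__ == '__main__':
    urls = build_page_urls(total=100)
    print(len(urls))   # number of pages
    print(urls[0])     # first page (offset=0)
    print(urls[-1])    # last page (offset=90)
```

Deriving the offsets from a total and a page size avoids hard-coding the upper bound of the range.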
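Create the Scrapy project:
In the target folder, run: scrapy startproject Movie
Enter the newly created Movie project: cd Movie
Generate the spider: scrapy genspider example example.com
example and example.com are placeholders; both can be changed later inside the project.
Change the code in movie.py as follows: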
import scrapy


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['spa1.scrape.center']
    # start_urls = ['http://spa1.scrape.center/']

    def start_requests(self):
        # Build the API URLs found via demo.py with a list comprehension.
        # range(0, 91, 10) yields the offsets 0, 10, ..., 90, i.e. 10 pages.
        web_url = ['https://spa1.scrape.center/api/movie?limit=10&offset={}'.format(page) for page in range(0, 91, 10)]
        for i in web_url:
            # The first argument of scrapy.Request() must be a str.
            yield scrapy.Request(i, self.get_content)

    def get_content(self, response):
        print(response.url)
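Modify the settings.py file:
Scrapy is run from the terminal with: scrapy crawl spider_name
Here spider_name is the value of the name attribute in movie.py, not the project name; in this case it is movie.
For convenience, create a run.py in the same directory and put the launch statement in it. After that, running run.py is enough to start the framework: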
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'movie'])
1.2 Extracting the data
Inspecting response.text shows that it is JSON data:
{"count":100,"results":[{"id":41,"name":"萤火之森","alias":"蛍火の杜へ","cover":"https://p1.meituan.net/movie/4c55f3bf5fa9660db3cb7014651a0950267034.jpg@464w_644h_1e_1c","categories":["剧情","爱情","动画","奇幻"],"published_at":"2011-09-17","minute":45,"score":8.8,"regions":["日本"]},{"id":42,"name":"素媛","alias":"소원","cover":"https://p0.meituan.net/movie/19653e8af59cf473cd40f9ccc0658d93692304.jpg@464w_644h_1e_1c","categories":["剧情"],"published_at":"2013-10-02","minute":123,"score":8.8,"regions":["韩国"]},{"id":43,"name":"小鞋子","alias":"بچههای آسمان","cover":"https://p1.meituan.net/movie/135c612860fae899df2220149664d97a173555.jpg@464w_644h_1e_1c","categories":["剧情","家庭"],"published_at":null,"minute":89,"score":8.8,"regions":["伊朗"]},{"id":44,"name":"熔炉","alias":"도가니","cover":"https://p1.meituan.net/movie/2a0783b4fd95566568f24adfad2181bb5392280.jpg@464w_644h_1e_1c","categories":["剧情"],"published_at":"2011-09-22","minute":125,"score":8.8,"regions":["韩国"]},{"id":45,"name":"大话西游之大圣娶亲","alias":"A Chinese Odyssey Part Two - Cinderella","cover":"https://p1.meituan.net/moviemachine/508056769092059fe43a611b949f27d14863831.jpg@464w_644h_1e_1c","categories":["喜剧","爱情","奇幻"],"published_at":"2014-10-24","minute":110,"score":8.9,"regions":["中国香港","中国大陆"]},{"id":46,"name":"新龙门客栈","alias":"New Dragon Gate Inn","cover":"https://p1.meituan.net/movie/7833126c8c21a11571bb52fbdece0acb811449.jpg@464w_644h_1e_1c","categories":["动作","爱情","武侠","古装"],"published_at":"2012-02-24","minute":88,"score":8.9,"regions":["中国香港","中国大陆"]},{"id":47,"name":"触不可及","alias":"Intouchables","cover":"https://p1.meituan.net/movie/1e700e53e4fe29dd5942381bb353c8532239179.jpg@464w_644h_1e_1c","categories":["剧情","喜剧"],"published_at":"2011-11-02","minute":112,"score":8.9,"regions":["法国"]},{"id":48,"name":"钢琴家","alias":"The Pianist","cover":"https://p0.meituan.net/movie/bcbe59fc51580317adf94537a61a1a26142090.jpg@464w_644h_1e_1c","categories":["剧情","音乐","传记","历史","战争"],"published_at":"2002-05-24","minute":150,"score":8.9,"regions":["法国","德国","英国","波兰"]},{"id":49,"name":"本杰明·巴顿奇事","alias":"The Curious Case of Benjamin Button","cover":"https://p0.meituan.net/movie/2526f77c650bf7cf3d5ee2dccdeac332244951.jpg@464w_644h_1e_1c","categories":["剧情","爱情","奇幻"],"published_at":"2008-12-25","minute":166,"score":8.9,"regions":["美国"]},{"id":50,"name":"倩女幽魂","alias":"A Chinese Ghost Story","cover":"https://p1.meituan.net/movie/96d98200d2afb4b87ff189f9c15b6545568339.jpg@464w_644h_1e_1c","categories":["爱情","奇幻","武侠","古装"],"published_at":"2011-04-30","minute":98,"score":8.9,"regions":["中国香港"]}]}
Import the json library: import json
Then run the following statements:
# Convert the JSON data returned by the API into a Python dict
source = json.loads(response.text)
print(source)
{'count': 100, 'results': [{'id': 71, 'name': '当幸福来敲门', 'alias': 'The Pursuit of Happyness', 'cover': 'https://p1.meituan.net/movie/7d1d85610651dbe1c8687781a87d1008184950.jpg@464w_644h_1e_1c', 'categories': ['剧情', '家庭', '传记'], 'published_at': '2008-01-17', 'minute': 117, 'score': 8.9, 'regions': ['美国']}, {'id': 72, 'name': '幽灵公主', 'alias': 'もののけ姫', 'cover': 'https://p0.meituan.net/movie/a08f65e6cb50fab32df5da69ff116f593095363.jpg@464w_644h_1e_1c', 'categories': ['动画', '奇幻', '冒险'], 'published_at': '1998-05-01', 'minute': 134, 'score': 8.9, 'regions': ['日本']}, {'id': 73, 'name': '十二怒汉', 'alias': '12 Angry Men', 'cover': 'https://p0.meituan.net/movie/df15efd261060d3094a73ef679888d4f238149.jpg@464w_644h_1e_1c', 'categories': ['剧情'], 'published_at': '1957-04-13', 'minute': 96, 'score': 8.9, 'regions': ['美国']}, {'id': 74, 'name': '搏击俱乐部', 'alias': 'Fight Club', 'cover': 'https://p0.meituan.net/movie/b3defc07dfaa1b6f5b74852ce38a3f8f242792.jpg@464w_644h_1e_1c', 'categories': ['剧情', '动作', '悬疑', '惊悚'], 'published_at': '1999-09-10', 'minute': 139, 'score': 8.9, 'regions': ['美国', '德国']}, {'id': 75, 'name': '疯狂原始人', 'alias': 'The Croods', 'cover': 'https://p1.meituan.net/movie/bc022b86345c643ca21d759166f77a553679589.jpg@464w_644h_1e_1c', 'categories': ['喜剧', '动画', '冒险'], 'published_at': '2013-04-20', 'minute': 98, 'score': 8.9, 'regions': ['美国']}, {'id': 76, 'name': '阿凡达', 'alias': 'Avatar', 'cover': 'https://p1.meituan.net/movie/e540384dc6c9f63bdb27cc554588a77f44305.jpg@464w_644h_1e_1c', 'categories': ['动作', '科幻', '冒险'], 'published_at': '2010-01-04', 'minute': 162, 'score': 8.9, 'regions': ['美国', '英国']}, {'id': 77, 'name': '哈尔的移动城堡', 'alias': 'ハウルの動く城', 'cover': 'https://p0.meituan.net/movie/0127b451d5b8f0679c6f81c8ed414bb2432442.jpg@464w_644h_1e_1c', 'categories': ['动画', '奇幻', '冒险'], 'published_at': '2004-09-05', 'minute': 119, 'score': 8.9, 'regions': ['日本']}, {'id': 78, 'name': '盗梦空间', 'alias': 'Inception', 'cover': 
'https://p1.meituan.net/movie/d40efe1183f29d5900f5c60be3c8a89d339225.jpg@464w_644h_1e_1c', 'categories': ['剧情', '科幻', '悬疑', '冒险'], 'published_at': '2010-09-01', 'minute': 148, 'score': 8.9, 'regions': ['美国', '英国']}, {'id': 79, 'name': '忠犬八公的故事', 'alias': "Hachi: A Dog's Tale", 'cover': 'https://p0.meituan.net/movie/5f0a709378d6b567807aa9685610f818282136.jpg@464w_644h_1e_1c', 'categories': ['剧情'], 'published_at': '2009-06-13', 'minute': 93, 'score': 8.9, 'regions': ['美国', '英国']}, {'id': 80, 'name': '拯救大兵瑞恩', 'alias': 'Saving Private Ryan', 'cover': 'https://p1.meituan.net/movie/a2a287c77415dc1f85b04d288f7d63ab1089754.jpg@464w_644h_1e_1c', 'categories': ['剧情', '历史', '战争'], 'published_at': '1998-11-13', 'minute': 169, 'score': 8.9, 'regions': ['美国']}]}
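json.loads maps JSON types onto Python types, which is why the printed result uses single quotes and why a JSON null would show up as None. A minimal sketch, using a shortened copy of one record from the response text shown earlier (the variable names raw and movie are illustrative):

```python
import json

# One record from the API response, shortened to a few fields.
raw = ('{"count":100,"results":[{"id":43,"name":"小鞋子",'
       '"categories":["剧情","家庭"],"published_at":null,'
       '"minute":89,"score":8.8}]}')

source = json.loads(raw)        # JSON object -> dict
movie = source['results'][0]    # JSON array  -> list, indexed as usual

print(type(source).__name__)    # the top-level container type
print(movie['categories'])      # JSON array of strings -> list of str
print(movie['published_at'])    # JSON null -> None
print(type(movie['score']))     # JSON number with a fraction -> float
```

Because published_at can be null, code that formats or compares the date may want to check for None first.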
Complete code:
import scrapy
import json


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['spa1.scrape.center']
    # start_urls = ['http://spa1.scrape.center/']

    def start_requests(self):
        # Build the API URLs found via demo.py with a list comprehension.
        web_url = ['https://spa1.scrape.center/api/movie?limit=10&offset={}'.format(page) for page in range(0, 91, 10)]
        for i in web_url:
            yield scrapy.Request(i, self.get_content)

    def get_content(self, response):
        # Decode the JSON body, then read the fields of each movie record.
        source = json.loads(response.text)
        for i in source['results']:
            name = i['name']
            categories = i['categories']
            score = i['score']
            print(name, categories, score)
This retrieves the name, categories, and score of every movie returned by the API.
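Printing is convenient for checking the result, but the extraction step is easier to reuse and test when it is separated from the spider. The following is a sketch rather than the original author's code: extract_movies is a hypothetical helper that takes the already-decoded dict and yields one small dict per movie. It uses dict.get so that a record lacking a field produces a default instead of raising KeyError; that the API might omit fields is an assumption.

```python
import json


def extract_movies(payload):
    """Yield name, categories and score for each movie in a decoded API response."""
    for movie in payload.get('results', []):
        yield {
            'name': movie.get('name'),
            'categories': movie.get('categories', []),
            'score': movie.get('score'),
        }


if __name__ == '__main__':
    # Shortened sample based on the response shown in section 1.2.
    sample = json.loads(
        '{"count":100,"results":['
        '{"id":41,"name":"萤火之森","categories":["剧情","爱情","动画","奇幻"],"score":8.8},'
        '{"id":42,"name":"素媛","categories":["剧情"],"score":8.8}]}'
    )
    for item in extract_movies(sample):
        print(item['name'], item['categories'], item['score'])
```

Inside get_content, the same helper could be driven with yield from extract_movies(json.loads(response.text)), so that Scrapy receives the dicts as items instead of only printing them.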