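Headless Chrome is Chrome running without a visible window. Pages are rendered exactly as in desktop Chrome, with JavaScript executed by the fast V8 engine.
It can replace PhantomJS, whose Selenium driver has been deprecated in favor of headless Chrome and Firefox.
To drive headless Chrome you need the ChromeDriver (WebDriver) release that matches your installed Chrome version.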
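
As a rough illustration of the version requirement, the sketch below compares the major version numbers reported by `chrome --version` and `chromedriver --version`. The helper names and the sample strings are hypothetical. It assumes the four-part numbering that ChromeDriver shares with Chrome from release 73 onward; the older ChromeDriver 2.x releases used their own numbering, so the helper rejects them rather than guess.

```python
import re


def major_version(version_output):
    """Return the major version from text such as 'Google Chrome 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError("no four-part version number in %r" % version_output)
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Sample strings for illustration only; on a real machine these would come
# from running the two binaries with --version.
print(versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 114.0.5735.90"))
print(versions_match("Google Chrome 115.0.5790.102", "ChromeDriver 114.0.5735.90"))
```

The first call prints `True` and the second prints `False`.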

Prerequisites

Windows 10
Anaconda3
Python 3.6.2
Selenium
WebDriver (ChromeDriver)
Scrapy

Download WebDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads


Implementation

First, create a Scrapy project:

scrapy startproject xintong

Create a spider:

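# -*- coding: utf-8 -*-
import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    # Must cover the host used in start_urls (m.weibo.cn); otherwise any
    # follow-up requests would be filtered as off-site.
    allowed_domains = ['weibo.cn']
    start_urls = ['https://m.weibo.cn/u/2937210565']

    def parse(self, response):
        # The response is the page as rendered by headless Chrome.
        print("Rendered page content received")
        # Each post sits in a "card9" block inside the #app container.
        for sel in response.xpath('//div[@id="app"]//div[contains(@class, "card9")]'):
            title = sel.xpath('.//div[@class="weibo-text"]/text()').extract_first()
            print('Title:', title)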
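
To see what the two XPath expressions in `parse` select, here is a minimal sketch that runs the same logic on a hand-written fragment. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, and the markup is invented for illustration, not real Weibo HTML. Note that `contains(@class, "card9")` is a substring test, so it would also match a class such as `card90`.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the rendered page.
SAMPLE = """
<html><body>
  <div id="app">
    <div class="card m-panel card9"><div class="weibo-text">first post</div></div>
    <div class="card card9 weibo-member"><div class="weibo-text">second post</div></div>
    <div class="card card11"><div class="weibo-text">not a post card</div></div>
  </div>
</body></html>
"""


def extract_titles(markup):
    """Collect the text of .weibo-text blocks found inside card9 blocks under #app."""
    root = ET.fromstring(markup.strip())
    app = root.find(".//div[@id='app']")
    titles = []
    for div in app.iter("div"):
        # Same substring test as contains(@class, "card9") in the spider.
        if "card9" in div.get("class", ""):
            text_div = div.find(".//div[@class='weibo-text']")
            if text_div is not None and text_div.text:
                titles.append(text_div.text)
    return titles


print(extract_titles(SAMPLE))  # prints ['first post', 'second post']
```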

Add a downloader middleware in middlewares.py:

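# -*- coding: utf-8 -*-
import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class XintongSpiderMiddleware(object):
    """Downloader middleware that fetches every request with headless Chrome."""

    def __init__(self):
        option = Options()
        option.add_argument('--headless')
        # Path to the chromedriver binary; adjust it for your machine.
        # executable_path is the Selenium 3 keyword; Selenium 4 passes the
        # driver path through a Service object instead.
        self.driver = webdriver.Chrome(executable_path="D:/Python/test3/chromedriver.exe",
                                       options=option)

    @classmethod
    def from_crawler(cls, crawler):
        # Register spider_closed so the browser is shut down when the crawl ends.
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        print("Page rendering started...")
        # Scroll down to trigger lazy loading, then give the scripts a moment to run.
        self.driver.execute_script("scroll(0, 1000);")
        time.sleep(1)
        rendered_body = self.driver.page_source
        print("Page rendering finished...")
        # Returning a response here makes Scrapy skip its own download of the URL.
        return HtmlResponse(request.url, body=rendered_body, encoding="utf-8")

    def spider_closed(self, spider, reason):
        print('Driver closed')
        # quit() ends the browser and the chromedriver process; close() only closes the window.
        self.driver.quit()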
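
The fixed `time.sleep(1)` in the middleware is a guess at how long rendering takes. A common alternative is to poll until the content you need is present. The sketch below shows that idea in plain Python. `wait_until` and `cards_present` are hypothetical names, and the condition is a stand-in that succeeds on its third call. In the middleware the condition would instead query the driver for the post cards. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for this purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s s" % timeout)
        time.sleep(interval)


# Stand-in for "are the post cards in the DOM yet?": empty twice, then found.
calls = {"n": 0}


def cards_present():
    calls["n"] += 1
    return ["card"] if calls["n"] >= 3 else []


print(wait_until(cards_present, timeout=2.0, interval=0.01))  # prints ['card']
```

Polling returns as soon as the page is ready and fails loudly when it never is, whereas a fixed sleep does neither.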

Edit settings.py and change the following settings:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in User-Agent middleware
    'xintong.middlewares.XintongSpiderMiddleware': 543,
}
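
Why registering the middleware is enough: Scrapy calls `process_request` on downloader middlewares in increasing priority order, and as soon as one of them returns a response, the remaining `process_request` methods and the normal download are skipped. The toy model below illustrates only that rule. Every class and function in it is hypothetical and much simpler than Scrapy's real machinery, and `process_response` handling is left out.

```python
class FakeResponse:
    """Minimal stand-in for a response object."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class PassThrough:
    def process_request(self, url):
        return None  # None means "carry on with the next middleware"


class RenderInBrowser:
    def process_request(self, url):
        # Stands in for the Selenium-based middleware.
        return FakeResponse(url, "<html>rendered by browser</html>")


def network_download(url):
    # Stands in for Scrapy's built-in downloader.
    return FakeResponse(url, "<html>raw</html>")


def fetch(url, middlewares):
    """middlewares is a list of (priority, middleware); lower numbers run first."""
    for _, middleware in sorted(middlewares, key=lambda pair: pair[0]):
        response = middleware.process_request(url)
        if response is not None:
            return response  # short-circuit: the downloader is never reached
    return network_download(url)


url = "https://m.weibo.cn/u/2937210565"
print(fetch(url, [(543, RenderInBrowser()), (100, PassThrough())]).body)
print(fetch(url, [(100, PassThrough())]).body)
```

The first call prints the browser-rendered body and the second falls through to the raw download.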

Output of a run:

scrapy crawl weibo
2018-05-31 22:43:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-31 22:43:06 [scrapy.core.engine] INFO: Spider opened
2018-05-31 22:43:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-31 22:43:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-31 22:43:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/url {"url": "https://m.weibo.cn/u/2937210565", "sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Page rendering started...
2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/execute {"script": "scroll(0, 1000);", "args": [], "sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
2018-05-31 22:43:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
[0531/224307.637:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://tva2.sinaimg.cn will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://m.weibo.cn/u/2937210565 (0)
[0531/224307.653:INFO:CONSOLE(0)] "The SSL certificate used to load resources from https://tva1.sinaimg.cn will be distrusted in the future. Once distrusted, users will be prevented from loading these resources. See https://g.co/chrome/symantecpkicerts for more information.", source: https://m.weibo.cn/u/2937210565 (0)
2018-05-31 22:43:08 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54161/session/60af686269ddbb5b2f03ea11de7afb1b/source {"sessionId": "60af686269ddbb5b2f03ea11de7afb1b"}
2018-05-31 22:43:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Page rendering finished...
2018-05-31 22:43:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://m.weibo.cn/u/2937210565> (referer: None)
Rendered page content received
Title: 把您的答案留下来吧 ~
Title: 突然看到了很早之前吃过的零食,是不是暴露年龄了
Title: 【二哈撞伤骑车大爷逃走,主人赔三万多】5月21日,江苏淮安。一条哈士奇从主人电动车掉下后,狂奔追赶,撞上骑电动车的马大爷。大爷摔倒骨折,哈士奇“肇事逃逸”。交警追踪找到狗主人,大爷获赔33200元。
Title: 所有收费站恢复通行
Title: 各位老师er吃完晚饭了吗?要不要出来遛弯跟蜀黍偶遇一波
Title: 【网红化妆品真能“火”!靠近火焰“蹭蹭”往上蹿火?】注意啦!“网红防晒喷雾、杀虫喷雾、非常易燃助燃,一瓶花露水的酒精含量就达70%。”5月30日,扬州一个消防实验 让我们了解:一些化妆品如果使用不当,很有可能起不到保护皮肤的作用,反而会引发火灾,造成烧伤烫伤。
Title: 机场收费站恢复通行!
Title: 据说出去玩的时候,每个位置职责都不一样!你是哪一个?
Title: 做好笔记了!汽车常见易损零部件的常识![并不简单]
Title: 这是什么黑科技?不过我更关注下雨咋办?
2018-05-31 22:43:08 [scrapy.core.engine] INFO: Closing spider (finished)

References:

https://developers.google.com/web/updates/2017/04/headless-chrome
https://docs.scrapy.org/en/latest/
https://intoli.com/blog/installing-google-chrome-on-centos/