This exercise practices scraping dynamically loaded pages. Since I personally like horror movies, I picked Douban's horror films as the target.

[Screenshots: the Douban horror movie listing page]

The page loads content dynamically: clicking "load more" brings in additional entries. So open the browser's DevTools with F12, switch to the Network tab, filter by XHR, and watch what gets loaded.

[Screenshot: the DevTools Network panel filtered to XHR]

The Request URL under Headers reveals the endpoint that returns the data. After clicking "load more" a few times, you will see URLs like: https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start=0

The only thing that changes each time is the trailing start value, which grows by 20 per request; 20 is simply the number of movies loaded per batch, which you can confirm by counting. That pins down the URL pattern to scrape.
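A minimal sketch of that pattern (the endpoint and parameters are exactly the ones observed above; the number of pages here is an arbitrary choice for illustration):

base = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start={0}'
for page in range(3):
    print(base.format(page * 20))  # start=0, 20, 40 -> one batch of 20 movies each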


Next, switch to the Preview tab and inspect the JSON dictionary to decide which fields to extract, such as id, title, rate, and so on.

[Screenshot: the Preview tab showing the JSON response]
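A quick sketch of parsing that response (the field names are taken from the JSON seen in Preview; the shortened User-Agent string is just an example):

import requests

url = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start=0'
headers = {'User-Agent': 'Mozilla/5.0'}
js = requests.get(url, headers=headers, timeout=6).json()
for movie in js['data']:
    print(movie['id'], movie['title'], movie['rate'])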


But the fields above were not enough for me; I also wanted each movie's release year, directors, actors, and runtime. Not being able to get those kept this crawler unfinished for a long time, until one day, while browsing other movie categories on Douban, I noticed that hovering the cursor over certain movie posters pops up much more detailed info, as shown below. Opening F12 and checking XHR confirmed it: this info is also loaded asynchronously via XHR, also as JSON, and in more detail.

[Screenshots: the hover card with detailed movie info, and its XHR request in DevTools]

As shown, this kind of URL needs a movie's id to return the corresponding info, and getting the id is trivial: just parse the JSON obtained earlier. The remaining problems are how to scrape quickly and how to avoid getting the IP banned. Building on the multiprocessing module learned last time, a process pool takes care of speed. For rotating IPs, first write a crawler that grabs free proxies from "kuaidaili", then feed the collected IPs into the Douban crawler.
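In isolation, the detail lookup is just one more request. A sketch (the subject_abstract endpoint and field names below are the same ones the full code uses; the id value is a placeholder you would parse from the listing JSON):

import requests

movie_id = '1234567'  # placeholder id taken from the listing JSON
url = 'https://movie.douban.com/j/subject_abstract?subject_id={0}'.format(movie_id)
data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=6).json()['subject']
print(data['rate'], data['release_year'], data['duration'], data['directors'], data['actors'][:4])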

Here is the code:

import requests
import json
from multiprocessing import Pool
import pymysql
import random


def get_ip():  # load the scraped proxies from the file and return one as a dict
    with open('ip池.txt', 'r') as r:
        datas = r.readlines()
    data = random.choice(datas).strip()  # pick a random line, e.g. 'HTTP':'http://1.2.3.4:8080'
    key, value = data.split(':', 1)
    proxies = {key.strip("'"): value.strip("'")}  # convert to the dict form requests expects
    return proxies


def get_id(page):  # fetch one batch of the listing and store each movie via pymysql
    db = pymysql.connect(host='localhost', user='your_username', password='your_password',
                         database='your_database', charset='utf8')
    cursor = db.cursor()
    url = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start={0}'.format(page*20)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
    try:  # handle exceptions so a banned IP or a stalled request does not kill the crawler
        proxies = get_ip()
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    except requests.exceptions.ConnectTimeout:
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()  # retry with the same proxy
    except:
        proxies = get_ip()  # any other failure: switch to a fresh proxy and retry
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    try:
        for i in range(20):
            data = js['data'][i]
            title = data['title']
            movie_id = data['id']
            score, year, duration, directors, actors = get_detail(movie_id)
            print('Saving {0}, via proxy {1}'.format(title, proxies))  # shows progress while running
            sql = 'insert into horror values (%s,%s,%s,%s,%s,%s)'
            cursor.execute(sql, (str(title), score, year, duration, directors, actors))
            db.commit()
    except IndexError:
        print('No more content')  # fewer than 20 entries left on the last page
    db.close()

def get_detail(movie_id):  # fetch the more detailed info for one movie by its id
    url = 'https://movie.douban.com/j/subject_abstract?subject_id={0}'.format(movie_id)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
    try:
        proxies = get_ip()
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    except requests.exceptions.ConnectTimeout:
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()  # retry with the same proxy
    except:
        proxies = get_ip()  # any other failure: switch to a fresh proxy and retry
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    data = js['subject']
    score = data['rate']
    try:
        year = data['release_year']
    except KeyError:
        year = 'unknown'
    try:
        directors = data['directors'][0]
    except (KeyError, IndexError):
        directors = 'unknown'
    duration = data['duration']
    actors = data.get('actors', [])[:4]  # keep at most four actors; slicing never raises
    return str(score), year, duration, directors, ', '.join(actors)


def main(i):
    get_id(i)

if __name__ == '__main__':
    pages = [i for i in range(550)]
    pool = Pool(8)  # a pool of 8 worker processes (multiprocessing, not threads)
    pool.map(main, pages)  # map must run before close/join, otherwise nothing gets scraped
    pool.close()
    pool.join()

Finally, the results are stored in the database:

[Screenshot: the rows stored in the MySQL table]
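For reference, a table matching the six-column insert above could be created like this (a sketch: only the table name horror and the column count come from the code, the column names and types are my assumptions):

import pymysql

db = pymysql.connect(host='localhost', user='your_username', password='your_password',
                     database='your_database', charset='utf8')
cursor = db.cursor()
cursor.execute('''create table if not exists horror (
    title varchar(255), score varchar(16), year varchar(16),
    duration varchar(32), directors varchar(255), actors varchar(255)
)''')
db.close()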





As a supplement, here is the crawler for the free proxy IPs (it needs a sleep between requests: go too fast and many IPs get missed, since the site probably rate-limits crawlers):



import requests
from lxml import etree
import time

def get_ip(url):
    headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    res = requests.get(url, headers=headers).text
    s = etree.HTML(res)
    for i in range(1, 16):  # 15 rows of proxies per page
        ip = s.xpath('//*[@id="list"]/table/tbody/tr[{0}]/td[1]/text()'.format(i))
        port = s.xpath('//*[@id="list"]/table/tbody/tr[{0}]/td[2]/text()'.format(i))
        htt = s.xpath('//*[@id="list"]/table/tbody/tr[{0}]/td[4]/text()'.format(i))
        proxies = {'{0}'.format(htt[0]): '{0}://{1}:{2}'.format(htt[0], ip[0], port[0])}
        url2 = 'https://www.baidu.com/'  # used to test whether the scraped proxy actually works
        try:
            ress = requests.get(url2, proxies=proxies, timeout=6)  # a dead proxy raises instead of hanging forever
        except:
            print('{0} is unusable'.format(ip[0]))
            continue
        if ress.status_code == 200:
            print('OK, saving {0}'.format(ip[0]))
            with open('ip池.txt', 'a') as f:
                f.write('\'{0}\':\'http://{1}:{2}\'\n'.format(htt[0], ip[0], port[0]))
        else:
            print('{0} is unusable'.format(ip[0]))


def main():
    with open('ip池.txt', 'w'):  # opening in 'w' mode truncates the file, refreshing the IP pool
        pass
    for i in range(1, 11):
        url = 'https://www.kuaidaili.com/free/inha/{0}'.format(i)
        get_ip(url)
        time.sleep(1)  # throttle so the site does not block us

if __name__ == '__main__':
    main()
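
Each line this crawler writes to ip池.txt looks like the following (the address is an illustrative placeholder), which is exactly the format that get_ip() in the Douban crawler splits back into a proxies dict:

'HTTP':'http://1.2.3.4:8080'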