Day 22

I. Page Scrolling

1. Performing the scroll - execute the JavaScript scroll code: window.scrollBy(x_offset, y_offset)

  • When browsing product pages, not everything is shown at first; more items are loaded as the user scrolls. To extract that data with a crawler, the scrolling has to be performed from code so the extra items actually load.
  • Example: before scrolling, a JD.com search page only contains 30 products, whereas after scrolling we can extract 60. Since we cannot scroll every page by hand while crawling, the code below performs the scrolling for us; a more general height-based variant is sketched after it.
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from time import sleep
from bs4 import BeautifulSoup

b = Chrome()
b.get('https://www.jd.com')
b.find_element(By.ID, 'key').send_keys('电脑\n')  # type the keyword into the search box; '\n' submits the search
sleep(1)


def scroll(num):
    for i in range(num):
        for x in range(10):
            b.execute_script('window.scrollBy(0, 700)')
            sleep(1)
        soup = BeautifulSoup(b.page_source, 'lxml')
        goods_li = soup.select('#J_goodsList>ul>li')
        print(len(goods_li))
        next_btn = b.find_element(By.CLASS_NAME, 'pn-next')  # "next page" button
        next_btn.click()
        sleep(1)


scroll(3)  # scrape the first three result pages
input('Press Enter to close: ')
b.close()
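
The fixed 700-pixel steps above are tuned to this particular page. A more general pattern is to keep scrolling to the bottom until document.body.scrollHeight stops growing, which works regardless of how tall the page is. A minimal sketch of that variant; the pause length and round limit are assumptions you would tune per site, and the driver argument is the Chrome object created above.

from time import sleep


def scroll_to_bottom(driver, pause=1, max_rounds=30):
    # keep scrolling until the page height stops growing (or max_rounds is reached)
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:  # nothing new was loaded
            break
        last_height = new_height


# usage with the browser object from the example above:
# scroll_to_bottom(b)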

II. Automatic Login with requests

1. Principle

  • How auto-login works: log in manually in a browser, grab the post-login cookie (the login information), then send that cookie along with every request made from code.
  • Some sites only return their data after login. We do not want to type the account and password on every crawl, so after logging in once we copy the cookie out of the browser's developer tools and write it into the crawler's request headers; the site then treats the crawler as already logged in (a Session-based variant is sketched after the example below).


import requests
headers = {
    'cookie': '_zap=00f8ddd9-6235-4b8c-9a59-9595c01d2126; _xsrf=aed78bfb-693b-4f9f-8ac4-de2e050da298; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1660619825; d_c0="AIBXT5M9aRWPTpTJATyFo4VHJkefNKGZsUw=|1660619825"; SESSIONID=2nqh8f9iccHNBAuIUdldfQGqEkHfwoTQ9uANQMjuNRx; JOID=W1gdA0wY0Oym2cOAcBFH9QZF9-5kI7aVye-ssRF1uqzgnIz8MSjcGM3ewoB_F1bMTbsYhkqL2zbtBp_meALIqSU=; osd=WlkdBEwZ0eyh2cKBcBZH9AdF8O5lIraSye6tsRZ1u63gm4z9MCjbGMzfwod_FlfMSrsZh0qM2zfsBpjmeQPIriU=; __snaker__id=RX9CpZC5sHyi8zMi; gdxidpyhxdE=AJEIDOMCW\3WzVDwP8/2pUPSgJJ9nN4tvVD\Kir\m7fwWsgCkeSN9uimmQ\eHp/PxTdDQT6QC04Mm1/T3In+vNR\l0Z\D+SfY8fxx68erKDaVfnQWrL5xbpSIoEf7Kw4z25wb0jEakPkx63w7I3iWlQbka+Kf9nlKZnMpYOq3yH9O+1I:1660620726180; _9755xjdesxxd_=32; YD00517437729195:WM_NI=E6XPuX1CpMe8DeTGKaLfNX392qDZdzmyftg8G/1omZuiLOVRurZjZo+3COJjpwYz2fHw88JL547fk3EL55R81ld5yfSu9AUvcEd+9X1f2Dt2ak9rvmNS09Uoz0ZOohu8YVU=; YD00517437729195:WM_NIKE=9ca17ae2e6ffcda170e2e6eeb7c95996a99c89ca54a2eb8ba6d54b969b9b82c5548ea8a58bc73e9a93b8d8e22af0fea7c3b92a87ea98b1f854f39dc0aee83488bb9ca8ca6b90ba82b9b233f2af8cbbe772afa7aa95e63485f1ff90f052b7b784b0d55ab7ae968bd965e9ed8db0c962b4a9979bb16e8ca9f8d9c9488ea7ae87ea398794fbd6d134f3b4acd6fc6e91abf7a2d87ea19bacccbc54f3bafea5f73aa3b5f793f5678e8d00dad865baf096b0ee64b7e897b7d837e2a3; YD00517437729195:WM_TID=fsO+7RDWr7lEBFAVVUOBWIw0zD6YY0/c; o_act=login; ref_source=other_https://www.zhihu.com/signin?next=/; auth_type=qqconn; atoken_expired_in=7776000; token=76D3E02F1DAF07D350615BE7496C5828; atoken=76D3E02F1DAF07D350615BE7496C5828; client_id=D63168087690CE5FAD9EF0ED8C855599; captcha_session_v2=2|1:0|10:1660619882|18:captcha_session_v2|88:OG5QbDhRd29tSFl6ZkNlUytsekp5QVVub2RuMjNTUndYQlE0NTNMU0FCWlE1ckY2dDF6SkhTSGY0eHdoL3czdA==|b47e330e33305c54ca4b9fc6368f41b9a12ef9d1c899bd1377a9057fc37e24e4; captcha_ticket_v2=2|1:0|10:1660619909|17:captcha_ticket_v2|704:eyJ2YWxpZGF0ZSI6IkNOMzFfcURMaDlkNndidS40ZXd3Y1YtbFFNcE1DZmJvbjFkT2pDR0d4OGV6NTF6X0lzZFNpaWRYbFY0NElPT3hCVW40UXducFpSYUJaYXhJa1Uwbm80ZHVyWi1vcVllSVJENS56WUF1QWh3VFhKUjVwdXpDbXY4QThQeUJoYXRiRlVSSnpjYTZ5QnoyUWRRLVFKaEpBVmhkVXR6aks2V2UySmNqQldZOEZqVC1GZm9wWVFmbDZFMXM0RE9LY2tHV3F4aC01VzRwQ2pUU1JCNDQySG03Wi5LUHRzVnp4c2NMRXlRbl9pTGhYT0xUNXRqLnVIQ1hYQ0ZPSXlraEU3dm5zYmJLUGd2Zlhlcm1WUWhMbjFYWUlMYmZmOXZ4WlBnUThwOWtWcEsyVnguSVVpQlYxUEhZSlBKZEF1Tks4Tl9lVEd2amRJbUVKa3dQLmVKX2FMOEJqSjVDSm1WVEJOaFJWeEVKLTRMN3EwNDFTOFZsdUV4UGc1Ujl0amJ6cnZOSXY1am5rMUdyNGtpVlZrakRKejVIb21fQ0NNcWlvaWpZdFRCTXpUTFVfdHBtVDFwMFRUYTBKeWZfYk1jelpkekhSVU52WjFQeW5vY0VkaXFSVWU2OThlZzRYblI0Z2NRVFhYY1pzcU91RERSOG8xT2RWb1p1WnFkbmlWV1ZDWFdTMyJ9|a6c57a8d58887ec52463eaa0d4868179b8a811eefad11671a9ad3ea72b3fa36e; z_c0=2|1:0|10:1660619961|4:z_c0|92:Mi4xcktaTE1nQUFBQUFBZ0ZkUGt6MXBGU1lBQUFCZ0FsVk51VnJvWXdERGh2eHZhNVh6bml0M2FtdTE2cncyeUNPWTBB|58bf2322a266b444afcfab6bef01f0688cd81f4e325b337e88ef72777766f2b0; q_c1=c8fb49cd510547b7a9a57719827c2407|1660619961000|1660619961000; tst=r; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1660619963; NOT_UNREGISTER_WAITING=1; KLBRSID=e42bab774ac0012482937540873c03cf|1660620011|1660619824',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
response = requests.get('https://www.zhihu.com', headers=headers)
print(response.text)
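
Stuffing the entire cookie string into the headers works, but requests can also carry the login state through a Session. A minimal sketch of that variant; the raw_cookie string below is a placeholder you would replace with the value copied from your own browser.

import requests

# placeholder: paste the cookie string copied from the browser's developer tools here
raw_cookie = '_zap=xxx; _xsrf=yyy; z_c0=zzz'

# turn 'name=value; name=value' into a dict that requests understands
cookie_dict = dict(pair.split('=', 1) for pair in raw_cookie.split('; '))

session = requests.Session()
session.headers['user-agent'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/114.0.0.0 Safari/537.36')
for name, value in cookie_dict.items():
    session.cookies.set(name, value)  # every request made with this session now carries the cookie

response = session.get('https://www.zhihu.com')
print(response.status_code)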

III. Getting and Using Cookies with selenium

  • Some sites require a login before a search or other operation returns real content. With selenium we can log in once by hand, save the cookies to a file, and reuse them on later runs, as the two scripts below show; a sketch of reusing the saved cookies with requests follows them.
from selenium.webdriver import Chrome
from json import dumps

b = Chrome()

# 1. Open the website that needs the automatic login (the site whose cookies we want)
b.get('https://www.taobao.com/')

# 2. Leave enough time for a human to log in and to refresh the page into its logged-in state
# Important: the first page must show the logged-in state before continuing
input('Press Enter once you have finished logging in: ')

# 3. Get the post-login cookies and save them to a local file
cookies = b.get_cookies()
print(cookies)

with open('files/taobao.txt', 'w', encoding='utf-8') as f:
    f.write(dumps(cookies))
    
    
from selenium.webdriver import Chrome
from json import loads

b = Chrome()

# 1. Open the page that needs the automatic login
b.get('https://www.taobao.com/')

# 2. Add the saved cookies (the browser must already be on a page of the same domain)
with open('files/taobao.txt', encoding='utf-8') as f:
    content = f.read()
    cookies = loads(content)

for x in cookies:
    b.add_cookie(x)

# 3. Reopen the page that requires the login; it should now show the logged-in state
b.get('https://www.taobao.com/')
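
The cookies that selenium saved can also be handed straight to requests, so the slow browser is only needed for the one-off manual login and all later crawling can use plain requests. A minimal sketch, assuming the files/taobao.txt produced by the first script already exists.

import requests
from json import loads

# load the cookie list that the selenium script saved earlier
with open('files/taobao.txt', encoding='utf-8') as f:
    selenium_cookies = loads(f.read())

session = requests.Session()
for c in selenium_cookies:
    # selenium stores each cookie as a dict with 'name' and 'value' keys
    session.cookies.set(c['name'], c['value'])

response = session.get('https://www.taobao.com/')
print(response.status_code)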

IV. Using Proxies with requests

1. Proxy IPs

  • If we request a site too often, it may recognise the traffic as a crawler and block our IP address for a while. Switching to a proxy IP lets the crawl continue, as in the example below; a rotating-pool variant is sketched after it.
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# proxy server address (example value; substitute a working proxy)
proxies = {
    'http': 'http://119.7.141.32:4531',
    'https': 'http://119.7.141.32:4531'
}
response = requests.get('https://movie.douban.com/top250', headers=headers, proxies=proxies)
print(response.text)
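
A single proxy tends to get blocked or expire quickly, so in practice the crawl usually rotates through a small pool and retries on failure. A minimal sketch; the addresses in the pool are placeholders you would replace with ones from your proxy provider.

import random
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# placeholder pool; in real use the addresses come from a proxy provider
proxy_pool = [
    'http://119.7.141.32:4531',
    'http://119.7.141.33:4531',
]


def get_with_proxy(url, retries=3):
    for _ in range(retries):
        ip = random.choice(proxy_pool)
        try:
            return requests.get(url, headers=headers,
                                proxies={'http': ip, 'https': ip}, timeout=5)
        except requests.RequestException:
            continue  # this proxy failed; try the next one
    return None


response = get_with_proxy('https://movie.douban.com/top250')
if response is not None:
    print(response.text)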