Static Web Scraping with Requests

import requests
r = requests.get('http://www.baidu.com/')
print(r.encoding)      # text encoding Requests detected for the body
print(r.status_code)   # HTTP status code of the response
print(r.text)          # response body decoded as text


r.text — the response body as text, decoded automatically using the character encoding from the response headers

r.encoding — the text encoding used for the response content

r.status_code — the HTTP status code of the response

r.content — the response body as raw bytes

r.json() — the JSON decoder built into Requests
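A minimal offline sketch of how these attributes relate to each other, using a hand-built Response object so no network access is needed (this is for illustration only; in normal use you get a Response back from requests.get(), and setting _content directly is not part of the public API):

```python
import json
from requests.models import Response

# Build a Response by hand so the attributes can be inspected offline.
r = Response()
r.status_code = 200
r.encoding = 'utf-8'
r._content = json.dumps({'q': '爬虫'}).encode('utf-8')

print(r.status_code)  # the HTTP status code
print(r.content)      # the raw body, as bytes
print(r.text)         # r.content decoded using r.encoding
print(r.json())       # the built-in JSON decoder applied to the body
```

The key relationship: r.text is r.content decoded with r.encoding, and r.json() parses that text as JSON.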

Passing URL parameters

https://so..net/so/search?q=爬虫
import requests
params = {'q': '爬虫'}  # renamed from dict, which shadows the built-in
# The search URL is truncated in the original; pass it as the first argument.
r = requests.get('https://so..net/so/search', params=params)
print(r.url)
print(r.text)
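To see how Requests assembles the final URL from a params dict without making a request, the URL-building step can be run in isolation via PreparedRequest (example.com here is a placeholder domain, not the site from the text):

```python
from requests.models import PreparedRequest

# Sketch of the URL-building step only; no request is sent.
req = PreparedRequest()
req.prepare_url('https://example.com/so/search', {'q': '爬虫'})
print(req.url)  # non-ASCII parameter values are percent-encoded automatically
```

Note that the Chinese characters are percent-encoded in the final URL, which is why r.url in the example above differs from the raw string you typed.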


Customizing request headers

Headers carry information about the request, the response, or other entities being sent.


You can view the request headers in the browser's developer tools.

Setting request headers

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    # the Host value is truncated in the original source
}
params = {'q': '爬虫'}
r = requests.get('https://so..net/so/search', params=params, headers=headers)
print(r.status_code)


A status code of 200 means the request succeeded.
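To check which headers a request would actually send, the request can be prepared without being dispatched (an offline sketch; example.com is a placeholder, and the User-Agent string mirrors the one above):

```python
from requests import Request

# Prepare a request to inspect its outgoing headers; nothing is sent.
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
prepared = Request('GET', 'https://example.com/', headers={'User-Agent': ua}).prepare()
print(prepared.headers['User-Agent'])  # the custom header is in place
```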

POST requests

Change requests.get to requests.post.

Change the params argument to data.

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    # the Host value ('so.…') is truncated in the original source
}
data = {'q': '爬虫'}
r = requests.post('https://so..net/so/search', data=data, headers=headers)
print(r.status_code)
print(r.text)
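The practical difference between params and data can be seen by preparing a POST offline: the dict is form-encoded into the request body rather than appended to the URL (example.com is a placeholder domain):

```python
from requests import Request

# Prepare (but do not send) a POST to inspect what data= produces.
prepared = Request('POST', 'https://example.com/so/search', data={'q': '爬虫'}).prepare()
print(prepared.body)                     # form-encoded request body
print(prepared.headers['Content-Type'])  # set automatically for form data
```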

Setting a timeout

A common value is 20 seconds.

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
}
# The URL is omitted in the original; the example site from earlier is used here.
r = requests.get('http://www.baidu.com/', headers=headers, timeout=20)
print(r.status_code)


If the request takes longer than the timeout, an exception is raised.

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
}
# The URL is omitted in the original; the example site from earlier is used here.
# timeout=0.001 is deliberately tiny, so this raises requests.exceptions.Timeout.
r = requests.get('http://www.baidu.com/', headers=headers, timeout=0.001)
print(r.status_code)
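In a real crawler you would catch the exception rather than let it crash the program. A sketch of that pattern, using a non-routable placeholder address (10.255.255.1) to force a failure; depending on the environment this raises Timeout or ConnectionError, both subclasses of RequestException:

```python
import requests

# Wrap the request so a timeout or connection failure is handled gracefully.
try:
    r = requests.get('http://10.255.255.1/', timeout=0.001)
    print(r.status_code)
except requests.exceptions.RequestException as e:
    print('request failed:', type(e).__name__)
```

Catching requests.exceptions.RequestException covers timeouts, connection errors, and other request-level failures in one handler.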



Hands-on: scraping the Douban Top 250 movies

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
    'Host': 'movie.douban.com',
    'Cookie': 'll="118267"; bid=GDklS76UbIM; dbcl2="251015641:WiqhW39971k"; ck=BRDD; push_noty_num=0; push_doumail_num=0'
}
movie_list = []
for i in range(0, 10):  # the Top 250 spans 10 pages of 25 movies each
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    if i == 0:
        url = 'https://movie.douban.com/top250'
    r = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    div_list = soup.find_all('div', 'hd')  # each title sits in <div class="hd">
    for div in div_list:
        spans = div.find_all('span')
        movie_list.append(spans[0].text)  # the first <span> is the main title
print(movie_list)
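The parsing step can be tested offline against a hand-written HTML fragment shaped like the Top 250 list (the titles below are made-up sample data, and html.parser is used instead of lxml to avoid the extra dependency):

```python
from bs4 import BeautifulSoup

# A tiny fragment mimicking the list structure, for testing the parse only.
html = '''
<div class="hd"><a href="#"><span class="title">电影甲</span><span class="other">/ Alt Title</span></a></div>
<div class="hd"><a href="#"><span class="title">电影乙</span></a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = [div.find_all('span')[0].text for div in soup.find_all('div', 'hd')]
print(titles)  # only the first <span> in each block is kept
```

Separating the parsing logic like this makes it easy to verify the selectors before running the full crawl.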
