Scraping Static Web Pages with Requests
import requests
r=requests.get('http://www.baidu.com/')
print(r.encoding)
print(r.status_code)
print(r.text)
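r.text: the response body as text, decoded automatically according to the character encoding declared in the response headers
r.encoding: the text encoding used for the response body
r.status_code: the HTTP status code of the response
r.content: the response body as raw bytes
r.json(): the JSON decoder built into Requests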
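To make the relationship between these attributes concrete, here is a minimal sketch that uses only the standard library and sends no request. The names decode_body, raw_body, and declared_encoding are illustrative and not part of Requests; the working assumption is that r.text is, roughly, r.content decoded with r.encoding.

```python
import json


def decode_body(raw_body: bytes, declared_encoding: str) -> str:
    """Decode a byte body into text, similar in spirit to how r.text relates to r.content."""
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw_body.decode(declared_encoding, errors="replace")


# Stand-ins for a real response: bytes (like r.content) and a charset (like r.encoding)
raw_body = '{"keyword": "爬虫", "count": 2}'.encode("utf-8")
declared_encoding = "utf-8"

text = decode_body(raw_body, declared_encoding)  # analogous to r.text
data = json.loads(text)                          # analogous to r.json()

# Decoding with the wrong charset produces garbled text,
# which is why checking r.encoding is worthwhile when r.text looks wrong
garbled = decode_body(raw_body, "latin-1")

print(text)
print(data["count"])
```

If r.text shows garbled characters on a real page, assigning the correct charset to r.encoding before reading r.text is the usual remedy.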
Passing URL Parameters
https://so..net/so/search?q=爬虫
import requests
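# The host in the search URL is elided in the source; substitute the real one
url = 'https://so..net/so/search'
params = {'q': '爬虫'}  # renamed from dict to avoid shadowing the built-in
r = requests.get(url, params=params)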
print(r.url)
print(r.text)
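The following sketch shows what happens to the params dictionary before the request is sent: the key/value pairs are percent-encoded and appended to the URL as a query string. It uses the standard library's urlencode rather than Requests itself, and both build_url and the example.com host are hypothetical names chosen for illustration.

```python
from urllib.parse import urlencode


def build_url(base_url: str, params: dict) -> str:
    """Append params to base_url as a percent-encoded query string."""
    query = urlencode(params)  # non-ASCII characters are encoded as UTF-8 bytes
    if not query:
        return base_url
    return base_url + "?" + query


# example.com is a placeholder host, not the site used in these notes
example = build_url("https://example.com/so/search", {"q": "爬虫"})
print(example)
```

Printing r.url after a real requests.get call with params is a quick way to confirm the final URL.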
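Customizing Request Headers
Headers carry information about the request, the response, or other entities being sent.
You can inspect the request headers in the browser's developer tools.
Setting request headers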
import requests
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
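'Host': 'so..net',  # host as written in the search URL above (domain elided in the source)
}
url = 'https://so..net/so/search'  # host elided in the source; substitute the real one
params = {'q': '爬虫'}
r = requests.get(url, params=params, headers=headers)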
print(r.status_code)
A status code of 200 means the request succeeded.
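Status codes other than 200 are worth telling apart as well. The sketch below uses the standard library's http.HTTPStatus to turn a numeric code into a readable description; describe_status is a hypothetical helper, not something Requests provides. Requests does offer r.raise_for_status(), which raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status

    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "nonstandard"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```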
POST Requests
Change get to post.
Change the params argument to data.
import requests
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
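'Host': 'so..net',  # value truncated in the source; host as written in the search URL above
}
url = 'https://so..net/so/search'  # host elided in the source; substitute the real one
data = {'q': '爬虫'}  # POST sends the fields in the request body via data, not params
r = requests.post(url, data=data, headers=headers)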
print(r.status_code)
print(r.text)
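The practical difference between params and data is where the fields travel: in the URL for GET, in the request body for POST. The sketch below illustrates this with the standard library only; describe_request and the example.com host are hypothetical, and the form encoding shown is the default for a plain dictionary passed as data, to the best of my understanding.

```python
from urllib.parse import urlencode


def describe_request(method: str, url: str, fields: dict) -> dict:
    """Show where form fields end up for a GET versus a POST request."""
    encoded = urlencode(fields)
    if method.upper() == "GET":
        # GET: fields become the query string; there is no body
        return {"url": f"{url}?{encoded}", "body": None}
    # POST: the URL is unchanged; fields are form-encoded into the body
    return {"url": url, "body": encoded.encode("ascii")}


# example.com is a placeholder host
get_view = describe_request("GET", "https://example.com/search", {"q": "爬虫"})
post_view = describe_request("POST", "https://example.com/search", {"q": "爬虫"})
print(get_view)
print(post_view)
```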
Setting a Timeout
A typical value is 20 seconds.
import requests
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
}
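url = 'http://www.baidu.com/'  # assumption: the source omits the URL here; reusing the first example's
r = requests.get(url, headers=headers, timeout=20)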
print(r.status_code)
If the request takes longer than the timeout, an exception (requests.exceptions.Timeout) is raised.
import requests
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
}
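url = 'http://www.baidu.com/'  # assumption: the source omits the URL here; reusing the first example's
r = requests.get(url, headers=headers, timeout=0.001)  # deliberately tiny timeout to trigger the exception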
print(r.status_code)
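Because a timeout raises an exception rather than returning a response, a scraper usually wraps the call in try/except. The following is a minimal sketch of one way to retry after a timeout. fetch_with_retry and flaky_fetch are hypothetical names, and the built-in TimeoutError stands in for the real exception so the example runs without a network; with Requests one would pass timeout_errors=(requests.exceptions.Timeout,) and a fetch such as lambda: requests.get(url, headers=headers, timeout=20).

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=0.0, timeout_errors=(TimeoutError,)):
    """Call fetch() up to `attempts` times, retrying only on timeout errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except timeout_errors as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # pause before the next try
    raise last_error  # every attempt timed out


# Stand-in for a slow server: times out twice, then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "page body"


result = fetch_with_retry(flaky_fetch, attempts=3)
print(result, calls["count"])
```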
Hands-on: Scraping the Douban Top 250 Movies
import requests
from bs4 import BeautifulSoup

headers={
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0',
'Host':'movie.douban.com',
'Cookie':'ll="118267"; bid=GDklS76UbIM; dbcl2="251015641:WiqhW39971k"; ck=BRDD; push_noty_num=0; push_doumail_num=0'
}
movie_list = []
for i in range(0, 10):
    # Each page lists 25 movies; start is the offset of the first movie on the page
    url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
    if i == 0:
        url = 'https://movie.douban.com/top250'
    r = requests.get(url=url, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    # Each movie title sits inside a <div class="hd">; its first <span> holds the main title
    title_divs = soup.find_all('div', 'hd')
    for div in title_divs:
        spans = div.find_all('span')
        movie_list.append(spans[0].text)
print(movie_list)
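The pagination logic in the loop above can be isolated and checked without sending any requests. This sketch builds the same ten page URLs, keeping the bare URL for the first page as the original code does; top250_page_urls and its page_size and total parameters are hypothetical names introduced for illustration.

```python
def top250_page_urls(page_size=25, total=250):
    """Return the list-page URLs, one per block of page_size movies."""
    base = "https://movie.douban.com/top250"
    urls = []
    for offset in range(0, total, page_size):
        if offset == 0:
            urls.append(base)  # first page: bare URL, as in the loop above
        else:
            urls.append(f"{base}?start={offset}&filter=")
    return urls


urls = top250_page_urls()
print(len(urls))
print(urls[1])
print(urls[-1])
```

Generating the URLs up front makes it easy to verify the offsets (0, 25, ..., 225) before running the actual scrape.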