This article is aimed at readers who already have some web-scraping experience and serves as a quick refresher on the common network-request operations.
GET request without parameters
>>> import requests
>>> response = requests.get('http://www.baidu.com')  # send a GET request
>>> response.encoding = 'utf-8'  # decode the response body as UTF-8
>>> response.status_code  # response status code
200
>>> response.url  # the requested URL
'http://www.baidu.com/'
>>> response.headers  # response headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 08 Dec 2020 15:23:25 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
>>> response.cookies  # cookies returned by the server
>>> response.content  # response body as bytes
>>> response.text  # response body decoded as text
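If the encoding declared in the headers is unreliable, requests can also guess the charset from the body itself via response.apparent_encoding. A minimal sketch, reusing the Baidu homepage from above:
import requests

response = requests.get('http://www.baidu.com', timeout=5)
# apparent_encoding inspects the response body to guess the charset,
# which is often more reliable than the headers for Chinese pages.
response.encoding = response.apparent_encoding
print(response.status_code, response.encoding)
print(response.text[:200])  # first 200 characters of the decoded HTML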
GET request with parameters
Method 1
Write the parameters directly into the URL: separate the parameters from the URL with ?, and separate multiple parameters with &.
# GET request with parameters
>>> import requests
>>> response = requests.get('http://httpbin.org/get?name=Jim&age=18')  # parameters written directly into the URL
>>> print(response.text)
{
  "args": {
    "age": "18",
    "name": "Jim"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "origin": "119.123.196.0, 119.123.196.0",
  "url": "https://httpbin.org/get?name=Jim&age=18"
}
The site 'http://httpbin.org/get' is a request-testing service that can simulate the behaviour of all kinds of web requests.
Method 2
Put the parameters in a dict and pass that dict as the params argument when sending the request.
>>> data = {'name': 'tom','age': 20}
>>> response = requests.get('http://httpbin.org/get', params=data)
>>> print(response.text)
{
  "args": {
    "age": "20",
    "name": "tom"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "origin": "119.123.196.0, 119.123.196.0",
  "url": "https://httpbin.org/get?name=tom&age=20"
}
# search Baidu for several keywords
>>> words = ['python', 'java', 'c', 'matlab']
>>> for word in words:
...     data = {'wd': word}
...     response = requests.get('http://www.baidu.com/s?', params=data)
...     print(response.url)
http://www.baidu.com/s?wd=python
http://www.baidu.com/s?wd=java
http://www.baidu.com/s?wd=c
http://www.baidu.com/s?wd=matlab
POST request
# POST request, using the post method
>>> import requests
>>> data = {'name':'jim','age':'18'}
>>> response = requests.post('http://httpbin.org/post', data=data)
>>> print(response.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "18",
    "name": "jim"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "15",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "119.123.196.0, 119.123.196.0",
  "url": "https://httpbin.org/post"
}
>>> response.url
'http://httpbin.org/post'
For a POST request, the data argument can be a dictionary, a list or tuple of key-value pairs, or a JSON-formatted string. Note that GET passes its parameters through params, while POST uses data.
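For instance, a short sketch against httpbin.org showing data given as a list of 2-tuples and, separately, the json parameter for sending a JSON body:
import requests

# data as a list of 2-tuples produces the same form-encoded body as a dict
response_form = requests.post('http://httpbin.org/post',
                              data=[('name', 'jim'), ('age', '18')])
print(response_form.json()['form'])   # echoed back as form fields

# the json parameter serialises the dict and sets Content-Type: application/json
response_json = requests.post('http://httpbin.org/post',
                              json={'name': 'jim', 'age': 18})
print(response_json.json()['json'])   # echoed back under the "json" key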
Request headers and Cookie-based login
Open 'https://www.baidu.com/' in Chrome and press F12 to bring up the developer tools.
Cookies are what let many sites log users in automatically: on a later visit, the logged-in page can be reached without entering the username and password again.
To log in with cookies via the requests module, first locate the cookie string in the developer tools, then parse it and add it to a RequestsCookieJar object, and finally pass that RequestsCookieJar as the cookies argument of the request.
The following demonstrates how to process the cookie string.
>>> import requests
>>> url = 'https://www.baidu.com'
>>> header = {'User-Agent': 'fill in your User-Agent here',
...           'Referer': 'https://www.google.com.hk/',
...           'Host': 'www.baidu.com'}
>>> cookies = 'fill in your cookie string here'
>>> cookies_jar = requests.cookies.RequestsCookieJar()
>>> for cookie in cookies.split(';'):
...     key, value = cookie.split('=', 1)
...     cookies_jar.set(key, value)  # store each cookie in the RequestsCookieJar
>>> response = requests.get(url, headers=header, cookies=cookies_jar)
# output omitted
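One rough way to confirm that the cookies actually produced a logged-in page is to look for text that only appears after login; continuing with the response above, and using a purely hypothetical placeholder string:
>>> # 'your_username' is a placeholder for text that only appears when logged in
>>> if 'your_username' in response.text:
...     print('Cookie login succeeded')
... else:
...     print('Cookie login failed; the cookies may have expired')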
Session requests
A Session behaves like opening a new tab in the same browser: the first request carries the login information, and a second request, building on the first, can fetch logged-in pages without supplying the cookie information again.
>>> import requests
>>> s = requests.Session()
>>> data = {'username': '云朵', 'password': '云朵'}  # login information
>>> response = s.get(url, headers=header, cookies=cookies_jar)  # url, header and cookies_jar from the previous example
>>> response2 = s.get(url)  # the session now carries the cookies automatically
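Since the snippet above leans on the url, header and cookies_jar defined earlier, here is a self-contained sketch against httpbin.org showing that a Session carries cookies across requests on its own:
import requests

s = requests.Session()
# the first request asks the server to set a cookie
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# the second request sends that cookie back automatically
response = s.get('http://httpbin.org/cookies')
print(response.text)   # {"cookies": {"sessioncookie": "123456789"}}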
Login authentication with the auth parameter
Some pages can only be accessed after providing a username and password; the requests module handles this as follows:
>>> import requests
>>> from requests.auth import HTTPBasicAuth
>>> url='https://www.baidu.com'
>>> ah = HTTPBasicAuth('云朵', 'STUDIO')  # create an HTTPBasicAuth object from the username and password
>>> response = requests.get(url=url, auth=ah)
>>> if response.status_code == 200:
...     print(response.text)
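Since www.baidu.com does not actually require HTTP Basic authentication, a hedged end-to-end sketch can use httpbin.org's basic-auth endpoint, which returns 200 only when the supplied credentials match those in the URL:
import requests
from requests.auth import HTTPBasicAuth

url = 'http://httpbin.org/basic-auth/user/passwd'
response = requests.get(url, auth=HTTPBasicAuth('user', 'passwd'))
print(response.status_code)   # 200
print(response.json())        # {'authenticated': True, 'user': 'user'}

# shorthand: a (username, password) tuple is treated as HTTPBasicAuth
response = requests.get(url, auth=('user', 'passwd'))
print(response.status_code)   # 200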
Recognising the network exception classes
requests provides three common exception classes for catching network errors.
>>> import requests
>>> # import the three exception classes
>>> from requests.exceptions import ReadTimeout, HTTPError, RequestException
>>> for _ in range(0, 11):
...     try:
...         # request code
...         url = 'https://www.baidu.com'
...         # set the timeout to 0.1 seconds
...         response = requests.get(url=url, timeout=0.1)
...         print(response.status_code)
...     except ReadTimeout:
...         print('readtimeout')
...     except HTTPError:
...         print('httperror')
...     except RequestException:
...         print('reqerror')
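Note that requests does not raise HTTPError on its own for 4xx/5xx responses; calling response.raise_for_status() converts such status codes into an exception so the HTTPError branch can actually trigger. A minimal sketch, where the path below is made up and simply expected to return 404:
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    # a hypothetical path that should come back with a 404 status
    response = requests.get('https://www.baidu.com/this-page-does-not-exist', timeout=5)
    response.raise_for_status()   # raises HTTPError for 4xx/5xx status codes
except ReadTimeout:
    print('readtimeout')
except HTTPError as e:
    print('httperror:', e)
except RequestException as e:
    print('reqerror:', e)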
Setting a proxy
While scraping, your IP address often ends up blocked by the target server; a proxy service can work around this.
A proxy address has the form IP:port, e.g. 168.193.1.1:5000.
>>> import requests
>>> header = {'User-Agent': 'fill in your User-Agent here'}
>>> # set the proxy IP and its port
>>> proxy = {'http': 'http://117.88.176.38:3000',
...          'https': 'https://117.88.176.38:3000'}
>>> try:
...     # send the request; verify=False skips verification of the server's SSL certificate
...     response = requests.get('http://202020.ip138.com', headers=header,
...                             proxies=proxy, verify=False)
...     print(response.status_code)  # print the status code
... except Exception as e:
...     print('Error information is', e)  # print the exception information
Error information is HTTPConnectionPool(host='117.88.176.38', port=3000): Max retries exceeded with url: http://202020.ip138.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11313cf10>: Failed to establish a new connection: [Errno 61] Connection refused')))
Getting free proxies quickly
To get free proxy IPs, first find a page that lists free proxies, then scrape it and save the results to a file.
import requests
from lxml import etree
import pandas as pd

ip_list = []

def get_ip(url, headers):
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        html = etree.HTML(response.text)  # parse the HTML
        # get every tag that contains an IP entry
        li_all = html.xpath('//li[@class="f-list col-lg-12 col-md-12 col-sm-12 col-xs-12"]')
        for i in li_all:
            ip = i.xpath('span[@class="f-address"]/text()')[0]  # the IP address
            port = i.xpath('span[@class="f-port"]/text()')[0]   # the port
            ip_list.append(ip + ':' + port)  # combine IP and port and add to the list
            print("agency ip is:", ip, 'port number is:', port)

headers = {'User-Agent': 'fill in your User-Agent here'}

if __name__ == '__main__':
    ip_table = pd.DataFrame(columns=['ip'])
    for i in range(1, 5):
        url = 'https://www.dieniao.com/FreeProxy/{page}.html'.format(page=i)
        get_ip(url, headers)
    ip_table['ip'] = ip_list
    ip_table.to_excel('ip.xlsx', sheet_name='ip_port')
The console output is as follows:
agency ip is: 122.5.109.19 port number is: 9999
agency ip is: 122.5.108.245 port number is: 9999
agency ip is: 171.35.147.86 port number is: 9999
agency ip is: 171.35.171.111 port number is: 9999
agency ip is: 118.212.106.113 port number is: 9999
...
Checking whether a proxy IP works
Not every free proxy IP actually works; paid proxies, of course, usually do.
The usual way to test the harvested free proxies is to read them back from the file, loop over them, and send a request through each one; if the request succeeds, that proxy is usable.
import requests
import pandas as pd
from lxml import etree

ip_table = pd.read_excel('ip.xlsx')
ip = ip_table['ip']
headers = {'User-Agent': 'fill in your User-Agent here'}

# loop over the proxy IPs and send a request through each one
for i in ip:
    proxies = {'http': 'http://{ip}'.format(ip=i),
               'https': 'https://{ip}'.format(ip=i)}
    try:
        response = requests.get('https://site.ip138.com/',
                                headers=headers, proxies=proxies, timeout=2)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            html = etree.HTML(response.text)
            info = html.xpath('/html/body/p[1]/text()')
            print(info)
    except Exception as e:
        pass
        # print('Error information is:', e)
The two steps can also be merged: validate each free proxy IP right after it is scraped, and save only the ones that work for later use.
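A sketch of that combined flow, under a few assumptions: is_valid and valid_ip.xlsx are hypothetical names, ip_list is expected to be filled by the get_ip() scraping step above, and site.ip138.com is used only as a reachability check:
import requests
import pandas as pd

headers = {'User-Agent': 'fill in your User-Agent here'}

def is_valid(proxy, timeout=2):
    """Return True if a simple request through the proxy succeeds in time."""
    proxies = {'http': 'http://{ip}'.format(ip=proxy),
               'https': 'https://{ip}'.format(ip=proxy)}
    try:
        response = requests.get('https://site.ip138.com/', headers=headers,
                                proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except Exception:
        return False

if __name__ == '__main__':
    ip_list = []   # e.g. populated by get_ip(url, headers) from the scraping step
    valid_ips = [ip for ip in ip_list if is_valid(ip)]   # keep only working proxies
    pd.DataFrame({'ip': valid_ips}).to_excel('valid_ip.xlsx', sheet_name='ip_port')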
A small example
Extract the area and price of Lianjia rental listings and sort them
# full code
import requests
import re
import pandas as pd
url = 'https://sz.lianjia.com/zufang/'
header = {'User-Agent': ''}  # fill in a User-Agent here
response = requests.get(url, headers=header)
html = response.text
nameregex = re.compile(r'alt="(.*?)" \n')
name = nameregex.findall(html)      # all listing names
arearegex = re.compile(r'([0-9.]+)㎡\n')
area = arearegex.findall(html)      # all areas
priceregex = re.compile(r'<em>([0-9.]+)</em> 元/月')
price = priceregex.findall(html)    # all prices
datalist = []
for i in range(len(name)):
    datalist.append([name[i], float(area[i]), float(price[i])])
df = pd.DataFrame(datalist, columns=['name', 'area', 'price']).set_index('name')
df.sort_values('area')   # sort by area
df.sort_values('price')  # sort by price
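One small caveat: sort_values returns a new DataFrame rather than sorting df in place, so keep the result if the ordering is to be reused. A short follow-up sketch continuing from the code above:
# sort_values returns a new, sorted DataFrame; df itself is unchanged
cheapest = df.sort_values('price')
print(cheapest.head())                                  # five cheapest listings
print(df.sort_values('area', ascending=False).head())   # five largest listings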