Fetching Proxies with Python, Step by Step
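Step 1
Import the libraries:
import requests
import xml.etree.ElementTree as ET
Notes:
requests: an HTTP library, used here to call the API URL
xml.etree.ElementTree: used to parse the response when the API returns XML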
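Step 2
Build the request parameters
Arguments={
    "https":input("HTTPS support (0 = any, 1 = HTTPS proxies only): "),
    "type":input("Proxy type (0 = any, 1 = transparent, 2 = anonymous, 3 = high anonymity): "),
    "format":input("Response format (text, json, or xml): "),
    "token":"YOUR_TOKEN"  # replace with your own account token
}
Note:
If you do not have a token, register an account at proxy.newday.me, then put your account's token value here.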
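To see what the parameter dictionary turns into on the wire, the sketch below builds the query string with the standard library instead of sending a request. The values are hypothetical stand-ins for what a user might type, and YOUR_TOKEN is a placeholder. requests encodes a params dictionary in much the same way when it builds the final URL.

```python
from urllib.parse import urlencode

# Hypothetical values standing in for what the user would type at the prompts.
arguments = {
    "https": "1",
    "type": "3",
    "format": "json",
    "token": "YOUR_TOKEN",  # placeholder, not a real token
}

base_url = "http://api.newday.me/proxy/extract"

# urlencode joins the key/value pairs with "&" and escapes special characters.
full_url = base_url + "?" + urlencode(arguments)
print(full_url)
```

Printing the URL like this is a quick way to check the parameters before spending a real API call.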
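Step 3
Check which response format the user chose.
Check for json first:
if Arguments["format"]=="json":
Send the request:
    Response=requests.get("http://api.newday.me/proxy/extract",params=Arguments).json()
Notes:
Arguments: the parameters defined earlier, sent as the URL query string
json(): decodes the JSON response body into Python objects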
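The format check relies on the user typing exactly text, json, or xml, and anything else silently falls through every branch. A minimal sketch of validating the input first is shown below. The helper name normalize_format and the ALLOWED_FORMATS set are hypothetical and not part of the original script.

```python
# The three format names accepted by the prompts in this article.
ALLOWED_FORMATS = {"text", "json", "xml"}


def normalize_format(raw):
    """Return the stripped, lowercased format name, or raise ValueError."""
    value = raw.strip().lower()
    if value not in ALLOWED_FORMATS:
        raise ValueError("unsupported format: " + repr(raw))
    return value


print(normalize_format(" JSON "))
```

Calling this on the value read by input() before building the request would turn a typo into an immediate, readable error.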
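Step 4
Parse the response (JSON data)
    Data=Response["data"]["list"]
    Datum=[]
    for i in range(len(Data)):
        Datum.append(Data[i])
Notes:
Response: the decoded response from the request above
Data: the list of all proxy records returned by the API
Datum: a list copy of Data; each element holds the details of one proxy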
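The parsing step can be tried offline against a sample payload. In the sketch below only the data -> list nesting and the field names come from this article; the values are made up, and the addresses are from the reserved documentation range.

```python
import json

# Made-up sample payload; the real API response may carry extra fields.
sample = """
{"data": {"list": [
    {"ip": "192.0.2.10", "port": 8080, "type": 3, "https": 1,
     "duration": "1h", "percent": 95, "time": "2020-01-01 00:00:00"},
    {"ip": "192.0.2.11", "port": 3128, "type": 2, "https": 0,
     "duration": "30m", "percent": 80, "time": "2020-01-01 00:05:00"}
]}}
"""

response = json.loads(sample)

# list() makes a shallow copy, which is what the append loop above does by hand.
proxies = list(response["data"]["list"])

for index, proxy in enumerate(proxies):
    print(index, proxy["ip"], proxy["port"])
```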
Step 5
Write the data to files (the ./Proxys directory must already exist)
    for i in range(len(Datum)):
        Proxy=Datum[i]  # one proxy record; Data is no longer overwritten inside the loop
        with open("./Proxys/"+str(i)+".txt","w") as f:
            f.write("ID:"+str(i)+"\n")
            f.write("Address:"+str(Proxy["ip"])+"\n")
            f.write("Port:"+str(Proxy["port"])+"\n")
            f.write("Type:"+str(Proxy["type"])+"\n")
            f.write("Https:"+str(Proxy["https"])+"\n")
            f.write("Duration:"+str(Proxy["duration"])+"\n")
            f.write("Percent:"+str(Proxy["percent"])+"\n")
            f.write("Time:"+str(Proxy["time"])+"\n")
        print("Record {} written!".format(i+1))
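Opening a file under ./Proxys fails if that directory does not exist. The sketch below wraps the write in a hypothetical helper, write_proxy, that creates the directory first and tolerates missing fields. The helper, the FIELDS table, and the demo record are illustrative assumptions, and the demo writes to a temporary directory so it can run anywhere.

```python
import os
import tempfile

# (label written to the file, key in the proxy record), same order as above.
FIELDS = [
    ("Address", "ip"), ("Port", "port"), ("Type", "type"), ("Https", "https"),
    ("Duration", "duration"), ("Percent", "percent"), ("Time", "time"),
]


def write_proxy(directory, index, proxy):
    """Write one proxy record to <directory>/<index>.txt and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, str(index) + ".txt")
    lines = ["ID:" + str(index)]
    for label, key in FIELDS:
        # str() avoids a TypeError when a value is a number; missing keys become "".
        lines.append(label + ":" + str(proxy.get(key, "")))
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return path


demo_dir = os.path.join(tempfile.mkdtemp(), "Proxys")
demo_path = write_proxy(demo_dir, 0, {"ip": "192.0.2.10", "port": 8080})
with open(demo_path, encoding="utf-8") as f:
    demo_text = f.read()
print(demo_text)
```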
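Step 6
Check the response format (text)
elif Arguments["format"]=="text":
Send the request:
    Response=requests.get("http://api.newday.me/proxy/extract",params=Arguments).text
Format the data (convert it to a list)
    Datum=Response.split()
Notes:
Response: the raw text body, expected to hold whitespace-separated address:port entries
Datum: the list of entries, produced by splitting Response on whitespace
Step 7
Save the data to files:
    for i in range(len(Datum)):
        a=Datum[i]
        b=a.split(":")
        with open("./Proxys/"+str(i)+".txt","w") as f:
            f.write("Address:"+b[0]+"\n")
            f.write("Port:"+b[1]+"\n")
        print("Record {} written!".format(i+1))
Notes:
i: the index of the current entry in Datum
a: the i-th entry in Datum
b: a split on the colon, so b[0] is the address and b[1] is the port
The split and the write must happen in the same loop. If the split runs in a separate loop first, only the last entry survives in b, and every file ends up with the wrong contents.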
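The text format handling can be checked without calling the API. The sketch below assumes the body is a list of address:port entries separated by whitespace or newlines; the helper name parse_proxy_lines and the sample string are hypothetical.

```python
def parse_proxy_lines(text):
    """Split whitespace-separated "address:port" entries into (address, port) tuples."""
    pairs = []
    for entry in text.split():
        # rpartition splits on the last colon only.
        address, sep, port = entry.rpartition(":")
        if not sep or not address or not port:
            continue  # skip anything that is not address:port
        pairs.append((address, port))
    return pairs


# Made-up sample body with mixed line endings and one malformed entry.
sample_text = "192.0.2.10:8080\r\n192.0.2.11:3128\nnot-a-proxy\n"
pairs = parse_proxy_lines(sample_text)
print(pairs)
```

Skipping malformed entries keeps one bad line, such as an error message returned in place of a proxy, from raising an IndexError halfway through the loop.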
Step 8, the main focus of this article
1. Check the response format (xml)
elif Arguments["format"]=="xml":
2. Fetch the data
    Response=requests.get("http://api.newday.me/proxy/extract",params=Arguments).text
3. Parse the XML text into an element tree
    root=ET.fromstring(Response)
4. Define the variables
    Data=[]
    i=0
Notes:
Data: stores the parsed item elements
i: the loop counter, also used as the file ID
5. Collect the items into a list
    for item in root.iterfind("data/list/item"):
        Data.append(item)
6. Save the data
    for item in Data:
        ip=item.findtext("ip")
        port=item.findtext("port")
        proxy_type=item.findtext("type")  # renamed so the built-in type() is not shadowed
        https=item.findtext("https")
        duration=item.findtext("duration")
        percent=item.findtext("percent")
        proxy_time=item.findtext("time")
        with open("./Proxys/"+str(i)+".txt","w") as f:
            f.write("ID:"+str(i)+"\n")
            f.write("Address:"+str(ip)+"\n")
            f.write("Port:"+str(port)+"\n")
            f.write("Type:"+str(proxy_type)+"\n")
            f.write("Https:"+str(https)+"\n")
            f.write("Duration:"+str(duration)+"\n")
            f.write("Percent:"+str(percent)+"\n")
            f.write("Time:"+str(proxy_time)+"\n")
        print("Record {} written!".format(i+1))
        i+=1
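The XML lookups can be tried on a small document first. In the sketch below the data/list/item path and the field names come from this article, while the root element name response and all values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the second item deliberately has no <type> element.
sample_xml = """<response>
  <data>
    <list>
      <item><ip>192.0.2.10</ip><port>8080</port><type>3</type></item>
      <item><ip>192.0.2.11</ip><port>3128</port></item>
    </list>
  </data>
</response>"""

root = ET.fromstring(sample_xml)

rows = []
for item in root.iterfind("data/list/item"):
    rows.append({
        "ip": item.findtext("ip"),
        "port": item.findtext("port"),
        # findtext returns None for a missing child unless a default is given.
        "type": item.findtext("type", default="unknown"),
    })

for row in rows:
    print(row)
```

Note that findtext always returns strings (or the default), so numeric fields such as port need int() if they are used as numbers later.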
Source code
import requests
import xml.etree.ElementTree as ET

# The ./Proxys directory must exist before running this script.
Arguments={
    "https":input("HTTPS support (0 = any, 1 = HTTPS proxies only): "),
    "type":input("Proxy type (0 = any, 1 = transparent, 2 = anonymous, 3 = high anonymity): "),
    "format":input("Response format (text, json, or xml): "),
    "token":"YOUR_TOKEN"  # replace with your own account token
}
if Arguments["format"]=="json":
    Response=requests.get("http://api.newday.me/proxy/extract",params=Arguments).json()
    Data=Response["data"]["list"]
    Datum=[]
    for i in range(len(Data)):
        Datum.append(Data[i])
    for i in range(len(Datum)):
        Proxy=Datum[i]  # one proxy record
        with open("./Proxys/"+str(i)+".txt","w") as f:
            f.write("ID:"+str(i)+"\n")
            f.write("Address:"+str(Proxy["ip"])+"\n")
            f.write("Port:"+str(Proxy["port"])+"\n")
            f.write("Type:"+str(Proxy["type"])+"\n")
            f.write("Https:"+str(Proxy["https"])+"\n")
            f.write("Duration:"+str(Proxy["duration"])+"\n")
            f.write("Percent:"+str(Proxy["percent"])+"\n")
            f.write("Time:"+str(Proxy["time"])+"\n")
        print("Record {} written!".format(i+1))
elif Arguments["format"]=="text":
    Response=requests.get("http://api.newday.me/proxy/extract",params=Arguments).text
    Datum=Response.split()  # one address:port entry per element
    for i in range(len(Datum)):
        a=Datum[i]
        b=a.split(":")  # b[0] is the address, b[1] is the port
        with open("./Proxys/"+str(i)+".txt","w") as f:
            f.write("Address:"+b[0]+"\n")
            f.write("Port:"+b[1]+"\n")
        print("Record {} written!".format(i+1))
elif Arguments["format"]=="xml":
    Response=requests.get("http://api.newday.me/proxy/extract",params=Arguments).text
    root=ET.fromstring(Response)
    Data=[]
    i=0
    for item in root.iterfind("data/list/item"):
        Data.append(item)
    for item in Data:
        ip=item.findtext("ip")
        port=item.findtext("port")
        proxy_type=item.findtext("type")  # avoids shadowing the built-in type()
        https=item.findtext("https")
        duration=item.findtext("duration")
        percent=item.findtext("percent")
        proxy_time=item.findtext("time")
        with open("./Proxys/"+str(i)+".txt","w") as f:
            f.write("ID:"+str(i)+"\n")
            f.write("Address:"+str(ip)+"\n")
            f.write("Port:"+str(port)+"\n")
            f.write("Type:"+str(proxy_type)+"\n")
            f.write("Https:"+str(https)+"\n")
            f.write("Duration:"+str(duration)+"\n")
            f.write("Percent:"+str(percent)+"\n")
            f.write("Time:"+str(proxy_time)+"\n")
        print("Record {} written!".format(i+1))
        i+=1