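1. What is Ajax data scraping?
When we fetch a page with requests, the HTML source we get back can differ from what the browser shows: the data is visible on the rendered page but missing from the response. This happens because the data is loaded asynchronously through Ajax. The original page does not contain it; once the page has finished loading, it requests an interface on the server, and only then is the returned data rendered on the page. That follow-up call is an Ajax request.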
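2. How do we scrape it?
The data can be scraped with either the requests or the urllib library, by simulating the interface request and extracting the information we want from the response.
In code, reproduce what the page sends (the URL, headers, parameters, and request method) and send that request to the server interface to get the data. If the data is paginated, the parameters usually include a page parameter as well; loop over the page numbers with range to fetch every page you need.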
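The paging loop described above might look like the following sketch. The endpoint URL, the `page` parameter name, and the helper `build_page_urls` are assumptions made for illustration; the real names come from the request shown in the browser's network panel. The import fallback lets the same code run on Python 3 and on Python 2.7.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


def build_page_urls(base_url, params, first_page, last_page):
    """Return one URL per page number, first_page..last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        query = dict(params)   # copy so the caller's dict is not modified
        query['page'] = page   # 'page' is an assumed parameter name
        # Sort the items so the query string has a stable order
        urls.append(base_url + '?' + urlencode(sorted(query.items())))
    return urls


# Hypothetical endpoint and parameters, for illustration only
urls = build_page_urls('http://example.com/api/list', {'type': '1'}, 1, 3)
```

Each URL in `urls` can then be fetched with `requests.get(url, headers=headers)`, using headers copied from the page's own request.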
3. The code:
3-1: URL: http://58.18.38.116/erds_mt/ext/index/index.jsp
3-2: The interface returns XML.
3-3: Python version: 2.7
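Before the full script, here is a minimal sketch of the XML handling it relies on: reading every `<set>` element and its `label` and `value` attributes with `minidom`, then splitting the flat list into fixed-size groups. The sample XML, the root element name `chart`, and the helpers `parse_sets` and `chunk` are assumptions for illustration; the real response contains many more elements and may be structured differently.

```python
from xml.dom import minidom

# Hypothetical response body; the real interface returns many more <set> elements
SAMPLE_XML = (
    '<chart>'
    '<set label="A" value="1.5"/>'
    '<set label="B" value="2.0"/>'
    '<set label="C" value="3.5"/>'
    '<set label="D" value="4.0"/>'
    '</chart>'
)


def parse_sets(xml_text):
    """Return (label, value) pairs for every <set> element, in document order."""
    dom = minidom.parseString(xml_text)
    return [(node.getAttribute('label'), float(node.getAttribute('value')))
            for node in dom.getElementsByTagName('set')]


def chunk(items, size):
    """Split a list into consecutive groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


pairs = parse_sets(SAMPLE_XML)
groups = chunk(pairs, 2)
```

The script below indexes the `<set>` elements by position and assumes four consecutive groups of nine regions; with a helper like this, the same grouping would be `chunk(values, 9)`.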
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib import urlencode
from xml.dom import minidom
from xlwt import *
import numpy as np
import MySQLdb
import datetime
import time
import sys
reload(sys)
sys.setdefaultencoding('utf8')
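# Lists that hold the scraped values
# Yesterday's lump coal
yesterdayKuaiMeiList = []
# Yesterday's blended coal
yesterdayHunMeiList = []
# Yesterday's other coal products
yesterdayOtherMeiList = []
# Yesterday's powdered coal
yesterdayPinkMeiList = []
# Regions
diquList = []
# Sales volumes
sellList = []
# Provinces (the set of provinces can change)
shengList = []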
# # Otog Banner
# OtogBannerList = []
# # Otog Front Banner
# OtogFrontBanner = []
# DaladBanner = []
# WushenCounty = []
# JungarBanner = []
# FullownedcentralenterpriseList = []
# ikinholoList = []
# DongshengDistrict = []
# HangjinBanner = []
# Send the request and return the response (HTML or XML source)
def get_html(url):
    # Set browser-like headers, copied from the request the page itself sends
    headers = {
        'Host': '58.18.38.116',
        'Referer': 'http://58.18.38.116/erds_mt/indexAction.do?methodName=query_yxxx',
        'X-Requested-With': 'ShockwaveFlash/32.0.0.192',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    # Proxy fallback (disabled). Free proxies usually stop working after a short while; replace the address with your own.
    # proxies = {'http': '111.23.10.27:8080'}
    # try:
    #     print ("-------------------------")
    #     # GET request via the Requests library
    #     print "uuuu :" + url
    #
    #     print ('proxy not used')
    # except:
    #     print ('proxy used')
    #     # If the request is blocked, retry through the proxy
    #     resp = requests.get(url, headers=headers, proxies=proxies)
    return resp
# Fetch the data shown on the Ordos coal road production-and-sales page
def get_data():
    # Mimic the query parameters the page sends
    params = {
        'methodName': 'query_mtcx_xlxx_tb_print',
        'FCTime': '726'
    }
    # Build the interface URL the page calls; urlencode serializes the query parameters
    url = 'http://58.18.38.116/erds_mt/indexAction.do?' + urlencode(params)
    resp = get_html(url)
    # Print the status code. On 200, print the body and extract the values; connection errors are reported by the except clause
    print 'Request status: ', resp.status_code
    try:
        if resp.status_code == 200:
            xmlResponse = resp.text
            print xmlResponse
            # Parse the XML
            dom = minidom.parseString(xmlResponse)
            value = dom.getElementsByTagName('set')
            print 'Otog Banner, lump coal yesterday:', value[0].getAttribute('value')
            print 'Otog Front Banner, lump coal yesterday:', value[1].getAttribute('value')
            print 'Dalad Banner, lump coal yesterday:', value[2].getAttribute('value')
            print 'Wushen County, lump coal yesterday:', value[3].getAttribute('value')
            print 'Jungar Banner, lump coal yesterday:', value[4].getAttribute('value')
            print 'Full-owned central enterprise, lump coal yesterday:', value[5].getAttribute('value')
            print 'ikinholo, lump coal yesterday:', value[6].getAttribute('value')
            print 'Dongsheng District, lump coal yesterday:', value[7].getAttribute('value')
            print 'Hangjin Banner, lump coal yesterday:', value[8].getAttribute('value')
            # Store each region's lump coal value in yesterdayKuaiMeiList
            for i in range(9):
                yesterdayKuaiMeiList.append(value[i].getAttribute('value'))
            print '======================================================='
            print 'Otog Banner, blended coal yesterday:', value[9].getAttribute('value')
            print 'Otog Front Banner, blended coal yesterday:', value[10].getAttribute('value')
            print 'Dalad Banner, blended coal yesterday:', value[11].getAttribute('value')
            print 'Wushen County, blended coal yesterday:', value[12].getAttribute('value')
            print 'Jungar Banner, blended coal yesterday:', value[13].getAttribute('value')
            print 'Full-owned central enterprise, blended coal yesterday:', value[14].getAttribute('value')
            print 'ikinholo, blended coal yesterday:', value[15].getAttribute('value')
            print 'Dongsheng District, blended coal yesterday:', value[16].getAttribute('value')
            print 'Hangjin Banner, blended coal yesterday:', value[17].getAttribute('value')
            # Store each region's blended coal value in yesterdayHunMeiList
            for i in range(9, 18):
                yesterdayHunMeiList.append(value[i].getAttribute('value'))
            print '======================================================='
            print 'Otog Banner, other coal products yesterday:', value[18].getAttribute('value')
            print 'Otog Front Banner, other coal products yesterday:', value[19].getAttribute('value')
            print 'Dalad Banner, other coal products yesterday:', value[20].getAttribute('value')
            print 'Wushen County, other coal products yesterday:', value[21].getAttribute('value')
            print 'Jungar Banner, other coal products yesterday:', value[22].getAttribute('value')
            print 'Full-owned central enterprise, other coal products yesterday:', value[23].getAttribute('value')
            print 'ikinholo, other coal products yesterday:', value[24].getAttribute('value')
            print 'Dongsheng District, other coal products yesterday:', value[25].getAttribute('value')
            print 'Hangjin Banner, other coal products yesterday:', value[26].getAttribute('value')
            # Store each region's other coal products value in yesterdayOtherMeiList
            for i in range(18, 27):
                yesterdayOtherMeiList.append(value[i].getAttribute('value'))
            print '======================================================='
            print 'Otog Banner, powdered coal yesterday:', value[27].getAttribute('value')
            print 'Otog Front Banner, powdered coal yesterday:', value[28].getAttribute('value')
            print 'Dalad Banner, powdered coal yesterday:', value[29].getAttribute('value')
            print 'Wushen County, powdered coal yesterday:', value[30].getAttribute('value')
            print 'Jungar Banner, powdered coal yesterday:', value[31].getAttribute('value')
            print 'Full-owned central enterprise, powdered coal yesterday:', value[32].getAttribute('value')
            print 'ikinholo, powdered coal yesterday:', value[33].getAttribute('value')
            print 'Dongsheng District, powdered coal yesterday:', value[34].getAttribute('value')
            print 'Hangjin Banner, powdered coal yesterday:', value[35].getAttribute('value')
            # Store each region's powdered coal value in yesterdayPinkMeiList
            for i in range(27, 36):
                yesterdayPinkMeiList.append(value[i].getAttribute('value'))
    except requests.ConnectionError as e:
        print ('Error', e.args)
# Fetch the data shown on the "coal product sales by region" page
def get_data1():
    params = {
        'methodName': 'query_mtcx_xlxx_mtywd_print',
        'type': '1'
    }
    url = 'http://58.18.38.116/erds_mt/indexAction.do?' + urlencode(params)
    resp = get_html(url)
    print 'Request status: ', resp.status_code
    try:
        if resp.status_code == 200:
            xmlResponse = resp.text
            print 'xmlResponseSell : ' , xmlResponse
            dom = minidom.parseString(xmlResponse)
            value = dom.getElementsByTagName('set')
            # These printed names assume a fixed province order; the names actually stored come from the 'label' attribute
            print 'Inner Mongolia:', value[0].getAttribute('value')
            print 'Hebei:', value[1].getAttribute('value')
            print 'Shanxi:', value[2].getAttribute('value')
            print 'Shaanxi:', value[3].getAttribute('value')
            print 'Ningxia:', value[4].getAttribute('value')
            print 'Shandong:', value[5].getAttribute('value')
            print 'Tianjin:', value[6].getAttribute('value')
            print 'Liaoning:', value[7].getAttribute('value')
            print 'Henan:', value[8].getAttribute('value')
            print 'Heilongjiang:', value[9].getAttribute('value')
            for i in range(10):
                sellList.append(value[i].getAttribute('value'))
                shengList.append(value[i].getAttribute('label'))
    except requests.ConnectionError as e:
        print ('Error', e.args)
# Write the lists into the designated sheets of an Excel workbook
def insert_csv():
    print yesterdayKuaiMeiList
    print yesterdayHunMeiList
    print yesterdayOtherMeiList
    print yesterdayPinkMeiList
    w = Workbook(encoding='utf-8')
    ws = w.add_sheet('鄂尔多斯煤炭公路产销')  # sheet: Ordos coal road production and sales
    # First row: coal types
    ws.write(0, 0, '块煤')  # lump coal
    ws.write(0, 13,'混煤')  # blended coal
    ws.write(0, 27, '其他煤品')  # other coal products
    ws.write(0, 40,'粉煤')  # powdered coal
    # Second row: regions
    ws.write(1, 0, 'Date')
    ws.write(1, 1, 'Otog\nBanner ')
    ws.write(1, 2, 'Otog\nFront\nBanner')
    ws.write(1, 3, 'Dalad\nBanner')
    ws.write(1, 4, 'Wushen\nCounty')
    ws.write(1, 5, 'Jungar\nBanner')
    ws.write(1, 6, 'Full-owned\ncentral\nenterprise')
    ws.write(1, 7, 'ikinholo')
    ws.write(1, 8, 'Dongsheng\nDistrict ')
    ws.write(1, 9, 'Hangjin\nBanner')
    ws.write(1, 10, 'TOTAL')
    ws.write(1, 11, 'Lump\ncoal\n5dma')
    ws.write(1, 12, ' ')
    ws.write(1, 13, 'Date')
    ws.write(1, 14, 'Otog\nBanner ')
    ws.write(1, 15, 'Otog\nFront\nBanner')
    ws.write(1, 16, 'Dalad\nBanner')
    ws.write(1, 17, 'Wushen\nCounty')
    ws.write(1, 18, 'Jungar\nBanner')
    ws.write(1, 19, 'Full-owned\ncentral\nenterprise')
    ws.write(1, 20, 'ikinholo')
    ws.write(1, 21, 'Dongsheng\nDistrict')
    ws.write(1, 22, 'Hangjin\nBanner')
    ws.write(1, 23, 'TOTAL')
    ws.write(1, 24, 'powdered\ncoal\n5dma')
    ws.write(1, 26, ' ')
    ws.write(1, 27, 'Date')
    ws.write(1, 28, 'Otog\nBanner ')
    ws.write(1, 29, 'Otog\nFront\nBanner')
    ws.write(1, 30, 'Dalad\nBanner')
    ws.write(1, 31, 'Wushen\nCounty')
    ws.write(1, 32, 'Jungar\nBanner')
    ws.write(1, 33, 'Full-owned\ncentral\nenterprise')
    ws.write(1, 34, 'ikinholo')
    ws.write(1, 35, 'Dongsheng\nDistrict')
    ws.write(1, 36, 'Hangjin\nBanner')
    ws.write(1, 37, 'TOTAL')
    ws.write(1, 38, 'Blending\ncoal\n5dma')
    ws.write(1, 39, ' ')
    ws.write(1, 40, 'Date')
    ws.write(1, 41, 'Otog\nBanner')
    ws.write(1, 42, 'Otog\nFront\nBanner')
    ws.write(1, 43, 'Dalad\nBanner')
    ws.write(1, 44, 'Wushen\nCounty')
    ws.write(1, 45, 'Jungar\nBanner')
    ws.write(1, 46, 'Full-owned\ncentral\nenterprise')
    ws.write(1, 47, 'ikinholo')
    ws.write(1, 48, 'Dongsheng\nDistrict')
    ws.write(1, 49, 'Hangjin\nBanner')
    ws.write(1, 50, 'TOTAL')
    ws.write(1, 51, 'Other\ncoal\n5dma')
    ws.write(1, 52, ' ')
    # Yesterday's date
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    # Total sales for each coal type: [lump, blended, powdered, other, city-wide total]
    List = getZTotal()
    # Lump coal by region
    ws.write(3, 0, yesterday.__str__())
    ws.write(3, 1, yesterdayKuaiMeiList[0])
    ws.write(3, 2, yesterdayKuaiMeiList[1])
    ws.write(3, 3, yesterdayKuaiMeiList[2])
    ws.write(3, 4, yesterdayKuaiMeiList[3])
    ws.write(3, 5, yesterdayKuaiMeiList[4])
    ws.write(3, 6, yesterdayKuaiMeiList[5])
    ws.write(3, 7, yesterdayKuaiMeiList[6])
    ws.write(3, 8, yesterdayKuaiMeiList[7])
    ws.write(3, 9, yesterdayKuaiMeiList[8])
    # total
    ws.write(3, 10,List[0] )
    # Blended coal by region
    ws.write(3, 13, yesterday.__str__())
    ws.write(3, 14, yesterdayHunMeiList[0])
    ws.write(3, 15, yesterdayHunMeiList[1])
    ws.write(3, 16, yesterdayHunMeiList[2])
    ws.write(3, 17, yesterdayHunMeiList[3])
    ws.write(3, 18, yesterdayHunMeiList[4])
    ws.write(3, 19, yesterdayHunMeiList[5])
    ws.write(3, 20, yesterdayHunMeiList[6])
    ws.write(3, 21, yesterdayHunMeiList[7])
    ws.write(3, 22, yesterdayHunMeiList[8])
    ws.write(3, 23, List[1])
    # Other coal products by region
    ws.write(3, 27, yesterday.__str__())
    ws.write(3, 28, yesterdayOtherMeiList[0])
    ws.write(3, 29, yesterdayOtherMeiList[1])
    ws.write(3, 30, yesterdayOtherMeiList[2])
    ws.write(3, 31, yesterdayOtherMeiList[3])
    ws.write(3, 32, yesterdayOtherMeiList[4])
    ws.write(3, 33, yesterdayOtherMeiList[5])
    ws.write(3, 34, yesterdayOtherMeiList[6])
    ws.write(3, 35, yesterdayOtherMeiList[7])
    ws.write(3, 36, yesterdayOtherMeiList[8])
    ws.write(3, 37, List[3])
    # Powdered coal by region
    ws.write(3, 40, yesterday.__str__())
    ws.write(3, 41, yesterdayPinkMeiList[0])
    ws.write(3, 42, yesterdayPinkMeiList[1])
    ws.write(3, 43, yesterdayPinkMeiList[2])
    ws.write(3, 44, yesterdayPinkMeiList[3])
    ws.write(3, 45, yesterdayPinkMeiList[4])
    ws.write(3, 46, yesterdayPinkMeiList[5])
    ws.write(3, 47, yesterdayPinkMeiList[6])
    ws.write(3, 48, yesterdayPinkMeiList[7])
    ws.write(3, 49, yesterdayPinkMeiList[8])
    ws.write(3, 50, List[2])
    # Coal sales by region
    ws1 = w.add_sheet('煤炭产品按地区销量统计')  # sheet: coal product sales by region
    # First row
    ws1.write(0, 0, '煤炭产品按地区销量统计(万吨)')  # title: coal product sales by region (10,000 tonnes)
    # Second row: provinces, taken from the 'label' attributes
    ws1.write(1, 0, '日期')  # date
    ws1.write(1, 1, shengList[0])
    ws1.write(1, 2, shengList[1])
    ws1.write(1, 3, shengList[2])
    ws1.write(1, 4, shengList[3])
    ws1.write(1, 5, shengList[4])
    ws1.write(1, 6, shengList[5])
    ws1.write(1, 7, shengList[6])
    ws1.write(1, 8, shengList[7])
    ws1.write(1, 9, shengList[8])
    ws1.write(1, 10, shengList[9])
    #
    ws1.write(1, 11, 'Other\nProvince')
    ws1.write(1, 12, 'TOTAL')
    ws1.write(1, 13, 'Last\nWeek')
    ws1.write(1, 14, 'Last\nmonth')
    ws1.write(1, 15, 'Last\nyear')
    ws1.write(1, 16, 'powdered coal 5dma')
    ws1.write(3, 0, yesterday.__str__())
    ws1.write(3, 1, sellList[0])
    ws1.write(3, 2, sellList[1])
    ws1.write(3, 3, sellList[2])
    ws1.write(3, 4, sellList[3])
    ws1.write(3, 5, sellList[4])
    ws1.write(3, 6, sellList[5])
    ws1.write(3, 7, sellList[6])
    ws1.write(3, 8, sellList[7])
    ws1.write(3, 9, sellList[8])
    ws1.write(3, 10, sellList[9])
    # Sum of the ten listed provinces
    xTotal = sum(float(v) for v in sellList[:10])
    print 'xTotal',xTotal
    # "Other Province" column: the city-wide total divided by 10000, minus the listed provinces
    zz = float(float(List[4])/10000L) - float(xTotal)
    print 'zz : ' ,zz
    ws1.write(3, 11,zz)
    ws1.write(3, 12,float(List[4])/10000L)
    ws1.write(3, 13, week())
    ws1.write(3, 14, month())
    ws1.write(3, 15, year())
    ws1.write(3, 16,0 )
    xlsName = yesterday.__str__() + 'digital_coal.xls'
    w.save(xlsName)
shiZongSell = 0
# Get the city's total coal product sales (10,000 tonnes)
def getZTotal():
    params = {
        'methodName': 'query_yxxx'
    }
    url = 'http://58.18.38.116/erds_mt/indexAction.do?' + urlencode(params)
    resp = get_html(url)
    print 'Request status: ', resp.status_code
    try:
        if resp.status_code == 200:
            htmlResponse = resp.text
            # print 'htmlResponse',htmlResponse
            soup = BeautifulSoup(htmlResponse.__str__(), 'lxml')
            Zsel = soup.findAll(attrs={'style' : 'font-size: 15'})
            for z in Zsel:
                print z.string
                shiZongSell = z.string
            # The slices below rely on fixed character offsets in the summary sentence.
            # Each number sits between '品' or '煤' (the last character of the coal name) and '吨' (tonnes).
            print 'shiZongSell',shiZongSell[0:31].lstrip()
            start = shiZongSell[0:31].lstrip().find('品')
            end = shiZongSell[0:31].lstrip().find('吨',1)
            print 'City total sales:', shiZongSell[0:31].lstrip()[start+1:end]
            zSe = shiZongSell[0:31].lstrip()[start+1:end]
            # Lump coal total
            kuaiMeiToal = shiZongSell[34:44].lstrip()
            print 'kuaiMeiToal:',kuaiMeiToal
            start = shiZongSell[34:44].lstrip().find('煤')
            end = shiZongSell[34:44].lstrip().find('吨', 1)
            kuaiMeiToal = shiZongSell[34:44].lstrip()[start+1:end]
            print 'kuaiMeiToal : ',kuaiMeiToal
            # Blended coal total
            HunMeiToal = shiZongSell[46:58].lstrip()
            print 'HunMeiToal:', HunMeiToal
            start = shiZongSell[46:58].lstrip().find('煤')
            end = shiZongSell[46:58].lstrip().find('吨', 1)
            HunMeiToal = shiZongSell[46:58].lstrip()[start + 1:end]
            print 'HunMeiToal : ', HunMeiToal
            # Powdered coal total
            pinkMeiToal = shiZongSell[59:72].lstrip()
            print 'pinkMeiToal:', pinkMeiToal
            start = shiZongSell[59:72].lstrip().find('煤')
            end = shiZongSell[59:72].lstrip().find('吨', 1)
            pinkMeiToal = shiZongSell[59:72].lstrip()[start + 1:end]
            print 'pinkMeiToal : ', pinkMeiToal
            # Other coal products total
            otherMeiToal = shiZongSell[76:87].lstrip()
            print 'otherMeiToal:', otherMeiToal
            start = shiZongSell[76:87].lstrip().find('品')
            end = shiZongSell[76:87].lstrip().find('吨', 1)
            otherMeiToal = shiZongSell[76:87].lstrip()[start + 1:end]
            print 'otherMeiToal : ', otherMeiToal
            ZZList = [kuaiMeiToal,HunMeiToal,pinkMeiToal,otherMeiToal,zSe]
            return ZZList
    except requests.ConnectionError as e:
        print ('Error', e.args)
# Get the sales volume for the week
def week():
    params = {
        'methodName': 'query_mtcx_xlxx_bt_print',
        'type':'2'
    }
    url = 'http://58.18.38.116/erds_mt/indexAction.do?' + urlencode(params)
    resp = get_html(url)
    print 'Request status: ', resp.status_code
    try:
        if resp.status_code == 200:
            weekResponse = resp.text
            # print 'week : ',weekResponse
            dom = minidom.parseString(weekResponse)
            value = dom.getElementsByTagName('set')
            print value[6].getAttribute('value')
            return value[6].getAttribute('value')
    except requests.ConnectionError as e:
        print ('Error', e.args)
# Get the sales volume for the month
def month():
    params = {
        'methodName': 'query_mtcx_xlxx_bt_print',
        'type':'3'
    }
    url = 'http://58.18.38.116/erds_mt/indexAction.do?' + urlencode(params)
    resp = get_html(url)
    print 'Request status: ', resp.status_code
    try:
        if resp.status_code == 200:
            monthResponse = resp.text
            # print 'month : ',monthResponse
            dom = minidom.parseString(monthResponse)
            value = dom.getElementsByTagName('set')
            print value[30].getAttribute('value')
            return value[30].getAttribute('value')
    except requests.ConnectionError as e:
        print ('Error', e.args)
# Get the sales volume for the year (sum of all returned values)
def year():
    params = {
        'methodName': 'query_mtcx_xlxx_bt_print',
        'type':'4'
    }
    url = 'http://58.18.38.116/erds_mt/indexAction.do?' + urlencode(params)
    resp = get_html(url)
    print 'Request status: ', resp.status_code
    try:
        if resp.status_code == 200:
            yearResponse = resp.text
            # print 'year : ',yearResponse
            dom = minidom.parseString(yearResponse)
            value = dom.getElementsByTagName('set')
            print value.length
            sum = 0
            for i in range(value.length):
                sum += float(value[i].getAttribute('value'))
            print 'sum : ' ,sum
            return sum
    except requests.ConnectionError as e:
        print ('Error', e.args)
if __name__ == "__main__":
    get_data()
    get_data1()
    insert_csv()
    # week()
    # month()
    # year()
    # List = getZTotal()
    # print '1:',List[0]
    # print '2:', List[1]
    # print '3:', List[2]
    # print '4:', List[3]
    print ("Crawl finished...")
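The fixed slice offsets in getZTotal (for example `[34:44]`) only line up while every number in the summary sentence keeps the same number of digits. One possible alternative, sketched here for Python 3, is to pull out every figure that is followed by 吨 (tonnes) with a regular expression. The sample sentence, the pattern `TONNES`, and the helper `extract_tonnages` are assumptions for illustration; the real wording on the page may differ.

```python
import re

# Hypothetical summary sentence; the real wording on the page may differ
SAMPLE = '共销售煤炭产品 120000吨,其中块煤 30000吨,混煤 50000吨,粉煤 25000吨,其他煤品 15000吨'

# A number (optionally with a decimal part) followed by 吨 (tonnes)
TONNES = re.compile(r'(\d+(?:\.\d+)?)\s*吨')


def extract_tonnages(text):
    """Return every tonnage figure in the text, in order of appearance."""
    return [float(m) for m in TONNES.findall(text)]


totals = extract_tonnages(SAMPLE)
```

With a sentence in this order, `totals` would hold the city-wide figure first, followed by lump, blended, powdered, and other coal, which is a different order from the list getZTotal returns.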