Over the past couple of days I watched a video tutorial on NetEase Cloud Classroom (网易云课堂) about scraping job listings from Lagou (拉勾网). I learned a lot from it, so I'm writing it up here.
Environment: Ubuntu 16.04, PyCharm 2017, Python 3.5+, Google Chrome.
The target is the Python-related job postings on Lagou. The end result looks like this:
Below are the steps I followed, along with my notes.
1. Importing the Libraries
The external libraries we need are requests and BeautifulSoup, both staples of web scraping. Importing them tripped me up at first: I assumed that picking the Python 3.5 interpreter when creating the project would be enough, since I had already installed both libraries from the terminal, but it still didn't work. So I installed them again through the PyCharm IDE: File -> Project: lagou (my project is named lagou) -> Project Interpreter -> click the + on the right -> search for the package you want -> Install Package.
That solved my first problem.
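Alternatively, the same two packages can be installed from the terminal, as long as pip3 belongs to the interpreter the project actually uses (note that BeautifulSoup's package name on PyPI is beautifulsoup4, not its import name):

    pip3 install requests beautifulsoup4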
2. Analyzing Lagou's Page Layout
Search Baidu for Lagou, open the site, and search for python. It looks like this:
What we want now are the details of each of those engineer positions. The initial code:
import requests
from bs4 import BeautifulSoup

def main():
    result = requests.get('https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=')
    print(result.content)

if __name__ == '__main__':
    main()
The result, however, was a short chunk of completely unrelated data, and the Chinese in it was garbled. Like this:
Let's fix the garbled Chinese first. Code and result screenshot:
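The key change, which also appears in the full listing further down, is to decode the response bytes as UTF-8 instead of printing them raw:

import requests

result = requests.get('https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=')
# result.content is bytes; decoding as UTF-8 makes the Chinese text readable
html = str(result.content, 'utf-8')
print(html)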
So why isn't the data we get Lagou's real data? Generally it's because the HTTP request headers our program sends differ from those a normal browser sends. Here is a capture of the request headers my Chrome browser sends when visiting the Lagou python page:
By contrast, the commonly used urllib library sends request headers like the following:
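You can see the difference for yourself by echoing your program's request headers off a test endpoint (a quick sketch; httpbin.org is just one convenient echo service, not part of the tutorial):

import requests

# httpbin.org/headers echoes back the headers it received; by default
# requests identifies itself as python-requests/<version>, not as a browser
print(requests.get('https://httpbin.org/headers').json())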
So we customize the request headers through the requests module to make our program look more like a human. The program becomes:
import requests
from bs4 import BeautifulSoup

def main():
    _headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        'Host': 'www.lagou.com',
        'Cookie': 'JSESSIONID=ABAAABAACBHABBIB0D40B938A14EAEE96D4254851D7B2A4; user_trace_token=20180128095608-f38ffda1-4fa5-405d-a9a6-ca3e903874cd; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; _ga=GA1.2.1594195246.1517104570; _gid=GA1.2.1758576376.1517104570; _gat=1; LGSID=20180128095610-69dda930-03ce-11e8-abb7-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGRID=20180128095610-69ddab7b-03ce-11e8-abb7-5254005c3644; LGUID=20180128095610-69ddacbd-03ce-11e8-abb7-5254005c3644; SEARCH_ID=618cfe9d82964f4f824480332d52f19e'
    }
    result = requests.get('https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=', headers=_headers)
    html = str(result.content, 'utf-8')
    print(html)

if __name__ == '__main__':
    main()
But the result was still unsatisfying: garbage data again.
That's because Lagou uses Ajax. In Chrome DevTools, open the Network tab and select XHR; you'll see four requests (if the list is empty, refresh the page). Click the first one and look at Preview: under content -> positionResult -> result is exactly the data we want, a list of all the positions.
Going back to the Headers tab, we can see the Request Method is actually POST, not GET; the Request Headers gain the entries below, and there is now Form Data as well:
So the code changes to:
import requests
from bs4 import BeautifulSoup

def main():
    _headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',  # Ajax
        'Cookie': 'JSESSIONID=ABAAABAACBHABBIB0D40B938A14EAEE96D4254851D7B2A4; user_trace_token=20180128095608-f38ffda1-4fa5-405d-a9a6-ca3e903874cd; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; _ga=GA1.2.1594195246.1517104570; _gid=GA1.2.1758576376.1517104570; _gat=1; LGSID=20180128095610-69dda930-03ce-11e8-abb7-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGRID=20180128095610-69ddab7b-03ce-11e8-abb7-5254005c3644; LGUID=20180128095610-69ddacbd-03ce-11e8-abb7-5254005c3644; SEARCH_ID=618cfe9d82964f4f824480332d52f19e'
    }
    _data = {
        'first': 'true',
        'pn': '1',
        'kd': 'python'
    }
    result = requests.post('https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0',
                           headers=_headers, data=_data)
    html = str(result.content, 'utf-8')
    print(html)

if __name__ == '__main__':
    main()
Open www.json.cn and paste the console output in: we get well-formatted JSON, and it is exactly the data we wanted, so the capture succeeded. Screenshot:
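Instead of pasting into www.json.cn, you can also pretty-print the response locally with the standard library (a small sketch; result here is the response object from the POST above):

import json

# indent=2 gives roughly the same readable layout that www.json.cn shows
print(json.dumps(result.json(), ensure_ascii=False, indent=2))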
Now let's save the information inside result to a JSON file. The code:
import requests
from bs4 import BeautifulSoup
import time
import json

def main():
    _headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',  # Ajax
        'Cookie': 'JSESSIONID=ABAAABAACBHABBIB0D40B938A14EAEE96D4254851D7B2A4; user_trace_token=20180128095608-f38ffda1-4fa5-405d-a9a6-ca3e903874cd; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; _ga=GA1.2.1594195246.1517104570; _gid=GA1.2.1758576376.1517104570; _gat=1; LGSID=20180128095610-69dda930-03ce-11e8-abb7-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGRID=20180128095610-69ddab7b-03ce-11e8-abb7-5254005c3644; LGUID=20180128095610-69ddacbd-03ce-11e8-abb7-5254005c3644; SEARCH_ID=618cfe9d82964f4f824480332d52f19e'
    }
    _data = {
        'first': 'true',
        'pn': '1',
        'kd': 'python'
    }
    result = requests.post('https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0',
                           headers=_headers, data=_data)
    json_result = result.json()
    # json_result is a dict; we want the info under its result key.
    positions = json_result['content']['positionResult']['result']
    for position in positions:
        print('-' * 40)
        print(position)
        time.sleep(5)
    line = json.dumps(positions, ensure_ascii=False)
    with open('lagou.json', 'w') as fp:
        fp.write(line)

if __name__ == '__main__':
    main()
The result we see (screenshot):
There is one extra Shanghai position, because Lagou had just published a new one.
3. Scraping Across Pages
We can now scrape the first page of positions, but what if we want ten pages, or more? Lagou shows 30 pages in total. So what distinguishes the request for each page?
Clicking page two reveals the change: in Form Data, first becomes false and pn becomes 2.
So our code becomes:
import requests
from bs4 import BeautifulSoup
import time
import json

def main():
    _headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',  # Ajax
        'Cookie': 'JSESSIONID=ABAAABAACBHABBIB0D40B938A14EAEE96D4254851D7B2A4; user_trace_token=20180128095608-f38ffda1-4fa5-405d-a9a6-ca3e903874cd; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; _ga=GA1.2.1594195246.1517104570; _gid=GA1.2.1758576376.1517104570; _gat=1; LGSID=20180128095610-69dda930-03ce-11e8-abb7-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGRID=20180128095610-69ddab7b-03ce-11e8-abb7-5254005c3644; LGUID=20180128095610-69ddacbd-03ce-11e8-abb7-5254005c3644; SEARCH_ID=618cfe9d82964f4f824480332d52f19e'
    }
    positions = []
    for x in range(1, 3):
        if x == 1:
            first_bool = 'true'
        else:
            first_bool = 'false'
        _data = {
            'first': first_bool,
            'pn': x,
            'kd': 'python'
        }
        result = requests.post(
            'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0',
            headers=_headers, data=_data)
        json_result = result.json()
        page_positions = json_result['content']['positionResult']['result']
        print('--' * 30)
        print(x)
        print(page_positions)
        print('--' * 30)

if __name__ == '__main__':
    main()
This fetches the positions on pages one and two. But if you try to fetch twenty-plus pages, Lagou will refuse your requests. There are only two remedies:
1. Request fewer pages.
2. Make the program's sleep() longer (see the sketch below).
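For example, a gentler loop might randomize the pause between page requests rather than using a fixed interval (a sketch only; polite_sleep is a hypothetical helper, and the 10-20 second range is an arbitrary choice, not something the tutorial prescribes):

import random
import time

def polite_sleep(min_s=10, max_s=20):
    # a randomized pause between page requests looks less mechanical
    # than a fixed interval and spreads the load out over time
    time.sleep(random.uniform(min_s, max_s))

Calling polite_sleep() at the end of each iteration of the page loop, in place of a fixed time.sleep(), makes the traffic pattern less regular.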
We can also dump these paginated results into the lagou.json file.
4. Filtering the Information
We have the data now, but we want the more precise, important pieces: position name, city, company name, salary, requirements. How do we do that? Code first; this is also the final version.
import requests
import json
from bs4 import BeautifulSoup
import time

def crwalDetail(id):
    url = 'https://www.lagou.com/jobs/%s.html' % id
    d_headers = {
        'Host': 'www.lagou.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
    }
    d_result = requests.get(url, headers=d_headers)
    soup = BeautifulSoup(d_result.content, 'html.parser')
    job_bt = soup.find('dd', attrs={'class': 'job_bt'})
    purejob_bt = job_bt.text
    return purejob_bt

def main():
    _headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': None,
        'X-Requested-With': 'XMLHttpRequest',  # Ajax
        'Cookie': 'JSESSIONID=ABAAABAACBHABBIB0D40B938A14EAEE96D4254851D7B2A4; user_trace_token=20180128095608-f38ffda1-4fa5-405d-a9a6-ca3e903874cd; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1517104570; _ga=GA1.2.1594195246.1517104570; _gid=GA1.2.1758576376.1517104570; _gat=1; LGSID=20180128095610-69dda930-03ce-11e8-abb7-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGRID=20180128095610-69ddab7b-03ce-11e8-abb7-5254005c3644; LGUID=20180128095610-69ddacbd-03ce-11e8-abb7-5254005c3644; SEARCH_ID=618cfe9d82964f4f824480332d52f19e'
    }
    positions = []
    for x in range(1, 3):
        if x == 1:
            first_bool = 'true'
        else:
            first_bool = 'false'
        _data = {
            'first': first_bool,
            'pn': x,
            'kd': 'python'
        }
        result = requests.post('https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0',
                               headers=_headers, data=_data)
        json_result = result.json()
        page_positions = json_result['content']['positionResult']['result']
        for position in page_positions:
            position_dict = {
                'position_name': position['positionName'],
                'work_year': position['workYear'],
                'city': position['city'],
                'district': position['district'],
                'salary': position['salary'],
                'companyFullName': position['companyFullName']
            }
            position_id = position['positionId']
            print(position_id)
            position_detail = crwalDetail(position_id)
            position_dict['position_detail'] = position_detail
            positions.append(position_dict)
            time.sleep(20)
        time.sleep(10)
    line = json.dumps(positions, ensure_ascii=False)
    with open('lagou.json', 'w') as fp:
        fp.write(line)

if __name__ == '__main__':
    main()
    # crwalDetail(4049431)
Here we define the crwalDetail function to scrape the detailed job description, i.e. information like this:
When you are just starting out, you don't have to call it from main(); you can pass a concrete id to it directly (the commented-out crwalDetail(4049431) at the bottom). The results are saved in the lagou.json file.
Which is exactly the output shown at the beginning.