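Background
Last weekend I spent a whole afternoon looking up salary data for the major tech companies by hand for my girlfriend. It was tedious work, so I decided to write a crawler to collect the data automatically.
Image: the offershow mini program interface (referred to below as offer秀).
My local development environment is Go, so the crawler is written in Go. The goal is to crawl salary data for the major companies and produce a spreadsheet that Excel can open. The user can supply filter conditions such as company, school, and education level, and only the rows that match are output.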
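Packet capture analysis
The most important step in writing a crawler is capturing and analyzing the HTTP traffic, both the request (headers and body) and the response, because the crawler has to construct the same request and parse the same response. Crawling a WeChat mini program is no different from crawling a web page. Only the tooling differs: a web page can be inspected with the browser's F12 developer tools, but a mini program behaves like an app and needs a packet capture tool. There are many to choose from; Charles is used here. Charles is a capture tool dedicated to HTTP and HTTPS, and with SSL proxying enabled it shows HTTPS traffic in plaintext.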
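offer秀 uses HTTPS, so set up SSL proxying in Charles:
- Install the Charles SSL root certificate
Help -> SSL Proxying -> Install Charles Root Certificate -> General -> Install Certificate -> Local Machine -> Place all certificates in the following store -> Trusted Root Certification Authorities
- Configure SSL proxying, with the local PC as the proxy
Proxy -> SSL Proxying Settings -> SSL Proxying -> Add
Host: *.*
Port: 443
Open the offer秀 mini program in the desktop WeChat client, tap a major company such as Huawei to view its offers, and then inspect the request and response in Charles.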
Capture results
Request
Headers:
POST //webapi/v2/search_salary HTTP/1.1
Host: www.ioffershow.com
Connection: keep-alive
Content-Length: 221
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36 MicroMessenger/7.0.9.501 NetType/WIFI MiniProgramEnv/Windows WindowsWechat
content-type: application/x-www-form-urlencoded
Referer: https://servicewechat.com/wx67fbba6cd94591e4/44/page-frame.html
Accept-Encoding: gzip, deflate, br
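- The request uses the POST method.
- The URI path is /webapi/v2/search_salary, so the full URL is https://www.ioffershow.com/webapi/v2/search_salary.
- The User-Agent carries the WeChat identifier; the crawler builds its request headers from this capture.
Body: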
content=%E5%8D%8E%E4%B8%BA&search_priority=1&ordertype=2&part_school=&year=&education=%E5%85%A8%E9%83%A8&access_token=%24ytkzhLIvv5%2BsYwytrpIDkg26d4HQpxjr6pCoffershowzju1qaz.1626568859971.15aeb732580bc4957cddcb71075cef51
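The body carries the query parameters, such as the company name and education level. The Chinese text is percent-encoded (URL-encoded UTF-8), so Charles shows it as unreadable escape sequences. For example, in
content=%E5%8D%8E%E4%B8%BA the encoded string is the Chinese word “华为” (Huawei). An online URL decoder can convert it back to readable Chinese.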
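The decoding can also be done locally instead of with an online tool. The following is a minimal sketch using the standard net/url package; decodeBody is a hypothetical helper name, and the body is a shortened copy of the captured one that keeps only two fields.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeBody parses a form-encoded request body and returns its fields
// with the percent-encoding removed. (Hypothetical helper, not part of
// the crawler itself.)
func decodeBody(body string) (url.Values, error) {
	return url.ParseQuery(body)
}

func main() {
	// Shortened copy of the captured body: only two of the fields.
	body := "content=%E5%8D%8E%E4%B8%BA&education=%E5%85%A8%E9%83%A8"

	fields, err := decodeBody(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("content:", fields.Get("content"))
	fmt.Println("education:", fields.Get("education"))
}
```

Going the other way, url.Values.Encode produces the percent-encoded form, which is handy when building the request body.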
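Note also the token. How is it generated? That is hard to determine, but in my experiments the server did not appear to validate it. The access_token field does have to be present, so its value can apparently be any string; here we keep the captured value. The crawler sends a string in the format above as the HTTP body.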
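Response
The response is a JSON string. The info array holds one entry per post, and each entry includes a field with the poster's IP address. So offer秀 keeps a record after all: the posts are anonymous, but not completely.
Handling the response comes down to parsing the JSON string and reading the fields of each entry.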
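Code
Our goal is a fairly simple crawler that fetches the content and saves it; performance is not a concern. Steps:
- Build the HTTP request
- Receive the HTTP response
- Convert the character set: Unicode escapes (\uXXXX) to UTF-8
- Parse the JSON
- Save the data to a CSV file (CSV is used here because it is easy to handle)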
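As a side note on the character-set step: \uXXXX sequences are ordinary JSON string escapes, so the standard encoding/json package decodes them when it reads a string value. The sketch below shows this for a single string; decodeEscapes is a hypothetical helper name, and it assumes the input contains no unescaped double quotes or control characters.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeEscapes turns JSON-style \uXXXX escapes into UTF-8 text by wrapping
// the input in double quotes and decoding it as a JSON string.
// Assumption: s contains no unescaped double quotes or control characters.
func decodeEscapes(s string) (string, error) {
	var out string
	err := json.Unmarshal([]byte(`"`+s+`"`), &out)
	return out, err
}

func main() {
	// Raw string literal: the backslashes reach the decoder unchanged.
	text, err := decodeEscapes(`\u534e\u4e3a`)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(text)
}
```

This works on one string value at a time, not on a whole response, so it is an illustration of the escape format rather than a drop-in replacement for the conversion step.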
Here is the code:
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"strings"
	"time"

	"github.com/tidwall/gjson"
)

const searchURL = "https://www.ioffershow.com/webapi/v2/search_salary"

// accessToken is the value from a captured request, stored in decoded form
// ("$" and "+" instead of %24 and %2B); url.Values re-encodes it.
const accessToken = "$ytkzhLIvv5+sYwytrpIDkg26d4HQpxjr6pCoffershowzju1qaz.1626486167795.e7c54cdc68aac7b8dcedff91f09e204c"

// Filter conditions: position, company, city, education, school.
// ParseCompany queries the posts for one company and saves them to a CSV file.
func ParseCompany(company string) {
	// url.Values percent-encodes the Chinese text and the token.
	form := url.Values{}
	form.Set("content", company)
	form.Set("search_priority", "1")
	form.Set("ordertype", "2")
	form.Set("part_school", "")
	form.Set("year", "")
	form.Set("education", "全部")
	form.Set("access_token", accessToken)

	client := &http.Client{}
	req, err := http.NewRequest("POST", searchURL, strings.NewReader(form.Encode()))
	if err != nil {
		log.Fatalln(err)
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36 MicroMessenger/7.0.9.501 NetType/WIFI MiniProgramEnv/Windows WindowsWechat")
	req.Header.Set("content-type", "application/x-www-form-urlencoded")
	req.Header.Set("Referer", "https://servicewechat.com/wx67fbba6cd94591e4/44/page-frame.html")
	// Accept-Encoding is deliberately not copied from the capture: when it is
	// set by hand, net/http stops decompressing the response body for us.

	resp, err := client.Do(req)
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status code: %d", resp.StatusCode)
	}
	respBytes, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalln(err)
	}
	result := unicode2utf8(string(respBytes))
	fmt.Println(result)

	infoList := gjson.Get(result, "info")
	if !infoList.Exists() {
		return
	}

	// Output file: <company>_<timestamp>.csv
	fileName := company + "_" + time.Now().Format("20060102150405") + ".csv"
	file, err := os.Create(fileName)
	if err != nil {
		fmt.Println("create file failed:", err)
		return
	}
	defer file.Close()

	// encoding/csv quotes fields that contain commas, quotes, or line breaks.
	w := csv.NewWriter(file)
	w.UseCRLF = true
	defer w.Flush() // runs before file.Close
	if err := w.Write([]string{"id", "company", "position", "salary", "city", "ip", "industry", "education", "salary type", "salary upper", "salary lower", "credibility", "time", "remark"}); err != nil {
		log.Println("write header failed:", err)
		return
	}

	// JSON keys in CSV column order; the free-text remark goes last.
	keys := []string{"id", "company", "position", "salary", "city", "ip", "hangye", "xueli", "salarytype", "salary_upper", "salary_lower", "score", "time", "remark"}
	for _, v := range infoList.Array() {
		row := make([]string, 0, len(keys))
		fmt.Println("=================================")
		for _, k := range keys {
			val := v.Get(k).String()
			fmt.Printf("%s: %s\n", k, val)
			row = append(row, val)
		}
		fmt.Println("is_delete:", v.Get("is_delete").String()) // printed only, not saved
		fmt.Println("=================================")
		// Save the row to the file.
		if err := w.Write(row); err != nil {
			log.Println("write row failed:", err)
		}
	}
}

// unicode2utf8 replaces \uXXXX escapes with the characters they stand for.
// Escapes outside the Basic Multilingual Plane (surrogate pairs) are not handled.
func unicode2utf8(source string) string {
	parts := strings.Split(source, "\\u")
	var sb strings.Builder
	sb.WriteString(parts[0]) // text before the first escape is copied verbatim
	for _, v := range parts[1:] {
		if len(v) >= 4 {
			if code, err := strconv.ParseUint(v[:4], 16, 32); err == nil {
				sb.WriteRune(rune(code))
				sb.WriteString(v[4:])
				continue
			}
		}
		// Not a valid escape: put the text back unchanged.
		sb.WriteString("\\u" + v)
	}
	return sb.String()
}

func main() {
	companyList := []string{"百度", "字节", "腾讯", "阿里", "tplink"}
	for _, company := range companyList {
		ParseCompany(company)
	}
}
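Go's built-in encoding/json package is most convenient when the JSON maps onto predefined structs; decoding into generic maps also works but needs a type assertion for every lookup. For pulling out a handful of fields that is cumbersome, so gjson is used for JSON parsing here.
It can be installed with: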
go get github.com/tidwall/gjson
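For comparison, here is a rough sketch of the struct-based approach with the standard library only. The type names post and searchResponse and the helper parsePosts are hypothetical, the sample JSON is made up to match the shape described earlier (it is not real data), and the sketch assumes these four fields arrive as JSON strings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// post holds a few of the fields the crawler reads.
// Assumption: these fields are JSON strings in the real response.
type post struct {
	Company  string `json:"company"`
	Position string `json:"position"`
	Salary   string `json:"salary"`
	City     string `json:"city"`
}

// searchResponse mirrors the top-level object with its info array.
type searchResponse struct {
	Info []post `json:"info"`
}

// parsePosts decodes a response body into a slice of posts.
func parsePosts(data []byte) ([]post, error) {
	var r searchResponse
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	return r.Info, nil
}

func main() {
	// Made-up sample in the described shape; \u escapes are decoded by encoding/json.
	sample := []byte(`{"info":[{"company":"\u534e\u4e3a","position":"engineer","salary":"example","city":"Shenzhen"}]}`)

	posts, err := parsePosts(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range posts {
		fmt.Println(p.Company, p.Position, p.Salary, p.City)
	}
}
```

The trade-off is the one described above: every field needs a struct member with the right type up front, whereas gjson reads fields by path without any declarations.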