I. Introduction
While learning how to gather information on websites, I ran into a lot of manual sorting work in the subdomain-collection phase. Since cleaning up results by hand takes far too long, my mentor 硬糖 taught me to automate it with Python. What follows is a record of my small exercise in using Python to scrape page content.
II. Learning Process
1. Development environment:
Python version: 3.7.1
Modules used:
requests module  # for sending HTTP requests to the site
pymysql module  # MySQL is the only database I know so far, so pymysql it is
json module  # the page returns JSON, so the json module is needed
2. How it works
First, open the subdomain-collection site; we only need the Domain column:
https://securitytrails.com/list/keyword/huazhu
Then open the Network tab in developer tools. You will find that requesting https://securitytrails.com/app/api/v1/list?page=1&apex_domain=huazhu.com returns the page content as JSON.
The request must carry the following parameters:
1. page: 2  # which page of results to fetch
2. keyword: huazhu  # the domain being queried
3. _csrf_token: "IgRzIV9aUkdMEjY3AyZwbGMlRhEJAg=="  # the CSRF token; the request fails without it
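The response body has a records array whose entries each carry a hostname field (those two field names come from the code below; the sample payload here is made up for illustration). Pulling the hostnames out of one page's JSON looks like this:

```python
import json

# A made-up payload mimicking the structure the endpoint returns
sample = '''
{
  "records": [
    {"hostname": "www.huazhu.com"},
    {"hostname": "mail.huazhu.com"}
  ]
}
'''

records = json.loads(sample)["records"]
hostnames = [r["hostname"] for r in records]
print(hostnames)  # ['www.huazhu.com', 'mail.huazhu.com']
```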
3. Based on this analysis, write a small program that automatically fetches the subdomains found and inserts them into a MySQL database.

import requests
import pymysql

# Fetch the subdomains from each page of results
def get_domains():
    # Request headers (the Cookie below is the session captured from the browser)
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": "https://securitytrails.com/app/api/v1/list?page=1&apex_domain=huazhu.com",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "Cookie": "_ga=GA1.2.1769536690.1573528145; _vwo_uuid_v2=D1D2C0B1A59E881EA448125F0EB5537AF|37e1a39226ed536212a53b0d0bb1da1d; _vwo_uuid=D1D2C0B1A59E881EA448125F0EB5537AF; __adroll_fpc=cf0757dd879e87ebef9f41c1d828f5ad-s2-1573528177608; _fbp=fb.1.1573529244781.653288129; __stripe_mid=1945abe2-154a-461f-8f8a-0b9a5dbe1501; _vwo_ds=3%3Aa_0%2Ct_0%3A0%241573528144%3A23.26423306%3A%3A%3A21_0%2C14_0%2C12_0%2C11_0%2C6_0%2C3_0%3A1; driftt_aid=62aa77aa-55a7-4556-a065-5255a642e7f7; DFTT_END_USER_PREV_BOOTSTRAPPED=true; driftt_eid=zcboy95%40gmail.com; _gid=GA1.2.503009466.1575253442; _gat=1; _vis_opt_s=13%7C; _vis_opt_test_cookie=1; mp_679f34927f7b652f13bda4e479a7241d_mixpanel=%7B%22distinct_id%22%3A%20%22u_3a74dd33-16ee-4d82-b9f1-f29b2b5b73f2%22%2C%22%24device_id%22%3A%20%2216e5da6e3d3cd3-0f577e8a3fd7f3-1c3a6a5b-13c680-16e5da6e3d4bd6%22%2C%22%24initial_referrer%22%3A%20%22https%3A%2F%2Fsecuritytrails.com%2Fdns-trails%22%2C%22%24initial_referring_domain%22%3A%20%22securitytrails.com%22%2C%22app%22%3A%20%22SecurityTrails%22%2C%22utm_source%22%3A%20%22st-app%22%2C%22utm_medium%22%3A%20%22cta-bottom%22%2C%22%24user_id%22%3A%20%22u_3a74dd33-16ee-4d82-b9f1-f29b2b5b73f2%22%7D; _gat_gtag_UA_108439842_1=1; _vwo_sn=1725298%3A2; _securitytrails_app=QTEyOEdDTQ.wQYOTJFM-W2V69_qmmeNGGbVP3b_sljSiHE86hWrjsEP5L1N5VDUWsT5H_c.EDyVDlFfHvchMMrz.aI_2WeY6oOKDHLo6rRi_jGLHmd7Sscuefg_AF5mt0AO-ZchowxTKotISlzdmaId09SVRJx_Hwb-q-jV5J_bLw6Db4fs3DxzTN4UHqiBuvw.0CEs66huDsOhgun67uT_Vw; driftt_sid=74886877-96c9-4936-a445-2a5a9bf63f02; __ar_v4=DISBUDHYAZAKNC7GVZRXHU%3A20191112%3A53%7CK4MIVIZDAZFQJNYFCCLGOP%3A20191112%3A53%7CGDFF5LAGC5AWTKDNHLCMI5%3A20191112%3A53"
    }
    # CSRF token copied from the browser session
    data = {
        "_csrf_token": "NRQ3J15TACR1PnUiAFJ9A0cWQR8hAAAArruln8kWAm9LW6NWuD6KKg=="
    }
    domain_lists = []
    # Loop over the result pages and collect every hostname into domain_lists
    for page in range(1, 4):
        url = "https://securitytrails.com/app/api/v1/list?page=%s&apex_domain=huazhu.com" % page
        result = requests.post(url, data=data, headers=headers)
        for record in result.json()['records']:
            domain_lists.append(record['hostname'])
    return domain_lists

# Insert the collected subdomains into the MySQL database
def insert(domains):
    db = pymysql.connect(host='localhost', user='username', password='password', database='test')
    cursor = db.cursor()
    for domain in domains:
        # Parameterized query: the driver escapes the value safely
        row = cursor.execute("INSERT INTO huazhu (hostname) VALUES (%s)", (domain,))
        db.commit()
        if row == 0:
            print('%s insert failed' % domain)
    db.close()

if __name__ == "__main__":
    domains = get_domains()
    insert(domains)
    print('done')
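Two small prerequisites for the script: the huazhu table must exist before the inserts run, and hostnames should go in through a parameterized query rather than string formatting, since a value containing a quote would otherwise corrupt the SQL. The sketch below shows both, using Python's bundled sqlite3 purely so it runs without a MySQL server (pymysql's cursor.execute accepts the same (sql, params) shape, with %s placeholders instead of ?; the TEXT column type is an assumption for this example):

```python
import sqlite3

# In-memory database stands in for MySQL here
db = sqlite3.connect(":memory:")
cursor = db.cursor()

# The table the script inserts into: one hostname per row
cursor.execute("CREATE TABLE huazhu (hostname TEXT)")

# A hostname containing a quote would break string-formatted SQL,
# but a parameterized insert handles it safely
tricky = "o'reilly.huazhu.com"
cursor.execute("INSERT INTO huazhu (hostname) VALUES (?)", (tricky,))
db.commit()

cursor.execute("SELECT hostname FROM huazhu")
print(cursor.fetchall())  # [("o'reilly.huazhu.com",)]
```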
III. Results
IV. Summary
I am just getting started, so there is nothing technically sophisticated here; this is simply a record of a small learning exercise.
daydayup