I. Introduction

While learning how to gather information on websites, I ran into a lot of material in the subdomain-collection step that had to be sorted out by hand. Since doing it manually was too time-consuming, 硬糖师傅 (my mentor) taught me to automate it with Python. Below is a record of my small exercise in learning to crawl web page content with Python.

II. Learning Process

1. Development tools:

Python version: 3.7.1

Related modules:

requests module # for sending the HTTP requests to the site

pymysql module # I only know MySQL for now, so pymysql handles the database side

json module # the page returns JSON results (note: json ships with the Python standard library, so only requests and pymysql need installing, e.g. pip install requests pymysql)

2. How it works

First, open the site used to collect subdomains; we only need the contents of the Domain column:

https://securitytrails.com/list/keyword/huazhu

[Screenshot: the SecurityTrails subdomain list page]

Then open the Network tab in the browser's developer tools; you can see that the page's JSON data comes from requests to the following endpoint:

https://securitytrails.com/app/api/v1/list?page=<n>&apex_domain=huazhu.com

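Judging from how the script later parses the response (result.json()['records'], then each record's hostname field), the returned JSON should look roughly like this; the hostnames below are illustrative:

{
  "records": [
    {"hostname": "www.huazhu.com", ...},
    {"hostname": "mail.huazhu.com", ...}
  ],
  ...
}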

The request to this endpoint needs to carry the following parameters (a minimal single-page request sketch follows the list):

1. page: 2 # which page of results to fetch

2. keyword: huazhu # the domain being queried

3. _csrf_token: "IgRzIV9aUkdMEjY3AyZwbGMlRhEJAg==" # the CSRF token; it must be included
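
Putting these together, a single page can be fetched as in the minimal sketch below. The Cookie and _csrf_token values are placeholders tied to a browser session; copy fresh ones from your own developer tools, since the values shown in this post expire.

import requests

url = "https://securitytrails.com/app/api/v1/list?page=2&apex_domain=huazhu.com"
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Cookie": "<copied from a logged-in browser session>",
}
data = {"_csrf_token": "<copied from the same browser session>"}

result = requests.post(url, data=data, headers=headers)
print(result.json()["records"])  # the subdomain records for page 2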

3. Based on the analysis above, write a small program that automatically fetches the discovered subdomains and inserts them into a MySQL database.

import requests
import json  # not strictly needed here: requests' .json() already parses the response
import pymysql

# Fetch the subdomain hostnames returned for each page of results
def get_domains():
    # Request headers; the Cookie below is copied from a logged-in browser session and will expire
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": "https://securitytrails.com/app/api/v1/list?page=1&apex_domain=huazhu.com",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "Cookie": "_ga=GA1.2.1769536690.1573528145; _vwo_uuid_v2=D1D2C0B1A59E881EA448125F0EB5537AF|37e1a39226ed536212a53b0d0bb1da1d; _vwo_uuid=D1D2C0B1A59E881EA448125F0EB5537AF; __adroll_fpc=cf0757dd879e87ebef9f41c1d828f5ad-s2-1573528177608; _fbp=fb.1.1573529244781.653288129; __stripe_mid=1945abe2-154a-461f-8f8a-0b9a5dbe1501; _vwo_ds=3%3Aa_0%2Ct_0%3A0%241573528144%3A23.26423306%3A%3A%3A21_0%2C14_0%2C12_0%2C11_0%2C6_0%2C3_0%3A1; driftt_aid=62aa77aa-55a7-4556-a065-5255a642e7f7; DFTT_END_USER_PREV_BOOTSTRAPPED=true; driftt_eid=zcboy95%40gmail.com; _gid=GA1.2.503009466.1575253442; _gat=1; _vis_opt_s=13%7C; _vis_opt_test_cookie=1; mp_679f34927f7b652f13bda4e479a7241d_mixpanel=%7B%22distinct_id%22%3A%20%22u_3a74dd33-16ee-4d82-b9f1-f29b2b5b73f2%22%2C%22%24device_id%22%3A%20%2216e5da6e3d3cd3-0f577e8a3fd7f3-1c3a6a5b-13c680-16e5da6e3d4bd6%22%2C%22%24initial_referrer%22%3A%20%22https%3A%2F%2Fsecuritytrails.com%2Fdns-trails%22%2C%22%24initial_referring_domain%22%3A%20%22securitytrails.com%22%2C%22app%22%3A%20%22SecurityTrails%22%2C%22utm_source%22%3A%20%22st-app%22%2C%22utm_medium%22%3A%20%22cta-bottom%22%2C%22%24user_id%22%3A%20%22u_3a74dd33-16ee-4d82-b9f1-f29b2b5b73f2%22%7D; _gat_gtag_UA_108439842_1=1; _vwo_sn=1725298%3A2; _securitytrails_app=QTEyOEdDTQ.wQYOTJFM-W2V69_qmmeNGGbVP3b_sljSiHE86hWrjsEP5L1N5VDUWsT5H_c.EDyVDlFfHvchMMrz.aI_2WeY6oOKDHLo6rRi_jGLHmd7Sscuefg_AF5mt0AO-ZchowxTKotISlzdmaId09SVRJx_Hwb-q-jV5J_bLw6Db4fs3DxzTN4UHqiBuvw.0CEs66huDsOhgun67uT_Vw; driftt_sid=74886877-96c9-4936-a445-2a5a9bf63f02; __ar_v4=DISBUDHYAZAKNC7GVZRXHU%3A20191112%3A53%7CK4MIVIZDAZFQJNYFCCLGOP%3A20191112%3A53%7CGDFF5LAGC5AWTKDNHLCMI5%3A20191112%3A53"
    }
    # CSRF token, taken from the same browser session as the Cookie above
    data = {
        "_csrf_token": "NRQ3J15TACR1PnUiAFJ9A0cWQR8hAAAArruln8kWAm9LW6NWuD6KKg=="
    }

    domain_lists = []
    # Collect the hostnames from every result page
    for page in range(1, 4):  # pages 1-3; widen the range if there are more pages
        url = "https://securitytrails.com/app/api/v1/list?page=%s&apex_domain=huazhu.com" % page
        result = requests.post(url, data=data, headers=headers)
        records = result.json()['records']
        for record in records:
            domain_lists.append(record['hostname'])
    return domain_lists

# Insert the collected hostnames into the MySQL database
def insert(domains):
    # Replace username/password with your own MySQL credentials
    db = pymysql.connect(host='localhost', user='username', password='password', database='test')
    cursor = db.cursor()
    for domain in domains:
        # Parameterized query, so a hostname cannot break or inject into the SQL
        sql = "INSERT INTO huazhu (hostname) VALUES (%s)"
        cursor.execute(sql, (domain,))
        db.commit()
        if cursor.rowcount == 0:
            print('%s insert failed' % domain)
    db.close()

if __name__ == "__main__":
    domains = get_domains()
    insert(domains)
    print('done')
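
The script assumes the huazhu table already exists in the test database. Here is a one-time setup sketch that matches the INSERT statement above; the id column and the VARCHAR size are my own assumptions, not from the original setup:

import pymysql

# One-time setup: create the table the script inserts into.
# Table and column names match the INSERT above; the types are assumed.
db = pymysql.connect(host='localhost', user='username', password='password', database='test')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS huazhu (
        id INT AUTO_INCREMENT PRIMARY KEY,
        hostname VARCHAR(255) NOT NULL
    )
""")
db.commit()
db.close()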

III. Results

[Screenshot: the collected subdomains stored in the MySQL table]

 

IV. Summary

Since I'm just getting started, there is nothing technically sophisticated here; this is simply a record of my small learning exercise.

daydayup