Python 小红书 (Xiaohongshu) Ad Brushing

Reposted

daleiwang 2024-09-14 09:19:39


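Crawler Details

Crawler Basics

Fundamentals

Legitimate crawling
Syntax notes
Principles

Small utilities

Advanced Crawling

Small projects

Write a client that logs in to Douban
Write a crawler that fetches the Baidu Baike entry for "网络爬虫" (web crawler) (link -> http://baike.baidu.com/view/284853.htm) and prints every link containing "view".
Printing entry names and links alone is not much of a challenge; this exercise requires the crawler to accept a search keyword from the user.
Print 10 links first, then ask the user "Do you want to keep reading?"

Specific crawler projects are covered separately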

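Crawler Basics

Fundamentals

URI stands for Uniform Resource Identifier;
URL stands for Uniform Resource Locator.
The difference in one sentence:
a URI is a string that identifies a resource on the Internet, while a URL gives the address of that resource (what we call a website's address is a URL),
so every URL is a URI: URI is the general category and URL is a subset of it.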
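
To make the parts of a URL concrete, the short sketch below splits the Baidu Baike link used later in this article with the standard library's urllib.parse.urlparse. The variable name is illustrative only.

```python
from urllib.parse import urlparse

# The Baidu Baike link used in the exercises later in this article
parts = urlparse("http://baike.baidu.com/view/284853.htm")

print(parts.scheme)  # protocol part of the URL
print(parts.netloc)  # host name
print(parts.path)    # path of the resource on that host
```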

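Legitimate crawling

A site owner creates and edits a robots.txt file in the site's root directory to state which content crawlers should not access. The file follows the Robots Exclusion Standard, a protocol that all reputable search engine spiders obey when crawling. Because it is only a protocol, it depends on voluntary compliance, so it generally has no effect on ill-behaved crawlers.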
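
A well-behaved crawler can check these rules before fetching a page. The sketch below uses the standard library's urllib.robotparser. The robots.txt lines, the example.com URLs, the agent name "my-crawler", and the helper allowed() are all made up for illustration; a real crawler would load the site's actual file with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse rules from lines instead of fetching them

def allowed(url, agent="my-crawler"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```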

Syntax notes

urllib.request.urlopen()
returns an HTTPResponse instance, a class defined in the http.client module.

>>> import urllib.request
>>> response = urllib.request.urlopen("http://www.fishc.com")
>>> type(response)
<class 'http.client.HTTPResponse'>

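The timeout parameter of urlopen() sets how long, in seconds, blocking operations such as the connection attempt may take before timing out.
decode converts bytes in a given encoding into a Unicode string (str); conversely, encode converts a Unicode string into bytes in a given encoding.
urlopen() takes a data parameter. If a value is passed, the HTTP request is sent as POST; if data is None, which is the default, the request is sent as GET.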
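
The encode/decode relationship, and the form in which data has to be supplied, can be sketched as follows. The sample text and the field name "word" are only for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

text = "网络爬虫"

raw = text.encode("utf-8")   # str -> bytes
back = raw.decode("utf-8")   # bytes -> str
print(type(raw).__name__, type(back).__name__)

# urlopen() expects its data argument as bytes, so form fields are
# URL-encoded into a str first and then encoded into bytes.
form = urlencode({"word": text}).encode("utf-8")
print(isinstance(form, bytes))
```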

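Put simply, cookies fall into two categories:
1. Cookies with no expiry time, called "session" cookies, which are cleared automatically when the browser (here, the Python program making the requests) exits.
2. Cookies with an expiry time, which the browser stores and automatically attaches to the next request to the same site (unless they have expired or been cleared).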
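
The two categories can be told apart by whether a cookie carries an Expires or Max-Age attribute. Below is a minimal sketch using the standard library's http.cookies; the cookie names, values, and the helper is_session_cookie() are hypothetical.

```python
from http.cookies import SimpleCookie

jar = SimpleCookie()

# No expiry attributes: a session cookie
jar["sid"] = "abc123"

# Max-Age set: a persistent cookie kept for one hour
jar["theme"] = "dark"
jar["theme"]["max-age"] = 3600

def is_session_cookie(morsel):
    """A cookie with neither Expires nor Max-Age lasts only for the session."""
    return not morsel["expires"] and not morsel["max-age"]

for name, morsel in jar.items():
    print(name, "session" if is_session_cookie(morsel) else "persistent")
```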

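Term	Description
User-Agent	An HTTP request header through which a browser identifies itself to the site it visits: browser type, operating system, rendering engine, and so on.
JSON	A lightweight data-interchange format; here it simply means serializing Python data structures into a string so they are easy to store and use.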
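
Both terms show up constantly in crawler code. The sketch below sets a User-Agent header on a Request object and does a JSON round trip. The User-Agent string, the example.com URL, and the sample record are placeholders, and no request is actually sent.

```python
import json
import urllib.request

# Hypothetical User-Agent string and URL, for illustration only
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}
req = urllib.request.Request("http://example.com/", headers=headers)

# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))

# JSON round trip: Python data -> string -> Python data
record = {"title": "web crawler", "links": 10}
encoded = json.dumps(record)
decoded = json.loads(encoded)
print(encoded)
```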

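Principles

HTTP defines several methods (GET, POST, PUT, HEAD, DELETE, OPTIONS, CONNECT). Call get_method() on a Request object to find out which method it will use to access the server. The most common are GET and POST: when the Request's data argument is set, get_method() returns 'POST'; otherwise it normally returns 'GET'.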
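
This behavior is easy to check without touching the network, since building a Request does not send anything. The URL below is a placeholder, and the third case uses the method argument of Request to show one way of requesting something other than GET or POST.

```python
import urllib.request

url = "http://example.com/"  # placeholder URL; no request is sent

get_req = urllib.request.Request(url)
post_req = urllib.request.Request(url, data=b"word=crawler")
head_req = urllib.request.Request(url, method="HEAD")

print(get_req.get_method())
print(post_req.get_method())
print(head_req.get_method())
```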

Small utilities

Download the first few hundred bytes of a web page

>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.fishc.com')
>>> print(response.read(300))

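Get the HTTP status code from the object returned by urlopen()

import urllib.request
url = 'http://www.fishc.com'
response = urllib.request.urlopen(url)
code = response.getcode()  # same value as response.status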

Write a program that detects the encoding of the page at a given URL.

import urllib.request
import chardet

def main():
    url = input("Enter a URL: ")

    response = urllib.request.urlopen(url)
    html = response.read()

    # Detect the page encoding
    encode = chardet.detect(html)['encoding']
    # GBK is a superset of GB2312, so report GBK instead
    if encode == 'GB2312':
        encode = 'GBK'

    print("This page is encoded as: %s" % encode)
        
if __name__ == "__main__":
    main()
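
chardet guesses the encoding from the raw bytes. Many servers also declare a charset in the Content-Type response header, which can serve as a complementary check. The sketch below parses such a header value with the standard library only; the helper name charset_from_content_type() and the sample header values are hypothetical. With a live response, response.headers.get_content_charset() performs the same lookup.

```python
from email.message import Message

def charset_from_content_type(value, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=GBK"))
print(charset_from_content_type("text/html"))
```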

Write a program that visits each site listed in a file in turn and saves the content returned by each site to a separate file.

import urllib.request
import chardet

def main():
    i = 0
    
    with open("urls.txt", "r") as f:
        # Read the URLs to visit;
        # urls.txt holds one URL per line,
        # so split the content on line breaks
        urls = f.read().splitlines()
        
    for each_url in urls:
        response = urllib.request.urlopen(each_url)
        html = response.read()

        # Detect the page encoding; fall back to UTF-8 if detection fails
        encode = chardet.detect(html)['encoding'] or 'utf-8'
        if encode == 'GB2312':
            encode = 'GBK'
        
        i += 1
        filename = "url_%d.txt" % i

        with open(filename, "w", encoding=encode) as each_file:
            each_file.write(html.decode(encode, "ignore"))

if __name__ == "__main__":
    main()

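Advanced Crawling

Small projects

Write a client that logs in to Douban

Changes from Python 2 to Python 3
1. urllib and urllib2 were merged, and most functionality moved into the urllib.request module.
2. urllib.urlencode() became urllib.parse.urlencode(). It returns a str, and POST data must be bytes, so append .encode('utf-8') to the result.
3. cookielib was renamed to http.cookiejar.
4. CookieJar is the object Python uses to store cookies.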
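
The renames in the list above can be summarized in a few lines. This is only a sketch: the field values are placeholders, and nothing is sent to any server.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Python 2: urllib.urlencode(fields)
# Python 3: urllib.parse.urlencode(fields), then encode to bytes for POST
fields = {"form_email": "user@example.com", "source": "index_nav"}
body = urllib.parse.urlencode(fields).encode("utf-8")
print(body)

# Python 2: cookielib.CookieJar()
# Python 3: http.cookiejar.CookieJar()
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
print(len(jar))  # no cookies until a response sets some
```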

import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Douban login URL
loginurl = 'https://www.douban.com/accounts/login'
cookie = CookieJar()
# Pass the jar to the handler so cookies set at login are stored in it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
 
data = {
    "form_email": "your email",
    "form_password": "your password",
    "source": "index_nav"
}

response = opener.open(loginurl, urllib.parse.urlencode(data).encode('utf-8'))

# Being redirected back to the login page means a captcha must be solved
if response.geturl() == "https://www.douban.com/accounts/login":
    html = response.read().decode()
    
    # URL of the captcha image
    imgurl = re.search('<img id="captcha_image" src="(.+?)" alt="captcha" class="captcha_image"/>', html)
    if imgurl:
        url = imgurl.group(1)
        # Save the captcha image in the current directory
        res = urllib.request.urlretrieve(url, 'v.jpg')

        # Extract the captcha-id parameter
        captcha = re.search('<input type="hidden" name="captcha-id" value="(.+?)"/>', html)

        if captcha:
            vcode = input('Enter the code shown in the image: ')
            data["captcha-solution"] = vcode
            data["captcha-id"] = captcha.group(1)
            data["user_login"] = "登录"

            # Submit the form again with the captcha answer
            response = opener.open(loginurl, urllib.parse.urlencode(data).encode('utf-8'))

            # A successful login redirects to the home page
            if response.geturl() == "http://www.douban.com/":
                print('Login succeeded!')
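
The two re.search calls above depend on the exact shape of the login page's HTML. The sketch below runs shortened versions of the same patterns against a made-up fragment so the capture groups can be inspected offline. The fragment, the example.com URL, and the id value are hypothetical and not taken from the real page.

```python
import re

# Hypothetical HTML fragment shaped like the tags the patterns above expect
sample = (
    '<img id="captcha_image" src="https://example.com/captcha?id=abc" '
    'alt="captcha" class="captcha_image"/>'
    '<input type="hidden" name="captcha-id" value="abc"/>'
)

# (.+?) is non-greedy, so each group stops at the first closing quote
img = re.search(r'<img id="captcha_image" src="(.+?)"', sample)
cid = re.search(r'name="captcha-id" value="(.+?)"', sample)

captcha_url = img.group(1) if img else None
captcha_id = cid.group(1) if cid else None
print(captcha_url, captcha_id)
```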

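Write a crawler that fetches the Baidu Baike entry for "网络爬虫" (web crawler) (link -> http://baike.baidu.com/view/284853.htm) and prints every link containing "view" in the form "entry name -> URL".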

import urllib.request
import re
from bs4 import BeautifulSoup

def main():
    url = "http://baike.baidu.com/view/284853.htm"
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser") # use Python's built-in parser
    
    for each in soup.find_all(href=re.compile("view")):
        print(each.text, "->", ''.join(["http://baike.baidu.com", each["href"]]))
        # join() is used here instead of +; when many pieces are concatenated, join() is usually faster

if __name__ == "__main__":
    main()
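
For readers who want to see the filtering logic without installing BeautifulSoup, here is a rough standard-library alternative built on html.parser. It is a different technique from the one used above and only looks at <a> tags, whereas find_all(href=...) matches any tag with an href. The class name ViewLinkParser and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class ViewLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose href contains "view"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "view" in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only collect text while inside a matching <a> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None

# Hypothetical HTML fragment, for illustration only
sample = '<a href="/view/1.htm">Spider</a><a href="/item/2">Other</a>'
parser = ViewLinkParser()
parser.feed(sample)
for text, href in parser.links:
    print(text, "->", "http://baike.baidu.com" + href)
```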

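Printing entry names and links alone is not much of a challenge; this exercise requires the crawler to accept a search keyword from the user.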

import urllib.request
import urllib.parse
import re 
from bs4 import BeautifulSoup

def main():
    keyword = input("Enter a keyword: ")
    keyword = urllib.parse.urlencode({"word":keyword})
    response = urllib.request.urlopen("http://baike.baidu.com/search/word?%s" % keyword)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")

    for each in soup.find_all(href=re.compile("view")):
        content = ''.join([each.text])
        url2 = ''.join(["http://baike.baidu.com", each["href"]])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = ''.join([content, soup2.h2.text])
        content = ''.join([content, " -> ", url2])
        print(content)

if __name__ == "__main__":
    main()
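
The key step in the program above is turning the user's keyword into a valid query string. The sketch below isolates that step so it can be checked without a network connection; the helper name build_search_url() is hypothetical, and the search URL is the one used in the code above.

```python
from urllib.parse import urlencode, parse_qs

def build_search_url(keyword):
    """Build the search URL used above for a user-supplied keyword."""
    query = urlencode({"word": keyword})  # percent-encodes non-ASCII text
    return "http://baike.baidu.com/search/word?%s" % query

url = build_search_url("网络爬虫")
print(url)

# Decoding the query string recovers the original keyword
print(parse_qs(url.split("?", 1)[1]))
```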

Print 10 links first, then ask the user "Do you want to keep reading?"

import urllib.request
import urllib.parse
import re 
from bs4 import BeautifulSoup

def test_url(soup):
    result = soup.find(text=re.compile("百度百科尚未收录词条"))
    if result:
        print(result[0:-1]) # drop the stray quotation mark Baidu appends at the end
        return False
    else:
        return True

def summary(soup):
    word = soup.h1.text
    # If there is a subtitle, print it together with the title
    if soup.h2:
        word += soup.h2.text
    # Print the title
    print(word)
    # Print the summary
    if soup.find(class_="lemma-summary"):
        print(soup.find(class_="lemma-summary").text)   

def get_urls(soup):
    for each in soup.find_all(href=re.compile("view")):
        content = ''.join([each.text])
        url2 = ''.join(["http://baike.baidu.com", each["href"]])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = ''.join([content, soup2.h2.text])
        content = ''.join([content, " -> ", url2])
        yield content
        

def main():
    word = input("Enter a keyword: ")
    keyword = urllib.parse.urlencode({"word":word})
    response = urllib.request.urlopen("http://baike.baidu.com/search/word?%s" % keyword)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")

    if test_url(soup):
        summary(soup)
        
        print("Related links:")
        each = get_urls(soup)
        while True:
            try:
                for i in range(10):
                    print(next(each))
            except StopIteration:
                break
            
            command = input("Enter any character to keep printing, or q to quit: ")
            if command == 'q':
                break
    
if __name__ == "__main__":
    main()
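
The "10 at a time" paging in main() can also be written with itertools.islice, which avoids catching StopIteration by hand. This is just one possible sketch: the helper name batches() is hypothetical, and a generator of fake strings stands in for get_urls(soup) so the example runs offline.

```python
from itertools import islice

def batches(iterator, size=10):
    """Yield lists of up to `size` items until the iterator is exhausted."""
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Stand-in for get_urls(soup): 25 fake link strings
fake_links = ("link %d" % n for n in range(25))
sizes = [len(chunk) for chunk in batches(fake_links)]
print(sizes)
```

In the real program, each chunk would be printed and followed by the prompt asking whether to continue.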