Python image CAPTCHAs: handling CAPTCHAs in Python

Reposted

mob64ca140ce312 2023-09-14 17:14:06


Contents

1. Loading the CAPTCHA image

Pillow vs. PIL

2. Extracting text with optical character recognition
3. Handling complex CAPTCHAs
4. References

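In this section we interact with web pages that return content in response to user input. The topics are:

Sending a POST request to submit a form;
Logging in to a website using cookies;
Mechanize, a high-level module that simplifies form submission.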
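
As a rough illustration of the first two items, the sketch below builds a POST request and a cookie-aware opener using only the standard library. The helper names `build_post_request` and `build_cookie_opener` and the field values are made up for this example, and nothing is actually sent over the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    data = urllib.parse.urlencode(fields).encode('utf-8')
    # Supplying a request body makes urllib use POST instead of GET
    return urllib.request.Request(url, data=data)


def build_cookie_opener():
    """Return an opener that stores cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Placeholder form values, for illustration only
request = build_post_request(
    'http://example.webscraping.com/user/register',
    {'email': 'a@b.com', 'password': 'secret'},
)
opener, jar = build_cookie_opener()
print(request.get_method())  # POST
print(request.data)          # b'email=a%40b.com&password=secret'
```

Calling `opener.open(request)` would send the form, and any cookies set by the server would be kept in `jar` for later requests made through the same opener.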

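1. Loading the CAPTCHA image

Before the CAPTCHA can be analyzed, the image first has to be obtained from the form.

Pay attention to whether the image is loaded from a separate URL or embedded directly in the page.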
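
One possible way to tell the two cases apart is to look at the `src` attribute of the image: an embedded image is normally a `data:` URI, while anything else is a URL that must be downloaded separately. The sketch below uses only the standard library; the function name `resolve_image_src` and the path `/captcha/1.png` are hypothetical.

```python
import base64
from urllib.parse import unquote_to_bytes, urljoin


def resolve_image_src(src, page_url):
    """Return ('embedded', image_bytes) for a data URI,
    or ('remote', absolute_url) for an image loaded from another URL."""
    if src.startswith('data:'):
        header, _, payload = src.partition(',')
        if ';base64' in header:
            return 'embedded', base64.b64decode(payload)
        # Data URIs without the base64 marker are percent-encoded
        return 'embedded', unquote_to_bytes(payload)
    # Relative paths are resolved against the page that contains the <img>
    return 'remote', urljoin(page_url, src)


page = 'http://example.webscraping.com/user/register'
embedded_src = 'data:image/png;base64,' + base64.b64encode(b'fake-bytes').decode('ascii')
print(resolve_image_src(embedded_src, page))      # ('embedded', b'fake-bytes')
print(resolve_image_src('/captcha/1.png', page))  # ('remote', 'http://example.webscraping.com/captcha/1.png')
```

In the embedded case the returned bytes can be wrapped in `io.BytesIO` and opened as an image; in the remote case the URL has to be fetched first, ideally with the same cookie-aware opener used for the form.
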
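To work with the image in Python we need the Pillow package.

pip3 install Pillow

Pillow provides a convenient Image class with many high-level methods that are useful for processing CAPTCHA images.

The CAPTCHA image is first preprocessed with Pillow and only then used for text extraction. Extracting the text requires optical character recognition (OCR); a widely used OCR engine is the open-source Tesseract, whose development was long sponsored by Google.

CAPTCHA text is usually black on a lighter background, so the text can be separated out by checking whether each pixel is black. This process is called thresholding.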
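
To make the idea concrete, here is a minimal, Pillow-free sketch of thresholding on a grid of grayscale values (0 is black, 255 is white). The function name `threshold` and the sample values are invented for illustration; the `ocr.py` listing later in this article applies the same mapping to a real image with Pillow's `Image.point`.

```python
def threshold(gray_rows, cutoff=1):
    """Map pixels darker than cutoff to 0 (black) and all others to 255 (white)."""
    return [[0 if value < cutoff else 255 for value in row] for row in gray_rows]


# A tiny 2x3 "image" of grayscale values
sample = [
    [0, 30, 200],
    [120, 0, 255],
]
print(threshold(sample))              # [[0, 255, 255], [255, 0, 255]]
print(threshold(sample, cutoff=128))  # [[0, 0, 255], [0, 0, 255]]
```

With `cutoff=1` only completely black pixels are kept as text, which suits CAPTCHAs whose text is pure black. A higher cutoff keeps dark gray pixels as well, at the risk of also keeping dark background noise.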

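Pillow vs. PIL

Pillow is a fork of the well-known Python Imaging Library (PIL), which is no longer maintained. PIL only supported Python up to 2.7 and had gone without updates for years, so a group of volunteers created a compatible version based on it. That version, named Pillow, supports current Python 3.x releases and adds many new features.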

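2. Extracting text with optical character recognition

Optical character recognition (OCR) extracts text from images. A widely used OCR engine is the open-source Tesseract, whose development was long sponsored by Google.

Note: Tesseract was designed for more typical text, such as book pages with a uniform background. To use it effectively on CAPTCHAs, first modify the image to remove the background noise and keep only the text.

The Tesseract OCR engine itself must be installed before installing the Python package below.

pip3 install pytesseract

Example code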

# form.py
# -*- coding: utf-8 -*-

import base64
import http.cookiejar as cookielib
import urllib.parse
import urllib.request
from io import BytesIO

import lxml.html
from PIL import Image  # provided by the Pillow package


REGISTER_URL = 'http://example.webscraping.com/user/register'


def extract_image(html):
    """Extract the CAPTCHA image embedded in the HTML."""
    # Parse the HTML into an lxml element tree
    tree = lxml.html.fromstring(html)
    # Read the src attribute of the CAPTCHA image
    # (cssselect() needs the separate cssselect package to be installed)
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    # str.partition(sep) splits the string into a 3-tuple around sep;
    # the last element is the payload without the
    # "data:image/png;base64," header
    img_data = img_data.partition(',')[-1]
    # Decode the Base64 payload to obtain the binary image data
    binary_img_data = base64.b64decode(img_data)
    # Wrap the binary data in a file-like object for Pillow
    file_like = BytesIO(binary_img_data)
    img = Image.open(file_like)
    # img.save('test.png')
    return img


def parse_form(html):
    """Extract the name and value of every named input tag in the form."""
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data


def register(first_name, last_name, email, password, captcha_fn):
    """Automate registration.

    Download the registration page, extract the form, fill in the name,
    password and email, extract the CAPTCHA image, read its text, add the
    result to the form and submit it to the server.
    """
    # CookieJar instance used to store cookies
    cj = cookielib.CookieJar()
    # Build an opener whose handler manages this cookie jar
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # Fetch the HTML of the registration page
    html = opener.open(REGISTER_URL).read()
    # Extract the name/value pairs of the input tags
    form = parse_form(html)
    # Fill in the user-supplied fields
    form['first_name'] = first_name
    form['last_name'] = last_name
    form['email'] = email
    form['password'] = form['password_two'] = password
    # Extract the CAPTCHA image from the HTML
    img = extract_image(html)
    # Extract the text from the CAPTCHA image
    captcha = captcha_fn(img)
    # Store the CAPTCHA text in the form
    form['recaptcha_response_field'] = captcha
    # Encode the form data; in Python 3 the request body must be bytes
    encoded_data = urllib.parse.urlencode(form).encode('utf-8')
    # POST the form data to the registration URL and read the response
    request = urllib.request.Request(REGISTER_URL, encoded_data)
    response = opener.open(request)
    # If the response URL is no longer the registration URL,
    # the registration succeeded
    success = '/user/register' not in response.geturl()
    return success

# ocr.py
# -*- coding: utf-8 -*-

import csv
import string
from PIL import Image
import pytesseract
from form import register


REGISTER_URL = 'http://example.webscraping.com/user/register'


def main():
    print(register('Test Account', 'Test Account', 'example@webscraping.com', 'example', ocr))


def ocr(img):
    """Extract the text from a CAPTCHA image."""
    # Threshold the image to ignore the background and keep the text
    # Convert to grayscale
    gray = img.convert('L')
    # gray.save('captcha_greyscale.png')
    # Keep only pixels whose gray value is below 1, i.e. only completely
    # black pixels survive; everything else becomes white
    bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
    # bw.save('captcha_threshold.png')
    word = pytesseract.image_to_string(bw)
    # Drop everything except ASCII letters and normalize to lower case
    ascii_word = ''.join(c for c in word if c in string.ascii_letters).lower()
    return ascii_word


def test_samples():
    """Test the accuracy of OCR on the sample images."""
    correct = total = 0
    with open('samples/samples.csv', newline='') as f:
        for filename, text in csv.reader(f):
            img = Image.open('samples/' + filename)
            if ocr(img) == text:
                correct += 1
            total += 1
    print('Accuracy: %d/%d' % (correct, total))


if __name__ == '__main__':
    main()
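
The post-processing and accuracy check in `ocr.py` can be tried without Tesseract installed. The sketch below repeats the letter filter from `ocr()` and the counting loop from `test_samples()`, but feeds them made-up OCR output; the file names, expected texts and the `fake_ocr_output` dictionary are invented for this example.

```python
import csv
import io
import string


def clean_ocr_text(raw):
    """Keep only ASCII letters and lower-case them, as ocr() does."""
    return ''.join(c for c in raw if c in string.ascii_letters).lower()


def accuracy(rows, recognize):
    """Count how many (filename, expected_text) rows are recognized correctly.

    recognize is any callable that maps a filename to raw OCR text.
    """
    correct = total = 0
    for filename, expected in rows:
        if clean_ocr_text(recognize(filename)) == expected:
            correct += 1
        total += 1
    return correct, total


# Invented stand-ins for samples/samples.csv and for Tesseract's raw output
sample_csv = io.StringIO('sample1.png,strange\nsample2.png,closer\n')
fake_ocr_output = {'sample1.png': ' Str4nge\n', 'sample2.png': 'Clo-ser \n'}

correct, total = accuracy(csv.reader(sample_csv), fake_ocr_output.get)
print('Accuracy: %d/%d' % (correct, total))  # Accuracy: 1/2
```

The first sample fails because a letter was misread as a digit, and filtering cannot restore it. The second succeeds because the stray hyphen and whitespace are simply dropped.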

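3. Handling complex CAPTCHAs

The code above only recognizes fairly clean CAPTCHA images. Websites may use more complex CAPTCHAs, for example text whose color is close to the background color and therefore hard to separate, or text that is not horizontal and has to be rotated by some angle first.

When the image contains too much noise, running Tesseract on it directly is far less likely to yield the correct text, so the image needs more processing to remove the noise.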
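
As one simple example of such extra processing, the sketch below removes isolated black pixels from an already thresholded image, represented here as a grid of 0 (black) and 255 (white) values. This is only one of many possible noise filters and is not taken from the original code; the function name `remove_isolated_pixels` is made up.

```python
def remove_isolated_pixels(bw_rows):
    """Turn black pixels (0) that have no black 8-neighbour into white (255)."""
    height = len(bw_rows)
    width = len(bw_rows[0]) if height else 0
    cleaned = [row[:] for row in bw_rows]
    for y in range(height):
        for x in range(width):
            if bw_rows[y][x] != 0:
                continue
            # Look at the up to eight surrounding pixels, clipped at the edges
            has_black_neighbour = any(
                bw_rows[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                cleaned[y][x] = 255
    return cleaned


B, W = 0, 255
noisy = [
    [B, W, W, W],  # lone speck in the corner
    [W, W, W, W],
    [W, W, B, B],  # two connected pixels, treated as part of a stroke
    [W, W, W, W],
]
for row in remove_isolated_pixels(noisy):
    print(row)
```

The speck in the corner is erased while the connected pair survives. Real CAPTCHA noise is often larger than a single pixel, so a filter like this is a starting point rather than a complete solution.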

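For complex CAPTCHA images, the usual approach is to use a CAPTCHA solving service. These services are generally paid, although some free ones appear to exist.

The typical flow is:

1. Send the CAPTCHA image to the solving service's API.
2. A human at the service looks at the image, and the solved text is returned in the HTTP response.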
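
The two steps above can be sketched as follows. Every detail of the API here is hypothetical: the JSON field names `key`, `image`, `status` and `text`, the function names, and the `send` callable that stands in for the HTTP round trip. A real service defines its own request and response format, so its documentation should be consulted.

```python
import base64
import json


def build_payload(image_bytes, api_key):
    """Hypothetical request body: the Base64-encoded image plus an account key."""
    return json.dumps({
        'key': api_key,
        'image': base64.b64encode(image_bytes).decode('ascii'),
    })


def parse_response(body):
    """Hypothetical response format: {"status": "ok", "text": "..."}."""
    result = json.loads(body)
    if result.get('status') != 'ok':
        raise ValueError('CAPTCHA could not be solved: %r' % result)
    return result['text']


def solve_captcha(image_bytes, api_key, send):
    """send is any callable that posts the payload and returns the response body."""
    return parse_response(send(build_payload(image_bytes, api_key)))


def fake_send(payload):
    # Stand-in for the HTTP request, so the sketch runs offline
    assert 'image' in json.loads(payload)
    return json.dumps({'status': 'ok', 'text': 'strange'})


print(solve_captcha(b'fake image bytes', 'my-api-key', fake_send))  # strange
```

Because a person has to look at the image, a real service may take several seconds to answer, so an actual client would usually need a timeout and some retry logic around `send`.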

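4. References

[1] Web Scraping with Python (Chinese edition: 《用python写web爬虫》)
[2] Image recognition in Python 3 with the Pillow, tesseract-ocr and pytesseract modules
[3] [Python image recognition] Image recognition from beginner to expert, part 1
[4] Python artificial intelligence: image recognition
[5] pytesseract
[6] Liao Xuefeng's Python tutorial: Pillow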
