Python Code for Document-Scan OCR

Reposted

mob64ca1417736e 2024-09-14 11:24:16


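Open-Source Python Tools Column

Py0001: A Client/Server PDF Text-Recognition Tool Built with Python and PHP

Preface

Development environment
Directory structure
Demo

Python side

Creating the main window with tkinter

Adding widgets with tkinter
Positioning widgets in the window

Reading the local config file
Extracting images from the PDF

Calling the OCR API to recognize text in images
Sending a POST request from Python to the PHP backend

PHP backend

Connecting PHP to MySQL

Inserting data into MySQL
Resetting the free OCR quota

Handling request bodies in PHP
MySQL table structure

The complete code is open-sourced on GitHub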

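Preface

This project grew out of a request from my school: extract the text from PDF files in bulk, running text recognition on the images inside each PDF.
The first version was very basic and could only recognize one image at a time. I kept no backup of it. Its main techniques were procedural programming, GUI programming, PDF parsing, and Baidu text recognition, and it used these Python packages: tkinter, pyinstaller, baidu-aip, pdfminer, pdfminer3k, fitz... Overall it followed another blogger's article.

The second version is a rewrite in an object-oriented style, with every feature implemented in a class. It supports text recognition over multiple images in one run, PDF-to-txt conversion, and usage records stored in a database by the PHP backend. Baidu AI's QPS limit rules out high concurrency (free accounts get only 2 QPS), so I did not use multithreading.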

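Development environment

Operating system: Windows 10 LTSC (2018 release)
Python version: 3.8.6 (any Python 3 release should work)
Target environment: Windows, x64 and x86 (each packaged separately with pyinstaller; for details see "Packaging with pyinstaller on x64 and x86 Python")
Python packages: tkinter (standard library), pyinstaller, baidu-aip, pdfminer, pdfminer3k, fitz, requests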

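Directory structure

Source tree:

convertPDF
 |——— api
      |——— index.php	# backend PHP code
 |
 |——— src
      |——— app.ini		# OCR API credentials and the backend API URL
      |——— myConvert.py	# module for PDF-to-image and PDF-to-text conversion
      |——— pdf_Ocr.py	# module for OCR on images rendered from a PDF
      |——— tkPDF.py		# main program entry point

Executable layout:

—— src  (x64 or x86)
 |——— app.ini		# OCR API credentials and the backend API URL
 |——— tkPDF.exe		# main program
 |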

Demo

Main window:

[Screenshot: main window]

PDF recognition result

Omitted

Image recognition result

Omitted

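Python side

Note: I use VS Code with flake8 as the formatter, but the maximum line length stays at the default of 70 characters. I tried changing it by following other people's articles, with no effect...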

Creating the main window with tkinter

tkPDF.py:

# coding=utf-8

import tkinter

class tkPDF:
    """
    Tk window for the PDF conversion tool
    title: window title
    width: window width
    height: window height
    """

    window = tkinter.Tk()
    __title: str
    __width: int
    __height: int

    def __init__(self, title, width: int, height: int):
        self.__title = title
        self.__width = width
        self.__height = height

        self.window.title(self.__title)
        self.window.geometry('{0}x{1}'.format(self.__width, self.__height))

        ...

        # Start the event loop (blocks until the window is closed)
        self.window.mainloop()

Adding widgets with tkinter

tkPDF.py:

# coding=utf-8

import tkinter

class tkPDF:
    ...

    # Create the Button widgets ('宋体' is the SimSun font, '黑体' is SimHei)
    btn_pic = tkinter.Button(window,
                             font=('宋体', 11, 'normal'),
                             text='Extract text from images')
    btn_pdf_img = tkinter.Button(window,
                                 font=('宋体', 11, 'normal'),
                                 text='PDF (scanned) to Word document')
    btn_pdf_word = tkinter.Button(window,
                                  font=('宋体', 11, 'normal'),
                                  text='PDF (text) to Word document')
    btn_refresh = tkinter.Button(window,
                                 font=('黑体', 9, 'bold'),
                                 text='Refresh')

    # Create the Label widget
    label = tkinter.Label(window,
                          font=('黑体', 11, 'bold'),
                          justify='left',
                          text='High-accuracy recognitions today: ---\n\nStandard recognitions today: -----')

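Positioning widgets in the window

Widgets are positioned in the window with the place geometry manager.

Some of its methods:
root = tkinter.Tk()			# create the Tk window
btn = tkinter.Button(root)	# create a widget attached to root
btn.place(width=w, height=h)	# set the widget's width and height
btn.place(x=x, y=y)			# position relative to the window's top-left corner
btn.place_info()			# get the widget's place options, returned as a dict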

tkPDF.py:

# coding=utf-8

import tkinter

class tkPDF:
    ...

    def __init__(self, title, width: int, height: int):
        ...

        # Set each widget's width and height
        self.btn_pic.place(width=150, height=40)
        self.btn_pdf_img.place(width=180, height=40)
        self.btn_pdf_word.place(width=180, height=40)
        self.btn_refresh.place(width=100, height=30)

        # Position the Label
        self.label.place(x=350, y=220)

        # Lay the buttons out left to right
        _x = 60  # x of the top-left corner
        _y = 100  # y of the top-left corner
        margin = 30  # gap between widgets

        self.btn_pic.place(x=_x, y=_y)
        _x += int(self.btn_pic.place_info()['width'])  # width of the widget just placed
        _x += margin

        self.btn_pdf_img.place(x=_x, y=_y)
        _x += int(self.btn_pdf_img.place_info()['width'])  # width of the widget just placed
        _x += margin

        self.btn_pdf_word.place(x=_x, y=_y)
        self.btn_refresh.place(x=580, y=228)
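
The left-to-right layout above is simple arithmetic: each widget starts where the previous one ends, plus the margin. A minimal sketch of that calculation follows, kept free of tkinter so it runs without a display. row_positions is a hypothetical helper, not part of tkPDF.py.

```python
from typing import List


def row_positions(start_x: int, widths: List[int], margin: int) -> List[int]:
    """Return the x coordinate of each widget placed left to right.

    Hypothetical helper: mirrors the `_x += width; _x += margin` steps.
    """
    positions = []
    x = start_x
    for width in widths:
        positions.append(x)
        x += width + margin
    return positions


# Widths and margin taken from the listing above
xs = row_positions(60, [150, 180, 180], 30)
print(xs)  # [60, 240, 450]
```

Each returned value could then be passed to place(x=..., y=...) for the matching button.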

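Reading the local config file

Reading settings from a local config file makes it easy to switch API endpoints. The file has two sections, SERVER and APP_INFO, and is read from app.ini in the directory that contains the main program.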

app.ini:

[SERVER]
url=....

[APP_INFO]
app_id=....
api_key=....
secret_key=....

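Note: you can create a shortcut to the exe and copy the shortcut to any other path; app.ini is still found and read.

The function that reads the local config is shown below.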

get_env(self):

import os
import sys
from tkinter import messagebox

class tkPDF:
    ...

    def get_env(self):
        """
        Load the config file and store its parameters
        """

        try:
            # app.ini sits in the same directory as the main program
            INI_file = os.path.join(os.path.dirname(sys.argv[0]), 'app.ini')
            with open(INI_file, 'r') as f:
                INI_rows = f.read().split('\n')
            for i in range(len(INI_rows)):
                # Split on the first '=' only, so values that contain '=' stay intact
                if INI_rows[i].find('[SERVER]') > -1:
                    self.__Url = INI_rows[i + 1].split('=', 1)[1].strip()  # url
                elif INI_rows[i].find('[APP_INFO]') > -1:
                    self.__APP_ID = INI_rows[i + 1].split('=', 1)[1].strip()  # app_id
                    self.__API_KEY = INI_rows[i + 2].split('=', 1)[1].strip()  # api_key
                    self.__SECRET_KEY = INI_rows[i + 3].split('=', 1)[1].strip()  # secret_key
        except Exception:
            messagebox.showinfo('Notice', 'Could not load the config file app.ini. Please check that it exists.')
            sys.exit(0)
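
The standard library's configparser can parse the same file without manual line indexing. The sketch below is an alternative, not the author's code. It reads from a string so it runs as-is, and the sample values and the load_settings name are placeholders.

```python
import configparser

# Placeholder values; the real app.ini holds the backend URL and Baidu credentials
SAMPLE_INI = """
[SERVER]
url=http://example.invalid/api/index.php?v=1

[APP_INFO]
app_id=my-app-id
api_key=my-api-key
secret_key=my-secret-key
"""


def load_settings(ini_text: str) -> dict:
    """Parse the SERVER and APP_INFO sections into a flat dict."""
    # interpolation=None keeps '%' characters in values (e.g. in URLs) literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        'url': parser['SERVER']['url'],
        'app_id': parser['APP_INFO']['app_id'],
        'api_key': parser['APP_INFO']['api_key'],
        'secret_key': parser['APP_INFO']['secret_key'],
    }


settings = load_settings(SAMPLE_INI)
print(settings['url'])
```

With a real file, parser.read(path) would replace read_string. A missing section raises KeyError, which gives a more specific error than an index lookup on the raw lines.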

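Extracting images from the PDF

os.mkdir() creates an image folder and a pdf folder under self.filedir, and fitz (PyMuPDF) then renders each page of the PDF to a PNG image.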

__parse_img(self, pdf_fullpath):

import os
import shutil

import fitz  # PyMuPDF

class ConvertPDF:
    ...

    def __parse_img(self, pdf_fullpath):
        """
        Convert a single PDF to page images (the images are then passed to OCR)
        """

        pdf_basename = os.path.splitext(os.path.basename(pdf_fullpath))[0]
        doc = fitz.Document(pdf_fullpath)
        # Create a fresh folder for the images
        img_path = self.filedir + '/%s_img' % pdf_basename
        if os.path.exists(img_path):
            shutil.rmtree(img_path)
        os.mkdir(img_path)
        # Create the folder that stores the PDFs
        pdf_path = self.filedir + '/pdf'
        if not os.path.exists(pdf_path):
            os.mkdir(pdf_path)
        # Keep a copy of the PDF
        shutil.copy(pdf_fullpath, pdf_path)

        # Note: these camelCase names come from older PyMuPDF releases; newer
        # releases use page_count, prerotate, get_pixmap and Pixmap.save.
        step = 100 / doc.pageCount  # progress per page, in percent
        for pg in range(doc.pageCount):
            page = doc[pg]
            rotate = 0
            # A zoom factor of 2 on each axis gives an image with four times as many pixels.
            zoom_x = 2.0
            zoom_y = 2.0
            trans = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
            pm = page.getDisplayList().getPixmap(matrix=trans, alpha=False)
            pm.writePNG(img_path + '/%s.png' % pg)
            print('PDF to image progress: ' + str(round(step * (pg + 1), 2)) + '%')
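
To see what the zoom factor does to image size: PDF page dimensions are measured in points (1/72 inch), and at a zoom of 1 the renderer produces roughly one pixel per point. The sketch below works out the approximate pixel size for an A4 page. rendered_size is a hypothetical helper and does not call fitz.

```python
def rendered_size(width_pt: float, height_pt: float, zoom: float):
    """Approximate pixel size and effective DPI of a page rendered at a zoom."""
    width_px = round(width_pt * zoom)
    height_px = round(height_pt * zoom)
    dpi = 72 * zoom  # one pixel per point at zoom 1 corresponds to 72 DPI
    return width_px, height_px, dpi


# A4 is about 595 x 842 points
base = rendered_size(595, 842, 1.0)
doubled = rendered_size(595, 842, 2.0)
print(doubled)  # (1190, 1684, 144.0)
# Doubling both axes quadruples the pixel count
print(doubled[0] * doubled[1] / (base[0] * base[1]))  # 4.0
```

A higher zoom generally helps OCR on small print, at the cost of larger PNG files and larger uploads to the OCR service.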

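Calling the OCR API to recognize text in images

This uses two Baidu AI OCR endpoints: the high-accuracy version, basicAccurate, and the standard version, basicGeneral.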

pdf_Ocr.py:

# coding=utf-8

import os
import threading
import time

import aip.ocr

class pdf_Ocr:
    """
    OCR for PDFs
    """
    __Url: str
    __client: aip.ocr.AipOcr

    lock_ = threading.Lock()

    def __init__(self, client: aip.ocr.AipOcr, url: str):
        self.__client = client
        self.__Url = url

    def Ocr_img(self, img_path, save_path, filename: str = None, is_pdf: bool = False):
        """
        Recognize the text in images
        """

        # Helper that truncates to two decimal places (unused)
        # r2 = lambda r: str(r)[:str(r).rfind('.') + 2 + 1]
        if is_pdf:
            mlist = os.listdir(img_path)
            suffix = mlist[0][-3:]
            imgs = [
                '%i.%s' % (i, suffix)
                for i in sorted([int(mi[:str(mi).rfind('.')]) for mi in mlist])
            ]
            file_basename = os.path.basename(filename)
            pdfsize = os.path.getsize(filename)
        else:
            imgs = os.listdir(img_path)
            file_basename = None
            pdfsize = None

        if is_pdf:  # name the output file after the PDF
            writefile = open(
                save_path +
                '/%s.txt' % file_basename[:str(file_basename).rfind('.')], 'a')
        else:
            writefile = None

        processCount = 100 / len(imgs)
        beilv = processCount
        Mymessage = ''
        Mycount = {}
        Mytype = ''
        pic_count = 0
        for imgfile in imgs:
            print('Image OCR progress: ' + str(round(processCount, 2)) + '%')
            processCount += beilv

            if not is_pdf:  # name the output file after the image ('图片' means "image")
                writefile = open(
                    save_path +
                    '/图片_%s.txt' % imgfile[:str(imgfile).rfind('.')], 'w')

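            with open(img_path + '/%s' % imgfile, 'rb') as img:
                read = img.read()
                # Call the OCR API. `result` is the JSON reply from the PHP
                # backend; the request that produces it is shown in the
                # POST request section below.
                message = None
                try:
                    # `with` releases the lock even if the call raises
                    with self.lock_:
                        message = self.get_message(read, result['type'], False)
                except Exception as e:
                    return {
                        'errCode': 400,
                        'message': e,
                        'result': 'Connection timed out, please check the network'
                    }
                # An error reply has no "words_result", so fall back to an empty list
                for item in message.get("words_result", []):
                    words = item.get("words")
                    writefile.write(words)
                    writefile.write('\n')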
                writefile.write('\n\n%s\n\n' % ('-' * 70))
                Mymessage = result['message']
                Mycount = {
                    'bA': result['count']['bA'],
                    'bG': result['count']['bG']
                }
                Mytype = result['type']
                pic_count += 1

            if not is_pdf:
                writefile.close()

        if is_pdf:
            writefile.close()
        return {
            'errCode': 0,
            'message': Mymessage,
            'count': Mycount,
            'type': Mytype,
            'pic_count': pic_count
        }

    def get_message(self, read, type, is_exit: bool):
        """
        Call the OCR endpoint named by `type`.
        If is_exit is False, a failed call is retried once after 2 seconds;
        a second failure is raised to the caller.
        """
        try:
            if type == 'basicAccurate':
                return self.__client.basicAccurate(read)
            return self.__client.basicGeneral(read)
        except Exception:
            if is_exit:
                raise
            # Wait 2 seconds and try once more; give up if that fails too
            time.sleep(2)
            return self.get_message(read, type, True)
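
The retry-once pattern in get_message can also be written as a small standalone helper. This sketch is illustrative only: call_with_retry and flaky are hypothetical names, and the delay is shortened so the example runs quickly.

```python
import time


def call_with_retry(func, *args, retries: int = 1, delay: float = 2.0):
    """Call func(*args); on failure wait `delay` seconds and retry.

    Once `retries` retries have failed, the last exception propagates.
    """
    attempt = 0
    while True:
        try:
            return func(*args)
        except Exception:
            if attempt >= retries:
                raise
            attempt += 1
            time.sleep(delay)


calls = []


def flaky(data):
    """Stand-in for an OCR call: fails the first time, then succeeds."""
    calls.append(data)
    if len(calls) == 1:
        raise ConnectionError('simulated timeout')
    return {'words_result': [{'words': data}]}


result = call_with_retry(flaky, 'hello', delay=0.01)
print(result, len(calls))
```

Keeping the retry count as a parameter makes it easy to stay within the 2 QPS limit mentioned in the preface, since each retry is spaced out by the delay.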

Sending a POST request from Python to the PHP backend

pdf_Ocr.py:

import requests

class pdf_Ocr:
    ...

    def Ocr_img(self, img_path, save_path, filename: str = None, is_pdf: bool = False):
        ...

        # Report the usage to the backend, which records it in the database
        headers = {
            'Content-Type': 'application/json',
            'Connection': 'close'
        }
        if is_pdf:
            data = {
                'filename': filename,
                'filesize': pdfsize,
                'count': 1,
                'a': self.__client._appId,
                'nf': imgfile
            }
        else:
            data = {
                'filename': img_path + '/%s' % imgfile,
                'filesize':
                os.path.getsize(img_path + '/%s' % imgfile),
                'count': 1,
                'a': self.__client._appId
            }
        # The timeout is an added safeguard so a dead backend cannot hang the GUI
        res = requests.post(url=self.__Url, json=data, headers=headers, timeout=10)
        try:
            result = res.json()
        except ValueError as e:  # the reply was not valid JSON
            return {'errCode': 400, 'message': e, 'result': res.text}
        # Check the backend's reply
        if result['errCode'] != 0:
            return {
                'errCode': result['errCode'],
                'message': result['message'],
                'result': result['count']
            }
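
The reply handling above can be isolated into a function that is easy to test without a network. This is a minimal sketch, assuming the backend replies with a JSON object containing errCode, message, and count as in the listing. parse_backend_reply is a hypothetical name.

```python
import json


def parse_backend_reply(text: str) -> dict:
    """Turn the backend's response text into the dict shape used by Ocr_img."""
    try:
        result = json.loads(text)
    except ValueError as e:
        # Not JSON at all, e.g. a PHP error page
        return {'errCode': 400, 'message': str(e), 'result': text}
    if not isinstance(result, dict) or 'errCode' not in result:
        return {'errCode': 400, 'message': 'unexpected reply', 'result': text}
    if result['errCode'] != 0:
        return {
            'errCode': result['errCode'],
            'message': result.get('message'),
            'result': result.get('count'),
        }
    return result


ok = parse_backend_reply(
    '{"errCode": 0, "message": "ok", "count": {"bA": 1, "bG": 2}, "type": "basicAccurate"}')
bad = parse_backend_reply('<html>500</html>')
print(ok['type'], bad['errCode'])  # basicAccurate 400
```

Separating parsing from the HTTP call also means the OCR loop only has to check errCode in one place.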

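PHP backend

Python, PHP, and MySQL together form the client/server architecture. The backend's main job is to save, in real time, a record of each file that users upload.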

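Connecting PHP to MySQL

# Some of the functions involved:
$conn = @mysqli_connect($host,$user,$pwd,$dbname,$port,$socket);	# connect via mysqli; @ suppresses PHP's error output
mysqli_connect_error()	# description of the last connection error, or null if there was none
mysqli_connect_errno()	# code of the last connection error, or 0 if there was none
$result = $conn->query($sql)	# run an SQL statement and return the result set
mysqli_num_rows($result)	# number of rows in $result
mysqli_fetch_assoc($result)	# fetch one row of $result as an associative array
mysqli_close($conn)	# close the database connection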

Inserting data into MySQL

# SQL statements for insert, delete, update, and select:
$insert = "INSERT INTO `tbname` (`col1`, `col2`, `col3`) VALUES ('val1', 'val2', 'val3')";
$delete = "DELETE FROM `tbname` WHERE `id`='val'";
$update = "UPDATE `tbname` SET `col1`='val1', `col2`='val2' WHERE `id`='val'";
$select = "SELECT * FROM `tbname`";

index.php:

<?php
$conn = @mysqli_connect('127.0.0.1', 'root', 'root', 'pdfrecord', 3306, '');
if (mysqli_connect_error()) {
    $input['errCode'] = 403;
    $input['count'] = null;
    $input['message'] = 'Failed to connect to the database';
    print_r(json_encode($input));	# output the reply as JSON text
} else {
    # $app_id is not defined in this excerpt; it presumably comes from the
    # request body. Escape it (or use a prepared statement) before building SQL.
    $query = "SELECT * FROM
        `index` WHERE
        `app_id`='$app_id'";
    $result = $conn->query($query);

    # Create the two rows (one per OCR type) if they do not both exist
    if (mysqli_num_rows($result) != 2) {
        # Shared prefix of the INSERT statement
        $insertsql = "INSERT INTO `index`
        (`app_id`,`type`,`count`,`unix_time`) VALUES ";
        # Run the statements
        $conn->query($insertsql . "('$app_id','basicAccurate','0',UNIX_TIMESTAMP(NOW()))");
        $conn->query($insertsql . "('$app_id','basicGeneral','0',UNIX_TIMESTAMP(NOW()))");
        # Re-run the query to refresh the result set
        $result = $conn->query($query);
    }

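Resetting the free OCR quota

Baidu AI's OCR service resets the free quota at midnight (24:00) every day, so the local database has to reset its count of free calls on the same schedule.

index.php:

<?php
# Check the stored timestamp and reset today's free quota if needed
if (strtotime(date('Ymd')) > $rows[0]['unix_time']) {
	# Today's midnight timestamp > timestamp of the stored record:
	# the record is from a previous day, so reset the counts
	$c_str = "UPDATE `index` SET
	`count`=0 WHERE
	`app_id`='{$Data['a']}' AND";
	$conn->query($c_str . " `type`='basicAccurate'");
	$conn->query($c_str . " `type`='basicGeneral'");

	# Update the rows that were already fetched
	$rows[0]['count'] = 0;
	$rows[1]['count'] = 0;
}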
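
The reset test compares the stored timestamp with the timestamp of today's midnight. The same check in Python, as a sketch: needs_reset is a hypothetical name, and all times are local.

```python
from datetime import datetime, time


def needs_reset(record_ts: float, now: datetime) -> bool:
    """True if the record was written before today's local midnight."""
    midnight = datetime.combine(now.date(), time.min)
    return midnight.timestamp() > record_ts


now = datetime(2024, 9, 14, 11, 24, 16)
yesterday = datetime(2024, 9, 13, 23, 59, 59).timestamp()
this_morning = datetime(2024, 9, 14, 8, 0, 0).timestamp()
print(needs_reset(yesterday, now), needs_reset(this_morning, now))  # True False
```

For the check to keep working on later days, the stored timestamp also has to be refreshed whenever a call is recorded; that part is not shown in the excerpt above.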

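Handling request bodies in PHP

A PHP script can handle any HTTP request method, such as POST, GET, DELETE, PUT, CONNECT, OPTIONS, HEAD, or TRACE; GET and POST are by far the most common.
The statement header('Access-Control-Allow-Methods:GET,POST,OPTIONS'); tells browsers which methods cross-origin requests may use.

GET is normally used to fetch data. PHP reads its query-string parameters from $_GET;
POST is normally used to submit data to the server.

$_POST holds the body of a classic web form POST, i.e. a request with Content-Type:application/x-www-form-urlencoded;charset=utf-8.
For a POST with a JSON body, PHP reads the raw body with file_get_contents('php://input'); and parses it with json_decode(), passing true as the second argument to get a PHP array.

When a frontend sends a request to a different origin, the browser applies the same-origin policy. The browser sends the page's origin in the Origin request header, and the server's response headers decide whether the cross-origin request is allowed. I do not restrict origins here, so header('Access-Control-Allow-Origin:*'); allows requests from any origin.

header('Access-Control-Allow-Headers:...') lists the request headers a cross-origin request may carry. Content-Type must be included, because a JSON Content-Type is not on the CORS safelist. For why Origin, X-Requested-With, and Allow-Methods:OPTIONS are also set, see the linked details.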

index.php:

<?php
header('Access-Control-Allow-Origin:*');
header('Access-Control-Allow-Methods:GET,POST,OPTIONS');
header('Access-Control-Allow-Headers:Origin, X-Requested-With, Content-Type');
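
The difference between the two body formats is easiest to see from the receiving side. The sketch below decodes the same data from a form-encoded body and from a JSON body using only the Python standard library. decode_body is a hypothetical helper that mirrors the split between $_POST and php://input.

```python
import json
from urllib.parse import parse_qs


def decode_body(content_type: str, body: str) -> dict:
    """Decode a request body according to its Content-Type."""
    media_type = content_type.split(';', 1)[0].strip().lower()
    if media_type == 'application/json':
        return json.loads(body)
    if media_type == 'application/x-www-form-urlencoded':
        # parse_qs returns a list per field; keep the first value of each
        return {key: values[0] for key, values in parse_qs(body).items()}
    raise ValueError('unsupported Content-Type: ' + content_type)


form = decode_body('application/x-www-form-urlencoded;charset=utf-8',
                   'filename=a.pdf&count=1')
js = decode_body('application/json', '{"filename": "a.pdf", "count": 1}')
print(form)  # {'filename': 'a.pdf', 'count': '1'}
print(js)    # {'filename': 'a.pdf', 'count': 1}
```

Note that form fields always arrive as strings, while JSON keeps numbers as numbers, which is one reason the Python client sends JSON.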

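MySQL table structure

# Connect to MySQL: mysql -uroot -p...			# -u: user name, -p: password
# Create a database: create database `dbname`;		# statements typed at the prompt must end with ;
mysql> create database `pdfrecord`;
mysql> use `pdfrecord`;

For details, see a guide to creating MySQL databases and tables.

The backend uses only two tables:
index records each Baidu AI user's count of free OCR API calls
record records the files that users upload

The table structures are as follows:
index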

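Column	Data type	Length	Primary key	Foreign key	Nullable	Default	Description
id	int	4	yes	no	no		Unique identifier
app_id	varchar	30	no	no	no		OCR APP_ID
type	varchar	30	no	no	no		OCR type
count	int	8	no	no	no		OCR call count
unix_time	int	11	no	no	no		10-digit UNIX timestamp

record

Column	Data type	Length	Primary key	Foreign key	Nullable	Default	Description
id	int	4	yes	no	no		Unique identifier
filename	varchar	50	no	no	no		File name
filesize	varchar	20	no	no	no		File size (MB)
app_id	varchar	30	no	no	no		OCR APP_ID
type	varchar	30	no	no	no		OCR type
unix_time	int	11	no	no	no		10-digit UNIX timestamp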
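
The two tables can be tried out quickly with SQLite from Python. This is a stand-in for experimentation only: the backend itself uses MySQL, SQLite does not enforce the column lengths, and AUTOINCREMENT on id and the demo-app value are assumptions.

```python
import sqlite3
import time

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE "index" (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    app_id    VARCHAR(30) NOT NULL,
    type      VARCHAR(30) NOT NULL,
    count     INTEGER     NOT NULL,
    unix_time INTEGER     NOT NULL
);
CREATE TABLE record (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    filename  VARCHAR(50) NOT NULL,
    filesize  VARCHAR(20) NOT NULL,
    app_id    VARCHAR(30) NOT NULL,
    type      VARCHAR(30) NOT NULL,
    unix_time INTEGER     NOT NULL
);
""")

# One row per OCR type, as index.php creates for a new app_id
now = int(time.time())
for ocr_type in ('basicAccurate', 'basicGeneral'):
    conn.execute(
        'INSERT INTO "index" (app_id, type, count, unix_time) VALUES (?, ?, ?, ?)',
        ('demo-app', ocr_type, 0, now))

rows = conn.execute(
    'SELECT type, count FROM "index" WHERE app_id = ? ORDER BY id',
    ('demo-app',)).fetchall()
print(rows)  # [('basicAccurate', 0), ('basicGeneral', 0)]
```

The ? placeholders pass values separately from the SQL text, which avoids the quoting problems of building statements by string concatenation.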