Batch-extracting text from images with Python: how to extract the text in an image

Reposted

编程梦想家 2024-03-06 06:31:13


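Extracting text from images

How can you extract the text in an image with Python?

1. Using the third-party Python library pytesseract

Install it with pip install pytesseract. The code below also imports PIL, which is provided by the Pillow package (pip install pillow).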

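1.1 Install tesseract-ocr

First download the tesseract-ocr build for your operating system from the official website. I am on Windows, so I downloaded the exe installer. Run the exe and choose an installation directory. Remember this directory, because it is needed later; mine is D:\ruanjian\Tesseract-OCR.

1.2 Download a trained language pack

To extract Chinese text from an image, download chi_sim.traineddata and place it in the tessdata folder inside the tesseract-ocr installation directory chosen above, as shown in the figure: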

[Figure: chi_sim.traineddata inside the tessdata folder]
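Before running any OCR, it can help to confirm that the language files really are in the tessdata folder. The sketch below is only an illustration: the helper name missing_language_packs is hypothetical, and it simply checks for <lang>.traineddata files on disk, assuming the directory layout described above.

```python
from pathlib import Path


def missing_language_packs(tessdata_dir, langs):
    """Return the languages whose <lang>.traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [lang for lang in langs
            if not (folder / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # Path taken from the installation example above; adjust it to your own.
    tessdata = r"D:\ruanjian\Tesseract-OCR\tessdata"
    missing = missing_language_packs(tessdata, ["chi_sim", "eng"])
    if missing:
        print("Missing language packs:", ", ".join(missing))
    else:
        print("All requested language packs are present.")
```

An empty list means every requested pack was found; anything else names the files that still need to be downloaded.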

1.3 Code

import pytesseract
from PIL import Image

# Open an image
image = Image.open(r'images\82-望岳.png')
pytesseract.pytesseract.tesseract_cmd = r'D:\ruanjian\Tesseract-OCR\tesseract.exe'
tessdata_dir_config = r'--tessdata-dir "D:\ruanjian\Tesseract-OCR\tessdata"'
# Extract Chinese. For English, download the language pack first, then set lang='eng'.
text = pytesseract.image_to_string(image, lang='chi_sim', config=tessdata_dir_config)

print(text)

For example, suppose I need to extract the text from the following image:

[Figure: the image to be recognized]

Result:

[Figure: text extracted by pytesseract]

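The advantage of this approach is that it can be run an unlimited number of times once the local environment is set up. Its drawback is that, configured with a single language as above, it copes poorly with mixed-language images: when a picture mixes Chinese and English, the extraction quality is not very good.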
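Tesseract accepts several language codes joined with + in the lang argument (for example chi_sim+eng), which may help with images that mix Chinese and English, provided every language pack involved is installed. The following is a minimal sketch rather than the author's method: the helper names build_lang and ocr_mixed are hypothetical, and the quality on mixed text still depends on the image.

```python
def build_lang(langs):
    """Join Tesseract language codes into one lang string, e.g. 'chi_sim+eng'."""
    return "+".join(langs)


def ocr_mixed(image_path, langs=("chi_sim", "eng"), config=""):
    """Run OCR with several languages at once (every pack must be installed)."""
    # Imported lazily so build_lang can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=build_lang(langs), config=config
        )
```

The config argument can carry the same --tessdata-dir setting used in the code above.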

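2. Using the Baidu API

First create an application in Baidu AI Cloud (百度智能云) to obtain the APP_ID, API_KEY and SECRET_KEY.

Then download the Python SDK and install it with pip install aip-python-sdk-2.2.15.zip.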
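The SDK can also perform the recognition itself, without building the HTTP request by hand. The sketch below rests on assumptions: it relies on the SDK's AipOcr.basicGeneral method, and on the response being a dict that carries a words_result list on success or error_code and error_msg on failure. The helper names words_from_result and ocr_with_sdk are hypothetical.

```python
def words_from_result(result):
    """Pull the recognized lines out of an OCR response dict."""
    if "error_code" in result:
        raise RuntimeError(
            f"OCR failed: {result['error_code']} {result.get('error_msg', '')}"
        )
    return [item["words"] for item in result.get("words_result", [])]


def ocr_with_sdk(image_path, app_id, api_key, secret_key):
    """Recognize one image through the SDK client and return its text lines."""
    from aip import AipOcr  # lazy import: only needed for a real call

    client = AipOcr(app_id, api_key, secret_key)
    with open(image_path, "rb") as fp:
        result = client.basicGeneral(fp.read())
    return words_from_result(result)
```

Keeping the response parsing in its own function means it can be checked against a hand-written dict without spending any of the API quota.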

import base64
import time

import requests
from aip import AipOcr

# https://console.bce.baidu.com/ai/#/ai/ocr/overview/index
# Your APP_ID, API_KEY (AK) and SECRET_KEY (SK)
APP_ID = 'your APP_ID'
API_KEY = 'your API_KEY'
SECRET_KEY = 'your SECRET_KEY'
# Baidu API client from the SDK (the REST calls below do not use it)
CLIENT = AipOcr(APP_ID, API_KEY, SECRET_KEY)
# Request headers
HEADERS = {
    'Content-Type': 'application/x-www-form-urlencoded'
}
# URL for obtaining the access token
URL = 'https://aip.baidubce.com/oauth/2.0/token'
ACCESS_TOKEN = None
# Records when the access token was last obtained
START_TIME = time.time()


def get_file_content(file_path):
    # Read the file contents as bytes
    with open(file_path, 'rb') as fp:
        return fp.read()


def get_access_token():
    # Obtain an access token and remember when it was issued
    global ACCESS_TOKEN, START_TIME
    response = requests.post(URL,
                             {'grant_type': 'client_credentials', 'client_id': API_KEY, 'client_secret': SECRET_KEY})
    # The response body is JSON
    ACCESS_TOKEN = response.json()['access_token']
    START_TIME = time.time()


def req_url(image):
    # Call the Baidu AI endpoint and return the recognition result as a dict.
    # The quota is 50,000 calls per day.
    # Fetch a token on first use, and again once it is older than 7000 seconds
    if not ACCESS_TOKEN or (time.time() - START_TIME > 7000):
        get_access_token()
    response = requests.post('https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token=%s' % ACCESS_TOKEN,
                             {'image': image}, headers=HEADERS)
    return response.json()


if __name__ == '__main__':
    # Image contents
    image = get_file_content(r'image\望岳.png')
    # Send the base64-encoded image and get the recognition result
    ret = req_url(base64.b64encode(image).decode())
    if 'words_result' in ret:
        for words in ret['words_result']:
            print(words['words'])

Output:

[Figure: text returned by the Baidu API]
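Since the goal is batch extraction, either method can be wrapped in a loop over a folder of images. This is a minimal sketch under stated assumptions: batch_extract and IMAGE_SUFFIXES are hypothetical names, and ocr stands for any function that takes an image path and returns text, such as a wrapper around pytesseract.image_to_string or around the Baidu request.

```python
from pathlib import Path

# File extensions treated as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def batch_extract(folder, ocr, out_dir=None):
    """Apply ocr(path) to every image in folder and return {filename: text}.

    If out_dir is given, each result is also saved as <image name>.txt there.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        try:
            text = ocr(path)
        except Exception as exc:  # keep going if one image fails
            print(f"skipped {path.name}: {exc}")
            continue
        results[path.name] = text
        if out_dir is not None:
            out = Path(out_dir)
            out.mkdir(parents=True, exist_ok=True)
            (out / f"{path.stem}.txt").write_text(text, encoding="utf-8")
    return results
```

With pytesseract, for instance, ocr could be lambda p: pytesseract.image_to_string(Image.open(p), lang='chi_sim'). With the Baidu API, keep the daily quota in mind when the folder is large.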