How to Extract Text from Images with Python: Python Image Data Extraction

Reposted

落花有意飞花 2024-07-20 20:35:00


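Hi everyone, I'm 陈序不懂程序. In the AI era, Python, a simple and easy-to-learn programming language, has become even more popular. In my view, Python's biggest advantage is its rich library ecosystem: many tools, such as AI tools and file-processing tools, already exist as open-source projects.

We may not have time to study Python in depth, but it is still worth learning how to run these open-source projects and so expand our own "arsenal".

This article shares how 勇哥, a Java developer who considers himself a Python beginner, spent two hours getting a Python project to run for the first time. It is meant as a starting reference for others.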


The project is powerpoint-extractor, which extracts the images from PowerPoint (.pptx) files and saves them to a fixed output directory.
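To make the idea concrete before running the project, here is a minimal standard-library sketch. It is not how powerpoint-extractor works (the project walks slide shapes with python-pptx); it relies only on the fact that a .pptx file is a ZIP archive whose embedded media sit under ppt/media/. The function name extract_media and the file names in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_media(pptx_path, output_dir):
    """Copy every file under ppt/media/ in a .pptx archive into output_dir.

    Hypothetical helper for illustration; returns the saved file names, sorted.
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(pptx_path) as archive:
        for member in archive.namelist():
            # Embedded pictures and other media are stored under ppt/media/.
            if member.startswith("ppt/media/") and not member.endswith("/"):
                target = output / Path(member).name
                target.write_bytes(archive.read(member))
                saved.append(target.name)
    return sorted(saved)
```

A call such as extract_media("slides.pptx", "images") would copy the media files into an images folder. Unlike the project, this shortcut does not record which slide each image came from, and it does not collect slide text or speaker notes.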

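1 Install the Python environment

First open a terminal and type python3 to check whether Python 3 is already installed. Typing python instead launches the version bundled with macOS (only on older releases; macOS 12.3 and later no longer ship one).

Command: python3 [then press Enter]

If the interactive prompt (>>>) appears, Python 3 is installed. [To exit, type exit() and press Enter.]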


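If it is not installed, there are two ways to install Python 3:

Option 1: install it with Homebrew: brew install python3
Option 2: download an installer from the official "Python Releases for macOS" page, choosing the version you need.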


2 The powerpoint-extractor project

Clone the project with git:

git clone git@github.com:2TallTyler/powerpoint-extractor.git

The project depends on the python-pptx package, so install it through the Tsinghua mirror with the following command:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple python-pptx


When the installation finishes, run pip3 list to see the installed packages and confirm that python-pptx is among them.
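The same check can also be done from inside Python, which is handy when several interpreters are installed. The sketch below uses the standard-library importlib.metadata module (Python 3.8 or newer); the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Reports the version seen by the interpreter that runs this script.
print("python-pptx:", installed_version("python-pptx") or "not installed")
```

If this prints "not installed" even though pip3 list shows the package, the script is probably running under a different interpreter than the one pip3 installed into.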


3 Configuring PyCharm

Open the project in PyCharm.


At first the Python interpreter is not configured, so the imports in the .py script are flagged as unresolved.


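Click the button to add a Python interpreter, check the option to inherit global site-packages, confirm that the path to the python3 executable is correct, and click OK to finish the configuration.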
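To confirm that PyCharm is now using the intended interpreter, a small script like the following can be run from inside the IDE. This is only a diagnostic sketch; module_available is a hypothetical helper name, and pptx is the import name of the python-pptx package.

```python
import importlib.util
import sys


def module_available(module_name):
    """Report whether the current interpreter can locate the given module."""
    return importlib.util.find_spec(module_name) is not None


# The path shown here should match the python3 path selected in PyCharm.
print("Interpreter:", sys.executable)
print("pptx importable:", module_available("pptx"))
```

If the second line prints False, the selected interpreter cannot see the globally installed packages, and the interpreter settings are worth revisiting.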

Open extract.py. The core code is easy to follow:

for eachfile in glob.glob(self.input_dir + os.sep + "*.pptx"):
    ppt = Presentation(eachfile)
    print("* " + eachfile)
    presentation_count += 1
    self.cur_image_index = 1

    name = self.generate_image_name_part(eachfile)

    # Walk through every slide
    for page, slide in enumerate(ppt.slides):
        # Collect all text on the slide into one string, separated by line breaks
        text = ''
        for shape in slide.shapes:
            if shape.has_text_frame and shape.text.strip():
                text += os.linesep
                text += shape.text

        # Collect the images found on this slide
        self.cur_slide_images = []

        # Save the images on the slide
        for shape in slide.shapes:
            self.drill_for_images(shape, page + 1, name)

        # Write the page number, collected text, and speaker notes as a new row in the CSV file
        image_list = ''
        if len(self.cur_slide_images) > 0:
            image_list = ','.join(self.cur_slide_images)  # Turn the image list into a comma-separated string

        # Write the row to the CSV file
        writer.writerow([eachfile, page + 1, text, slide.notes_slide.notes_text_frame.text, image_list])
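The text-collection and row-building steps of the excerpt can be tried in isolation without a real presentation. The sketch below is an illustration, not the project's code: collect_slide_text and build_row are hypothetical names, the shapes are stand-in objects rather than python-pptx shapes, and the text is joined with "\n" instead of prepending os.linesep, so the result has no leading line break.

```python
import csv
import io
from types import SimpleNamespace


def collect_slide_text(shapes):
    """Join the non-empty text of every text-bearing shape with newlines."""
    parts = [
        shape.text
        for shape in shapes
        # Shapes without a text frame are skipped before .text is touched.
        if getattr(shape, "has_text_frame", False) and shape.text.strip()
    ]
    return "\n".join(parts)


def build_row(filename, page, shapes, notes, image_names):
    """Assemble one CSV row in the same column order as the excerpt."""
    return [filename, page, collect_slide_text(shapes), notes, ",".join(image_names)]


# Stand-in shapes for demonstration only; real ones come from slide.shapes.
demo_shapes = [
    SimpleNamespace(has_text_frame=True, text="Title"),
    SimpleNamespace(has_text_frame=False),
    SimpleNamespace(has_text_frame=True, text="   "),
    SimpleNamespace(has_text_frame=True, text="Body text"),
]

buffer = io.StringIO()
csv.writer(buffer).writerow(
    build_row("demo.pptx", 1, demo_shapes, "speaker notes", ["a.png", "b.png"])
)
print(buffer.getvalue())
```

Because the text field contains a line break and the image field contains a comma, the csv module quotes both, which keeps the five columns intact when the file is opened in a spreadsheet.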


The extract.py loop shown above performs the following steps: