PDF parsing tools: parsing PDF files with Python

Reposted

云端行者 2023-12-04 19:02:29


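I Parsing PDFs with Python

Many documents, such as papers, technical manuals, and books, are distributed as PDF so that their content and layout stay fixed, and that makes reading their content from a program troublesome. Python has many packages for parsing PDFs. This article compares PyPDF2, pdfplumber, pdfminer3k, and Camelot to show which one is the most practical PDF parsing tool.

I Parsing PDF documents with PyPDF2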

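import PyPDF2

path = r"****.pdf"
# Open the PDF in binary mode ('rb' is required)
with open(path, mode='rb') as mypdf:
    # Create the reader. PdfFileReader, numPages, getPage and extractText are
    # the pre-3.0 names; PyPDF2 3.x only accepts the names used here.
    pdf_document = PyPDF2.PdfReader(mypdf)
    # Use pdf_document to query the file, for example the number of pages
    print(len(pdf_document.pages))
    # Print the content of the first page
    first_page = pdf_document.pages[0]
    print(first_page.extract_text())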

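Printing the first page shows that PyPDF2 handles Chinese text poorly, while its support for English is good. For Chinese documents, use the approach below instead.

II Parsing PDF documents with pdfplumber

Install it with: pip install pdfplumber

1 Reading a PDF

pdfplumber provides two ways to read a PDF:

pdfplumber.open("path/to/file.pdf")
pdfplumber.load(file_like_object)

Both return an instance of the pdfplumber.PDF class. Note that pdfplumber.load belongs to older releases; in recent versions it has been removed, and pdfplumber.open accepts a file-like object directly.
To load a password-protected PDF, pass the password argument, for example: pdfplumber.open("file.pdf", password="test")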

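2 The pdfplumber.PDF class

The top-level pdfplumber.PDF class represents a single PDF and has two main attributes:

Attribute	Description
.metadata	A dictionary of metadata key/value pairs taken from the PDF's Info trailer. It typically includes "CreationDate", "ModDate", "Producer", and so on.
.pages	A list of pdfplumber.Page instances, one for each page of the PDF.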

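3 The pdfplumber.Page class

The pdfplumber.Page class is the core of pdfplumber, and most operations revolve around it. It has the following attributes:

Attribute	Description
.page_number	The sequential page number, starting at 1 for the first page, 2 for the second, and so on.
.width	The page width.
.height	The page height.
.objects/.chars/.lines/.rects/.curves/.figures/.images	Each of these attributes is a list, and each list contains one dictionary for every such object embedded on the page. See "Objects" below for more details.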

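It also has these main methods (the default values shown in the signatures may differ between pdfplumber versions):

Method	Description
.crop(bounding_box)	Returns a cropped version of the page. bounding_box should be a 4-tuple with the values (x0, top, x1, bottom). The cropped page keeps every object that falls at least partly within the bounding box. An object that falls only partly within the box is also included (the pdfplumber documentation adds that its dimensions are sliced to fit the box).
.within_bbox(bounding_box)	Similar to .crop, but keeps only the objects that fall entirely within bounding_box.
.filter(test_function)	Returns a version of the page that contains only the .objects for which test_function(obj) returns True.
.extract_text(x_tolerance=0, y_tolerance=0)	Collates all of the page's character objects into a single string. A space is added wherever the difference between the x1 of one character and the x0 of the next is greater than x_tolerance. A newline is added wherever the difference between the doctop of one character and the doctop of the next is greater than y_tolerance.
.extract_words(x_tolerance=0, y_tolerance=0, horizontal_ltr=True, vertical_ttb=True)	Returns a list of all word-like items and their bounding boxes. A word is a sequence of characters where (for "upright" characters) the difference between the x1 of one character and the x0 of the next is less than or equal to x_tolerance, and the difference between the doctop of one character and the doctop of the next is less than or equal to y_tolerance. Non-upright characters are handled the same way, except that the vertical rather than the horizontal distance between them is measured. The horizontal_ltr and vertical_ttb parameters indicate whether words should be read left to right (for horizontal words) and top to bottom (for vertical words).
.extract_tables(table_settings)	Extracts tabular data from the page. For details, see the table-extraction section of the pdfplumber documentation.
.to_image(**conversion_kwargs)	Returns an instance of the PageImage class. For details, including conversion_kwargs, see the visual-debugging section of the pdfplumber documentation.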
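The difference between .crop and .within_bbox in the table above is easy to mix up, so here is a minimal pure-Python sketch of the two selection rules. It does not call pdfplumber; the helper names (overlaps, is_within) and the sample objects are made up for illustration, and the slicing of partly overlapping objects is left out.

```python
def overlaps(obj, bbox):
    """True if obj falls at least partly inside bbox (the rule described for .crop)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] < x1 and obj["x1"] > x0
            and obj["top"] < bottom and obj["bottom"] > top)


def is_within(obj, bbox):
    """True if obj falls entirely inside bbox (the rule described for .within_bbox)."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] >= x0 and obj["x1"] <= x1
            and obj["top"] >= top and obj["bottom"] <= bottom)


# Hypothetical objects, using the same keys as pdfplumber's object dictionaries
objects = [
    {"text": "inside", "x0": 20, "top": 20, "x1": 40, "bottom": 30},
    {"text": "straddling", "x0": 90, "top": 20, "x1": 120, "bottom": 30},
    {"text": "outside", "x0": 150, "top": 20, "x1": 170, "bottom": 30},
]
bbox = (10, 10, 100, 100)  # (x0, top, x1, bottom)

crop_like = [o["text"] for o in objects if overlaps(o, bbox)]
within_like = [o["text"] for o in objects if is_within(o, bbox)]
print(crop_like)    # the straddling object is kept
print(within_like)  # the straddling object is dropped
```

In practice, .within_bbox is the safer choice when a half-visible word at the edge of a region would pollute the result.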

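4 Objects

Every pdfplumber.PDF and pdfplumber.Page instance gives access to several kinds of objects. Each of the following attributes returns a Python list of the matching objects:

.chars: each item is a single text character;
.annos: each item is a single character of annotation text;
.lines: each item is a single one-dimensional line;
.rects: each item is a single two-dimensional rectangle;
.curves: each item is a series of connected points;
.images: each item is an image.

Each object is represented as a Python dict with the following attributes: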

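5 chars / annos attributes

Attribute	Description
page_number	The number of the page on which this character was found.
text	The character's text, e.g. "z", "Z", or "你".
fontname	The character's font.
size	The font size.
adv	Equal to text width × font size × scaling factor.
upright	Whether the character is upright.
height	The height of the character.
width	The width of the character.
x0	Distance from the left side of the character to the left side of the page.
y0	Distance from the bottom of the character to the bottom of the page.
x1	Distance from the right side of the character to the left side of the page.
y1	Distance from the top of the character to the bottom of the page.
top	Distance from the top of the character to the top of the page.
bottom	Distance from the bottom of the character to the top of the page.
doctop	Distance from the top of the character to the top of the document.
obj_type	"char" or "anno"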
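The x0, x1, and doctop attributes listed above are what the tolerance rule of extract_text compares. The following is a simplified sketch of that rule, not pdfplumber's actual implementation; join_chars and make_char are hypothetical helpers and the character positions are invented.

```python
def join_chars(chars, x_tolerance=0, y_tolerance=0):
    """Join character dicts into text using the spacing rule described for extract_text."""
    pieces = []
    previous = None
    for char in chars:
        if previous is not None:
            if char["doctop"] - previous["doctop"] > y_tolerance:
                pieces.append("\n")  # vertical jump larger than y_tolerance: new line
            elif char["x0"] - previous["x1"] > x_tolerance:
                pieces.append(" ")   # horizontal gap larger than x_tolerance: space
        pieces.append(char["text"])
        previous = char
    return "".join(pieces)


def make_char(text, x0, x1, doctop):
    """Build a minimal character dict with only the keys the rule needs."""
    return {"text": text, "x0": x0, "x1": x1, "doctop": doctop}


# Invented positions: letters 0.5 apart within a word, 5 apart between words
chars = [
    make_char("P", 10, 16, 100),
    make_char("D", 16.5, 22, 100),
    make_char("F", 22.5, 28, 100),
    make_char("o", 33, 38, 100),
    make_char("k", 38.5, 43, 100),
    make_char("!", 10, 13, 115),
]

strict = join_chars(chars, x_tolerance=0, y_tolerance=0)
loose = join_chars(chars, x_tolerance=1, y_tolerance=3)
print(repr(strict))  # every tiny gap becomes a space
print(repr(loose))   # only the word gap becomes a space
```

This is why a tolerance of 0 can scatter spaces between letters in PDFs with loose kerning, and why raising x_tolerance a little often fixes it.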

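6 line attributes

Attribute	Description
page_number	The number of the page on which this line was found.
height	The height of the line.
width	The width of the line.
x0	Distance from the leftmost point of the line to the left side of the page.
y0	Distance from the bottom of the line to the bottom of the page.
x1	Distance from the rightmost point of the line to the left side of the page.
y1	Distance from the top of the line to the bottom of the page.
top	Distance from the top of the line to the top of the page.
bottom	Distance from the bottom of the line to the top of the page.
doctop	Distance from the top of the line to the top of the document.
linewidth	The thickness of the line.
obj_type	"line"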
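The same coordinate attributes recur for every object type, and they mix two origins: y0 and y1 are measured from the bottom of the page, while top, bottom, and doctop are measured from the top. A small sketch of how they relate, assuming the page's coordinate system starts at 0 at the bottom edge and that doctop simply adds the heights of the preceding pages (to_top_origin and its parameters are hypothetical names):

```python
def to_top_origin(y0, y1, page_height, pages_above_height=0):
    """Convert bottom-origin y0/y1 into top-origin top/bottom/doctop."""
    top = page_height - y1                # top edge measured from the top of the page
    bottom = page_height - y0             # bottom edge measured from the top of the page
    doctop = pages_above_height + top     # top edge measured from the top of the document
    return {"top": top, "bottom": bottom, "doctop": doctop}


# An object on the second page of a document whose pages are 842 units tall
result = to_top_origin(y0=700, y1=712, page_height=842, pages_above_height=842)
print(result)
```

Because doctop keeps growing from page to page, it is the convenient value for ordering objects across a whole document.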

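7 rect attributes

Attribute	Description
page_number	The number of the page on which this rectangle was found.
height	The height of the rectangle.
width	The width of the rectangle.
x0	Distance from the left side of the rectangle to the left side of the page.
y0	Distance from the bottom of the rectangle to the bottom of the page.
x1	Distance from the right side of the rectangle to the left side of the page.
y1	Distance from the top of the rectangle to the bottom of the page.
top	Distance from the top of the rectangle to the top of the page.
bottom	Distance from the bottom of the rectangle to the top of the page.
doctop	Distance from the top of the rectangle to the top of the document.
linewidth	The thickness of the rectangle's border.
obj_type	"rect"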

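8 curve attributes

Attribute	Description
page_number	The number of the page on which this curve was found.
points	The points that describe the curve, as a list of (x, top) tuples.
height	The height of the curve's bounding box.
width	The width of the curve's bounding box.
x0	Distance from the leftmost point of the curve to the left side of the page.
y0	Distance from the lowest point of the curve to the bottom of the page.
x1	Distance from the rightmost point of the curve to the left side of the page.
y1	Distance from the highest point of the curve to the bottom of the page.
top	Distance from the highest point of the curve to the top of the page.
bottom	Distance from the lowest point of the curve to the top of the page.
doctop	Distance from the highest point of the curve to the top of the document.
linewidth	The thickness of the line.
obj_type	"curve"

In addition, both pdfplumber.PDF and pdfplumber.Page give access to two derived lists of objects: .rect_edges (which breaks each rectangle into its four lines) and .edges (which combines .rect_edges with .lines).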
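To make the idea behind .rect_edges concrete, here is a sketch of how one rectangle can be broken into its four edges. This is an illustration under assumptions rather than pdfplumber's own code: rect_to_edges is a hypothetical helper and the keys of the returned dictionaries are chosen for readability.

```python
def rect_to_edges(rect):
    """Split a rectangle dict into four edge dicts (top, bottom, left, right)."""
    x0, x1, top, bottom = rect["x0"], rect["x1"], rect["top"], rect["bottom"]

    def edge(ex0, etop, ex1, ebottom, orientation):
        return {
            "x0": ex0, "top": etop, "x1": ex1, "bottom": ebottom,
            "width": ex1 - ex0, "height": ebottom - etop,
            "orientation": orientation,  # "h" for horizontal, "v" for vertical
        }

    return [
        edge(x0, top, x1, top, "h"),        # top edge
        edge(x0, bottom, x1, bottom, "h"),  # bottom edge
        edge(x0, top, x0, bottom, "v"),     # left edge
        edge(x1, top, x1, bottom, "v"),     # right edge
    ]


rect = {"x0": 50, "top": 100, "x1": 250, "bottom": 160}
edges = rect_to_edges(rect)
for e in edges:
    print(e["orientation"], e["width"], e["height"])
```

Treating rectangle borders as plain lines is what lets table detection work the same way whether a table was drawn with lines or with boxes.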

1 Parsing text content

pdfplumber's extract_text function reads the text content of a PDF directly.
First, read the text of the whole PDF document:

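import pdfplumber

path = r"****.pdf"
with pdfplumber.open(path) as pdf:
    content = ''
    # len(pdf.pages) is the number of pages in the PDF
    for i in range(len(pdf.pages)):
        # pdf.pages[i] is page i+1 of the document
        page = pdf.pages[i]
        # extract_text() returns the page text; fall back to '' in case a page
        # without text yields None
        page_text = page.extract_text() or ''
        # Dropping the last line removes the page number at the bottom of the page
        page_content = '\n'.join(page_text.split('\n')[:-1])
        content = content + page_content
    print(content)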

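Next, parse the text and pull the fault-code section out of an after-sales solution PDF. As the figure below shows, the fault codes span two pages.

[Figure: the fault-code section of the sample PDF, spread across two pages]

In documents of this kind, the fault codes always sit between the text 故障代码列举如下： ("The fault codes are listed as follows:") and 2., so once the PDF has been parsed this part is easy to extract: print(content.split('故障代码列举如下：')[1].split('2.')[0]).

The result is shown below; the section is extracted cleanly.

[Figure: output of the extraction]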
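The split-based one-liner raises an IndexError when the start marker is missing from the text. A slightly more defensive sketch of the same "take what lies between two markers" idea is shown here; extract_between is a hypothetical helper, and the sample string is invented to stand in for the parsed PDF content.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()


# Invented sample standing in for the content extracted from the PDF
sample = "1.Description\n故障代码列举如下：\nE01 sensor fault\nE02 link fault\n2.Handling\n"

codes = extract_between(sample, "故障代码列举如下：", "2.")
missing = extract_between(sample, "no such marker", "2.")
print(codes)
print(missing)
```

Returning None instead of raising makes it easy to skip documents that do not follow the expected layout when processing a whole folder of PDFs.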

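2 Parsing table content

The previous section covered extracting text with pdfplumber; extracting tables works in much the same way. pdfplumber's extract_tables function recognizes the tables in a PDF directly.

The example here parses the table on the first page of the PDF. The first page of the sample PDF starts with a table:

[Figure: the table at the top of the first page of the sample PDF]

extract_tables returns each table as a nested list, so converting it to a DataFrame makes it easier to inspect and analyze.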

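import pdfplumber
import pandas as pd

path = r"****.pdf"
with pdfplumber.open(path) as pdf:
    first_page = pdf.pages[0]
    # Build one DataFrame per table found on the first page
    dfs = [pd.DataFrame(table) for table in first_page.extract_tables()]
# The sample page holds a single table, so display the first one
df = dfs[0]
df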

As you can see, this function extracts the table from the PDF with very little effort.

[Figure: the extracted table displayed as a DataFrame]
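A DataFrame built straight from the nested list uses 0, 1, 2, ... as column names and keeps the header as an ordinary row. If the first row of the table is its header, a small cleaning step helps. The sketch below is pure Python; table_to_records is a hypothetical helper, and it assumes that empty cells arrive as None and that wrapped cell text contains newlines.

```python
def table_to_records(table):
    """Turn a nested list whose first row is the header into a list of dicts."""
    def clean(cell):
        # Assumption: empty cells are None and wrapped text contains "\n"
        return "" if cell is None else str(cell).replace("\n", " ").strip()

    if not table:
        return []
    header = [clean(cell) for cell in table[0]]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]


# Invented nested list in the shape that a table extraction returns
raw_table = [
    ["Code", "Meaning"],
    ["E01", "Sensor\nfault"],
    ["E02", None],
]
records = table_to_records(raw_table)
print(records)
```

The resulting list can be passed to pd.DataFrame(records) to get a DataFrame with proper column names.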

As the examples above show, the pdfplumber package parses both the text and the tables of a PDF very well and has good support for Chinese, so it is highly recommended.

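III Parsing PDF documents with pdfminer3k

pdfminer3k is the Python 3 port of pdfminer and is mainly used to read text from PDFs. Searching for pdfminer3k turns up a great many tutorials, but after reading them you may well complain that they are tedious and hard to follow.

1 Installation

On Python 3.6, install it with: pip install pdfminer3k

How it behaves depends on what the PDF contains:

PDF content	Result
Text only	All of the text is read.
Images only	Nothing is read.
Images and text	Only the text is read; the images are not recognized.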

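2 Reference links

The figure below is the flow diagram of how pdfminer parses a PDF document.

[Figure: flow diagram of pdfminer parsing a PDF document]

pdfminer extracts text content well, but for tabular data it returns the text without the table layout, which is awkward to work with. If you only need the text, pdfminer is a reasonable choice, and it also supports Chinese well. With Chinese text, keep encoding and full-width versus half-width characters in mind.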
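For the full-width versus half-width issue, one possible approach (an assumption on my part, not something the original tutorials prescribe) is Unicode NFKC normalization from the standard library, which maps full-width letters, digits, and the ideographic space to their ordinary counterparts. normalize_width is a hypothetical helper name and the sample string is invented.

```python
import unicodedata


def normalize_width(text):
    """Map full-width letters, digits, punctuation and spaces to half-width forms."""
    return unicodedata.normalize("NFKC", text)


# Invented sample mixing full-width Latin characters with Chinese text
extracted = "ＰＤＦ　２０２３：故障代码Ｅ０１"
normalized = normalize_width(extracted)
print(normalized)
```

Be aware that NFKC also converts full-width punctuation such as ， and ： to ASCII, which may not be wanted if the Chinese punctuation should be preserved.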

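IV Parsing PDF documents with Camelot

Install Camelot with pip install camelot-py; if that fails, consult a Camelot installation guide.
Camelot also needs the cv2 package (installed as opencv-python), which such installation guides cover as well.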

import camelot
import pandas as pd
tables = camelot.read_pdf(filepath=path,pages='1',flavor='stream')
df = pd.DataFrame(tables[0].data)

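Camelot is very handy for reading table data from PDF files and supports Chinese well, but it has a number of limitations.
First, with flavor='stream', tables are not detected automatically: stream treats the whole page as a single table.
Second, Camelot only works with text-based PDF files, not with scanned documents.
In summary, pdfplumber is the recommended package for parsing both the text and the tables of PDF documents. If you only need the text, pdfminer also works, and for English documents PyPDF2 is an option.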

V pymupdf

1 Reading images from a PDF
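The original gives no code for this step, so what follows is only a sketch of one way to do it, assuming a recent PyMuPDF release (imported as fitz) in which get_images and extract_image are available. The function names, the output naming scheme, and the paths are all hypothetical.

```python
from pathlib import Path


def image_filename(page_number, image_index, extension):
    """Build an output name such as page001_img02.png (the scheme is an assumption)."""
    return f"page{page_number:03d}_img{image_index:02d}.{extension}"


def extract_images(pdf_path, output_dir):
    """Save every embedded image of the PDF into output_dir and return the paths written."""
    import fitz  # PyMuPDF; imported here so that image_filename works without it

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_index in range(doc.page_count):
            page = doc[page_index]
            for image_index, image in enumerate(page.get_images(full=True), start=1):
                xref = image[0]                 # first item is the image's xref number
                info = doc.extract_image(xref)  # dict holding the raw bytes and extension
                target = out / image_filename(page_index + 1, image_index, info["ext"])
                target.write_bytes(info["image"])
                written.append(target)
    return written
```

A call such as extract_images("input.pdf", "images") would write one file per embedded image. Like the other tools, this only finds images stored as objects in the PDF, not text or vector drawings.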

VI Links

Extracting PDF content with the third-party Python library pdfminer, and fixing its lack of support for Chinese encodings: https://zhuanlan.zhihu.com/p/29410051
Three powerful tools for extracting information from PDF documents with Python: https://cloud.tencent.com/developer/article/1395339
https://xueqiu.com/7967593043/145833647

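II Parsing PDFs with C#

I Building sumatrapdf

I downloaded the latest source code of the open-source project sumatrapdf and tried to build it with the latest Professional edition of VS2017 on my machine, but the build kept failing, so I asked the author on the project's issue tracker.
Repository: https://github.com/sumatrapdfreader/sumatrapdf
Issue: https://github.com/sumatrapdfreader/sumatrapdf/issues/1031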

Please help me. I downloaded the latest version (sumatrapdf-master.zip) and tried to build it with VS2017, but it reports errors. My laptop runs 64-bit Windows 10. After unzipping the file, I opened vs2017/SumatraPDF.sln; compiling and running it fails.
----------------------------------ERROR-----------------------------------------------------------
Severity Code Description Project File Line Suppression State
Error MSB8020 could not find the v141_xp generation tool (platform toolset = "v141_xp"). If you want to use the v141_xp generation tool to generate, install the v141_xp generation tool. Or, you can upgrade to the current Visual Studio tool by selecting the "project" menu or right-click the solution, and then selecting the "redefine solution target". Cmapdump D:\Program\VisualStudio2017\Community\Common7\IDE\VC\VCTargets\Microsoft.Cpp.Platform.targets 57
I hope you can tell me how to compile and run it, based on your own successful build. Thank you.
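kjk replied: You most likely need to re-run the Visual Studio installer and make sure the XP toolchain component is installed.

After I reinstalled VS2017 with the XP component and ran the build again, it still reported errors: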

Today I re-ran the Visual Studio installer and made sure to install the XP toolchain component.
 ----------------------------------ERROR-----------------------------------------------------------
 1>------ Build started: Project: synctex, Configuration: Debug x64 ------
 2>------ Build started: Project: mupdf, Configuration: Debug x64 ------
 1>synctex_parser.c
 1>d:\sumatrapdf-master\ext\synctex\synctex_parser.c(715): error C2220: warning treated as error - no ‘object’ file generated
 1>d:\sumatrapdf-master\ext\synctex\synctex_parser.c(715): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
 1>Done building project “synctex.vcxproj” – FAILED.
 2>pdf-write.c
 2>d:\sumatrapdf-master\mupdf\source\pdf\pdf-write.c(2075): error C2220: warning treated as error - no ‘object’ file generated
 2>d:\sumatrapdf-master\mupdf\source\pdf\pdf-write.c(2075): warning C4819: The file contains a character that cannot be represented in the current code page (0). Save the file in Unicode format to prevent data loss
 2>Done building project “mupdf.vcxproj” – FAILED.
 3>------ Build started: Project: SumatraPDF, Configuration: Debug x64 ------
 3>LINK : fatal error LNK1104: cannot open file ‘D:\sumatrapdf-master\dbg64\mupdf.lib’
 3>Done building project “SumatraPDF.vcxproj” – FAILED.
 ========== Build: 0 succeeded, 3 failed, 16 up-to-date, 0 skipped ==========
 How is this going?

kjk replied: Change the code at line 2075 of pdf-write.c, then re-save synctex_parser.c (the warning suggests saving it in Unicode format), and the build will work.

pdf-write.c
 row 2075
 //fprintf(opts->out, "%%\316\274\341\277\246\n\n"); //wrong
 fprintf(opts->out, "%%\xC2\xB5\xC2\xB6\n\n"); //ok, comes from mupdf-1.12.0-source
 You can disable that warning. It's caused by using non-english OS.
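Warning C4819 complains about a character that the current code page (936) cannot represent. As a rough illustration, written in Python like the earlier sections, the escape sequences from the two fprintf lines can be decoded to see which bytes they stand for. That the original source file contained these bytes literally, rather than as escapes, is my assumption about the cause and is not stated in the issue.

```python
# Byte sequences taken from the two fprintf lines (octal and hex escapes)
original_bytes = b"\316\274\341\277\246"
replacement_bytes = b"\xC2\xB5\xC2\xB6"

# Both sequences are valid UTF-8 and decode to non-ASCII characters
original_text = original_bytes.decode("utf-8")
replacement_text = replacement_bytes.decode("utf-8")


def is_ascii_only(data):
    """True if every byte is below 0x80, i.e. safe under any ASCII-compatible code page."""
    return all(byte < 0x80 for byte in data)


print([hex(b) for b in original_bytes])
print(ascii(original_text), ascii(replacement_text))
print(is_ascii_only(original_bytes), is_ascii_only(replacement_bytes))
```

Either way the program writes non-ASCII bytes into the PDF header comment. Spelling them as escapes keeps the source file itself pure ASCII, so the compiler has nothing to misread regardless of the system code page.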

After rebuilding, it finally worked. The latest version is 3.2.


II Getting the page count of a PDF in C#

// Requires: using System.IO; using System.Text.RegularExpressions;
int get_count(string filePath)
{
	int count = -1;
	using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
	using (StreamReader reader = new StreamReader(fs))
	{
	    // Read the stream from the current position to the end
	    string pdfText = reader.ReadToEnd();
	    // Count the "/Type /Page" entries; [^s] skips the "/Type /Pages" tree nodes
	    Regex rgx = new Regex(@"/Type\s*/Page[^s]");
	    MatchCollection matches = rgx.Matches(pdfText);
	    count = matches.Count;
	}
	Log.Information("page count: " + count.ToString());
	return count;
}
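For comparison with the Python sections, the same regular-expression heuristic can be sketched in Python on raw bytes. count_pages and the sample bytes are made up for illustration; the sample is a hand-written fragment, not a valid PDF.

```python
import re

# Same pattern as the C# version, applied to bytes instead of decoded text
PAGE_PATTERN = re.compile(rb"/Type\s*/Page[^s]")


def count_pages(pdf_bytes):
    """Count page objects by matching /Type /Page but not /Type /Pages."""
    return len(PAGE_PATTERN.findall(pdf_bytes))


# Hand-written fragment: one page tree node and two page objects
sample = (
    b"1 0 obj << /Type /Pages /Kids [2 0 R 3 0 R] /Count 2 >> endobj\n"
    b"2 0 obj << /Type /Page /Parent 1 0 R >> endobj\n"
    b"3 0 obj << /Type/Page /Parent 1 0 R >> endobj\n"
)
page_count = count_pages(sample)
print(page_count)
```

The heuristic only sees page objects stored as plain text. In PDFs that keep their objects in compressed object streams the pattern finds too few matches, so a real PDF library is the more reliable way to get the page count.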

III Converting PDF files to PNG images with the MuPDF library

Download page: https://www.mupdf.com/downloads/index.html. The content of this section comes from a linked source.

IV Converting a PDF to images in C# (JPG, PNG, and other formats)


using PdfiumViewer;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace PDFtoJPG
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        /// <summary>
        /// Button next to the PDF text box: choose the source PDF
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void butPDF_Click(object sender, EventArgs e)
        {
            OpenFileDialog dialog = new OpenFileDialog();
            if (dialog.ShowDialog() == DialogResult.OK)
            {
                // FileName: the full path plus the file name
                this.txtPDF.Text = dialog.FileName;
            }
        }

        /// <summary>
        /// Button next to the image text box: choose the output folder
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void butTUPIAN_Click(object sender, EventArgs e)
        {
            #region SafeFileName test, can be ignored
            //OpenFileDialog dialog = new OpenFileDialog();
            //if (dialog.ShowDialog() == DialogResult.OK)
            //{
            //    //SafeFileName: the file name only, without the path
            //    this.txtJPG.SelectedText = dialog.SafeFileName;
            //}
            #endregion

            System.Windows.Forms.FolderBrowserDialog folder = new System.Windows.Forms.FolderBrowserDialog();
            if (folder.ShowDialog() == DialogResult.OK)
            {
                // SelectedPath: absolute path of the folder, shown in the text box
                this.txtJPG.Text = folder.SelectedPath;
            }
        }

        /// <summary>
        /// Button that converts the PDF to images
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void butPDFtoTUPIAN_Click(object sender, EventArgs e)
        {
            string strpdfPath = txtPDF.Text;
            using (var pdf = PdfiumViewer.PdfDocument.Load(strpdfPath))
            {
                var pdfpage = pdf.PageCount;
                var pagesizes = pdf.PageSizes;

                for (int i = 1; i <= pdfpage; i++)
                {
                    Size size = new Size();
                    size.Height = (int)pagesizes[i - 1].Height;
                    size.Width = (int)pagesizes[i - 1].Width;
                    // Put the page number in the file name so that pages do not overwrite each other.
                    // To use another format, change ".jpg" here and ImageFormat in RenderPage.
                    string outputPath = Path.Combine(txtJPG.Text, txtName.Text + "_" + i + ".jpg");
                    RenderPage(strpdfPath, i, size, outputPath);
                }
            }

            MessageBox.Show("Conversion succeeded!");
        }

        public void RenderPage(string pdfPath, int pageNumber, System.Drawing.Size size, string outputPath, int dpi = 300)
        {
            using (var document = PdfiumViewer.PdfDocument.Load(pdfPath))
            using (var stream = new FileStream(outputPath, FileMode.Create))
            using (var image = GetPageImage(pageNumber, size, document, dpi))
            {
                image.Save(stream, ImageFormat.Jpeg);
            }
        }

        /// <summary>
        /// Renders one page of the document to an image
        /// </summary>
        /// <param name="pageNumber">1-based page number</param>
        /// <param name="size">size of the rendered image</param>
        /// <param name="document">the loaded PDF document</param>
        /// <param name="dpi">resolution passed to the renderer</param>
        /// <returns>the rendered page image</returns>
        private static System.Drawing.Image GetPageImage(int pageNumber, Size size, PdfiumViewer.PdfDocument document, int dpi)
        {
            return document.Render(pageNumber - 1, size.Width, size.Height, dpi, dpi, PdfRenderFlags.Annotations);
        }
    }
}
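One thing to watch in the code above: it passes the page size straight through as the pixel width and height of the image. As far as I can tell PdfiumViewer reports page sizes in points (1/72 inch), so the output would have only as many pixels as the page has points, even with dpi = 300. That reading of the API is an assumption. The arithmetic for the pixel size that a given DPI calls for is sketched below, in Python like the earlier examples; pixels_for is a hypothetical helper.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def pixels_for(points, dpi):
    """Number of pixels needed to render a length given in points at the given DPI."""
    return round(points * dpi / POINTS_PER_INCH)


# An A4 page is roughly 595 x 842 points
a4_width_pt, a4_height_pt = 595, 842
size_at_300 = (pixels_for(a4_width_pt, 300), pixels_for(a4_height_pt, 300))
size_at_72 = (pixels_for(a4_width_pt, 72), pixels_for(a4_height_pt, 72))
print(size_at_300)
print(size_at_72)
```

If the rendered images look blurry, scaling the width and height by dpi / 72 before calling Render is the first thing to try.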
