How to convert docx to PDF with Python: libraries and approaches

Reposted

mob64ca13f937ae 2023-11-10 07:32:05


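I. Conversion approaches

There are many ways to convert Office files (seven formats: doc, docx, xls, xlsx, ppt, pptx and txt) to PDF. This article covers only the three that have been tried in practice and are described below.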

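1. Converting through Office's built-in component services

Prerequisites

Office must be installed on the server. The minimum version is 2007 (note: after installing 2007 you must download the PDF conversion add-in separately; the original post included it as an attachment), and 2010 or later is preferable.
PHP's COM/DCOM extension must also be enabled on the server.

Note: when configuring Office component services for 32-bit Office, run mmc comexp.msc /32 rather than dcomcnfg.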

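How to call it

This approach is essentially PHP talking to a C++ component through COM. From Office 2010 onwards, Office ships a COM add-in for PDF conversion (in effect Office's C++ SDK library) that lets a document be saved as a PDF file.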

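All we need to do is get hold of this component's API and call it. Which language to call it from depends on your own needs. PHP has been covered above; Python code is provided at the end of this article.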

The API reference is:

https://docs.microsoft.com/zh-cn/office/vba/api/word.document.exportasfixedformat
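As a rough illustration of calling this API from Python instead of PHP, the sketch below drives Word through COM with pywin32. It assumes Windows with Word and pywin32 installed. The names `word_to_pdf` and `dispatch` are hypothetical and introduced here only so that the COM factory can be swapped out; 17 and 0 are the numeric values of Word's `wdExportFormatPDF` and `wdDoNotSaveChanges` constants.

```python
import os

WD_EXPORT_FORMAT_PDF = 17   # numeric value of Word's wdExportFormatPDF
WD_DO_NOT_SAVE_CHANGES = 0  # numeric value of Word's wdDoNotSaveChanges


def word_to_pdf(src, dst, dispatch=None):
    """Export a .doc/.docx file to PDF through Word's COM interface.

    `dispatch` is a hypothetical hook (not part of pywin32): any callable that
    takes a ProgID and returns an application object. By default it is
    win32com.client.DispatchEx, imported lazily so this module loads anywhere.
    """
    if dispatch is None:
        from win32com.client import DispatchEx  # requires pywin32 on Windows
        dispatch = DispatchEx

    # Word resolves relative paths against its own working directory,
    # so always hand it absolute paths.
    src, dst = os.path.abspath(src), os.path.abspath(dst)

    app = dispatch("Word.Application")
    try:
        app.Visible = False
        doc = app.Documents.Open(src)
        try:
            doc.ExportAsFixedFormat(dst, WD_EXPORT_FORMAT_PDF)
        finally:
            doc.Close(WD_DO_NOT_SAVE_CHANGES)
    finally:
        # Always quit, otherwise a hidden WINWORD.EXE process stays behind.
        app.Quit()
    return dst
```

A call such as `word_to_pdf("report.docx", "report.pdf")` would then be enough on a machine that meets the prerequisites. When calling it from a worker thread, COM has to be initialised in that thread first (for example with `pythoncom.CoInitialize()`), as the full listing at the end of the article does.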

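Drawbacks

In practice, Word files that contain special objects, for example non-image objects such as official seals, come out with black boxes after conversion, so this approach is only suitable for ordinary documents. Conversion is also on the slow side.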

2. Converting with the open-source office suite LibreOffice

Prerequisites

LibreOffice must be installed on the server. Download it from:

https://zh-cn.libreoffice.org/download/download

Then add the LibreOffice program directory to the Path environment variable.

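How to call it

The original post gave only a screenshot as reference here. Since LibreOffice is placed on the Path and PHP's exec function comes up in the notes, the conversion is presumably invoked from the command line.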
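A minimal sketch of such a command-line call from Python is given below. It relies on LibreOffice's documented `--headless`, `--convert-to` and `--outdir` options and assumes the `soffice` binary is reachable through the Path as set up above. The function names and the 120-second timeout are choices made for this example, not something taken from the original post.

```python
import os
import subprocess


def build_soffice_command(input_file, out_dir, soffice="soffice"):
    """Build the LibreOffice command line for a headless PDF conversion."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", "pdf",   # target format
        "--outdir", out_dir,     # directory that receives the PDF
        input_file,
    ]


def libreoffice_to_pdf(input_file, out_dir, soffice="soffice", timeout=120):
    """Convert one file to PDF and return the path LibreOffice writes to.

    The timeout (an arbitrary value for this sketch) guards against a hung
    LibreOffice process; check=True raises if the exit code is non-zero.
    """
    cmd = build_soffice_command(input_file, out_dir, soffice)
    subprocess.run(cmd, check=True, timeout=timeout)
    # LibreOffice keeps the base name and swaps the extension.
    stem = os.path.splitext(os.path.basename(input_file))[0]
    return os.path.join(out_dir, stem + ".pdf")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.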

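Drawbacks

LibreOffice is a separate open-source suite and does not render documents exactly the way Microsoft Office does, so colors and line weights differ slightly after conversion, and text may shift across pages a little (depending on how images are laid out in the document), though not enough to matter. On the other hand, non-image graphic objects are automatically converted to image objects when saved, which is a plus.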

3. Converting through a PDF virtual printer

Prerequisites

A PDF virtual printer must be installed on the client. Note that it is installed on the client, not the server.

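Many vendors offer virtual printers. Besides the one built in by Microsoft, the most authoritative is Adobe Acrobat, that is, Adobe PDFMaker.

After installation, the add-in is loaded automatically in Office and an extra entry appears among the save options.

Another option is PDFCreator, which can be downloaded from its official site: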

https://www.pdfforge.org/pdfcreator

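How to call it

Once the printer is installed on the client, the client triggers the conversion and the result is then uploaded to the server.
No API for driving this programmatically has been found so far; everything that turned up only covers editing or merging PDFs. If anyone has found one, please let me know. This is marked as a placeholder to be filled in later.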

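Drawbacks

There are almost none, which makes this the best of the three approaches. The only regret is that the author has not found an API that can be called after installing it on the server side.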

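II. Notes

If fonts are garbled or inconsistent after conversion, the server is most likely missing those fonts; installing them fixes the problem.
If a seal covers the text beneath it after a stamped file is converted, the seal was not set to "Behind Text" before stamping. Tell the customer to choose "Behind Text" before applying the seal.
If PHP's exec function fails, it is probably disabled. exec is a built-in function rather than an extension, so remove it from disable_functions in php.ini.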

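III. Extensions

Based on the author's research, converting Office files to HTML for display, or converting HTML to Word or PDF, can follow the same approaches.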

IV. Python code example

Adapted from another author's article, with multithreading and logging added:

import os
import pythoncom
import threading
import logging

from win32com.client import constants, gencache, DispatchEx

_Log = logging.getLogger("PDFConverter")


class PDFConverter:
    def __init__(self, _pathname, _export_path=None, _is_async=True):
        """
        :param _pathname: path of the input file or folder
        :param _export_path: path of the export directory
        :param _is_async: whether to run the conversions asynchronously
        """
        self._pathname = _pathname
        self._export_path = _export_path
        self._is_async = _is_async
        self._handle_postfix = ['doc', 'docx', 'ppt', 'pptx', 'xls', 'xlsx', 'txt']
        self._filename_list = list()

    def _parse_filenames(self):
        """
        Resolve the input path and collect every convertible file under it.
        """
        full_pathname = os.path.abspath(self._pathname)
        if os.path.isfile(full_pathname):
            if self._is_legal_postfix(self._pathname):
                self._filename_list.append(full_pathname)
            else:
                raise TypeError('File {} has an unsupported extension. Supported types: {}.'.format(self._pathname, ', '.join(self._handle_postfix)))

        elif os.path.isdir(full_pathname):
            # os.walk yields (root, dirs, files) tuples:
            # real_path is the directory currently being visited,
            # files holds the names of the files inside real_path.
            for real_path, _, files in os.walk(full_pathname):
                for name in files:
                    # real_path is already absolute because full_pathname is
                    filename = os.path.join(real_path, name)
                    if self._is_legal_postfix(filename):
                        self._filename_list.append(filename)
        else:
            raise TypeError('File or folder {} does not exist or is invalid.'.format(self._pathname))

    def _is_legal_postfix(self, filename):
        return filename.split('.')[-1].lower() in self._handle_postfix and not os.path.basename(filename).startswith(
            '~')

    @staticmethod
    def doc(input_file, export_file):
        """
        Convert doc and docx files.
        """
        pythoncom.CoInitialize()
        wordApp = None
        try:
            # gencache.EnsureModule('{00020905-0000-0000-C000-000000000046}', 0, 8, 4)
            # Start a separate process. This ProgID belongs to Adobe PDFMaker,
            # so Adobe Acrobat has to be installed.
            wordApp = DispatchEx("PDFMakerAPI.PDFMakerApp")
            # Alternative through Word itself (ProgID "Word.Application"):
            # wordApp.Visible = 0
            # wordApp.DisplayAlerts = 0
            # doc = wordApp.Documents.Open(input_file)
            # doc.ExportAsFixedFormat(export_file, constants.wdExportFormatPDF,
            #                         OptimizeFor=constants.wdExportOptimizeForPrint,
            #                         Item=constants.wdExportDocumentWithMarkup,
            #                         IncludeDocProps=True,
            #                         CreateBookmarks=constants.wdExportCreateHeadingBookmarks)
            # # Close the document
            # doc.Close()
            wordApp.CreatePDF(input_file, export_file)
        except Exception as e:
            # logging takes a format string first, then its arguments
            _Log.error("Failed to convert %s: %s", input_file, e)
        finally:
            if wordApp is not None:
                wordApp.Quit()
            # Release COM resources even when the conversion failed
            pythoncom.CoUninitialize()

    def docx(self, input_file, export_file):
        self.doc(input_file, export_file)

    def txt(self, input_file, export_file):
        self.doc(input_file, export_file)

    @staticmethod
    def xls(input_file, export_file):
        """
        Convert xls and xlsx files.
        """
        pythoncom.CoInitialize()
        xlsApp = None
        try:
            # gencache.EnsureModule('{00020905-0000-0000-C000-000000000046}', 0, 8, 4)
            xlsApp = DispatchEx("Excel.Application")
            xlsApp.Visible = 0
            xlsApp.DisplayAlerts = 0
            books = xlsApp.Workbooks.Open(input_file, False)
            # 0 is the value of xlTypePDF
            books.ExportAsFixedFormat(0, export_file)
            books.Close(False)
        except Exception as e:
            # logging takes a format string first, then its arguments
            _Log.error("Failed to convert %s: %s", input_file, e)
        finally:
            if xlsApp is not None:
                xlsApp.Quit()
            # Release COM resources even when the conversion failed
            pythoncom.CoUninitialize()

    def xlsx(self, input_file, export_file):
        self.xls(input_file, export_file)

    @staticmethod
    def ppt(input_file, export_file):
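        """
        Convert ppt and pptx files.
        """
        pythoncom.CoInitialize()
        pptApp = None
        try:
            pptApp = DispatchEx("PowerPoint.Application")
            pptApp.Visible = 0
            pptApp.DisplayAlerts = 0
            ppt = pptApp.Presentations.Open(input_file, False, False, False)
            # 2 is the value of ppFixedFormatTypePDF
            ppt.ExportAsFixedFormat(export_file, 2, PrintRange=None)
            ppt.Close()
        except Exception as e:
            # logging takes a format string first, then its arguments
            _Log.error("Failed to convert %s: %s", input_file, e)
        finally:
            if pptApp is not None:
                pptApp.Quit()
            # Release COM resources even when the conversion failed
            pythoncom.CoUninitialize()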

    def pptx(self, input_file, export_file):
        self.ppt(input_file, export_file)

    def converter(self):
        """
        Batch-convert the files, dispatching on each file's extension.
        """
        export_dir = None
        if self._export_path and os.path.isdir(self._export_path):
            # COM servers need absolute paths
            export_dir = os.path.abspath(self._export_path)

        # Resolve the input path into self._filename_list
        try:
            self._parse_filenames()
        except TypeError as e:
            _Log.error("Failed to resolve the input file or folder: %s", e)
            return 1

        _Log.info('Files to convert: %d', len(self._filename_list))
        threads = []
        for filename in self._filename_list:
            postfix = filename.split('.')[-1].lower()
            funcCall = getattr(self, postfix)
            # Without a valid export directory, write next to the source file
            target_dir = export_dir or os.path.dirname(filename)
            # Use only the base name; joining an absolute path would discard target_dir
            pdf_name = os.path.splitext(os.path.basename(filename))[0] + '.pdf'
            export_file = os.path.join(target_dir, pdf_name)

            if self._is_async:
                t = threading.Thread(target=funcCall, args=(filename, export_file))
                threads.append(t)
                t.start()
            else:
                funcCall(filename, export_file)

        # Wait for every worker thread so the message below is accurate
        for t in threads:
            t.join()

        _Log.info('Conversion finished.')
        return 0


if __name__ == "__main__":
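    # Accepts either a folder (batch conversion) or a single file.
    # By default files are converted asynchronously and each PDF is written
    # next to its source file; an export directory can also be specified.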
    folder = 'tmp'
    pathname = os.path.join(os.path.abspath('.'), folder)
    # pathname = 'test.doc'
    PDF = PDFConverter(pathname)
    PDF.converter()

Note that the line

gencache.EnsureModule('{00020905-0000-0000-C000-000000000046}', 0, 8, 4)

is not actually used. gencache can also be used to start the process, but that is not explored here.
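For readers who do want to try the gencache route, here is a hedged sketch. `gencache.EnsureDispatch` generates (or reuses) pywin32's early-bound wrappers for the application's type library, which is what makes named constants such as `constants.wdExportFormatPDF` usable. It assumes Windows, Word and pywin32; `export_with_named_constants` and `word_to_pdf_early_bound` are hypothetical helper names introduced for this example.

```python
def export_with_named_constants(app, constants, src, dst):
    """Export `src` to `dst` as PDF using named Word constants.

    `app` is a Word application object and `constants` the object that holds
    the Word enum values (win32com.client.constants in real use). Both are
    passed in so this function can be exercised without Office installed.
    """
    doc = app.Documents.Open(src)
    try:
        doc.ExportAsFixedFormat(
            dst,
            constants.wdExportFormatPDF,
            OptimizeFor=constants.wdExportOptimizeForPrint,
        )
    finally:
        doc.Close(False)  # close without saving changes


def word_to_pdf_early_bound(src, dst):
    """Start Word via gencache and convert one document (absolute paths)."""
    # Imported lazily: pywin32 is only available on Windows.
    from win32com.client import constants, gencache

    # EnsureDispatch builds the type-library cache on first use, which is
    # what populates `constants` with Word's named values.
    app = gencache.EnsureDispatch("Word.Application")
    try:
        export_with_named_constants(app, constants, src, dst)
    finally:
        app.Quit()
```

With `DispatchEx`, as used in the listing above, no wrappers are generated, so the numeric values (17 for `wdExportFormatPDF`) would have to be passed instead of the named constants.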

Corrections and suggestions are welcome.
