Garbled text in Java Word-to-PDF conversion: Chinese characters not displayed

Reposted

footballboy 2023-08-20 15:13:20


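1. Read the articles by these experts first; they are better and more thorough than mine.
2. The material I could find online is uneven in quality and good articles are hard to find. To spare those who come later from reinventing the wheel, and to remind myself to keep learning, I am writing this post as a summary and consolidation.
3. My knowledge is limited and I did not take notes right after finishing the task, so I can only do my best here. If you run into problems using this, or can solve the problems described in this post, replies and discussion are welcome.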

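Some time ago I had a requirement at work: convert a generated Word document to PDF. The specific requirements were:

1. Ideally it should run across systems (Windows and Linux)
2. As little extra configuration as possible, and no third-party software may be installed
3. The formatting must not break badly; it should be reproduced as faithfully as possible

Without the cross-system requirement this would be simple, so the real difficulty lies in running on both systems. Fortunately my lead gave me two days to figure it out. After some study (Baidu), research (Google) and discussion (Stack Overflow), I worked out the following approaches against the requirements and built a demo for each. I also point out the problems that remain in each of them.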

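~~1. Direct conversion with poi (extremely messy with complex formatting, abandoned)~~
2. Going through HTML (the whole document is shifted)
~~3. aspose (the full-version jar is paid, abandoned)~~
4. jacob (not cross-platform; one of the solutions currently in use)
5. Docx4j (spaces are lost; one of the solutions currently in use)

ps: There are also approaches that rely on third-party software such as libreoffice and openoffice. Since the requirements rule them out, I will not cover them.

ps2: If it is only for personal study, I would recommend the trial version of aspose: it is convenient and the code is simple.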
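
Since jacob only works on Windows and Docx4j runs anywhere, one way to combine the two chosen solutions is to pick the backend from the operating system at runtime. The article does not spell out how the two are combined, so the following is only a minimal sketch under that assumption; the class name `ConverterChooser` and the enum `Backend` are hypothetical.

```java
import java.util.Locale;

/** Picks a conversion backend from the operating system name (illustrative sketch). */
final class ConverterChooser {

    /** The two backends the article settles on. */
    enum Backend { JACOB, DOCX4J }

    /**
     * @param osName value of the "os.name" system property, e.g. "Windows 10" or "Linux"
     * @return JACOB on Windows (needs a local Office installation), DOCX4J everywhere else
     */
    static Backend choose(String osName) {
        String os = osName == null ? "" : osName.toLowerCase(Locale.ROOT);
        return os.startsWith("windows") ? Backend.JACOB : Backend.DOCX4J;
    }

    public static void main(String[] args) {
        Backend backend = choose(System.getProperty("os.name"));
        System.out.println("Selected backend: " + backend);
    }
}
```

The caller would then dispatch to the jacob-based or the Docx4j-based conversion method depending on the returned value.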

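Going through HTML

This approach grew out of the failed approach 1. Approach 1 lost too much formatting; since HTML preserves formatting fairly well, I tried going through HTML instead.
Converting with Flying Saucer is considerably more comfortable than using iText directly.

Pitfalls:
1. When the document font is "宋体(中文正文)" (SimSun as the East Asian body font), the font seems to be recognized as Calibri rather than SimSun and is therefore lost. I tried forcing the font in the HTML, but that had no effect. In the end I solved it by forcing every "宋体(中文正文)" run in the Word file to another font that can be recognized.
2. Pay special attention to the poi and xdocreport versions. The ones used here are clearly rather old: in later releases (for example the commonly used poi 4.0), poi changed the paths of the classes that xdocreport loads, while the xdocreport published on Maven has not been updated, so class loading fails.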

**Unresolved problem:** the generated PDF is shifted to one side as a whole, so part of the text is cut off
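
Pitfall 1 boils down to replacing any font family the renderer cannot resolve with one it can. The following is a minimal sketch of that decision in plain Java; the class `FontFallback` and the set of "known" families are assumptions for illustration, not part of the original code. Note that the lookup compares string contents, whereas comparing Java strings with `==` only checks object identity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Decides which font family to write back into a run (illustrative sketch). */
final class FontFallback {

    /** Families the PDF renderer is assumed to resolve; adjust to the fonts you actually ship. */
    static final Set<String> KNOWN =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("SimSun", "SimHei", "KaiTi")));

    /**
     * @param family   font family reported for a run, may be null
     * @param fallback family to use when the reported one is unknown
     * @return the reported family if it is known, otherwise the fallback
     */
    static String resolve(String family, String fallback) {
        // Set.contains uses equals(), so equal content matches even for distinct String objects
        if (family != null && KNOWN.contains(family)) {
            return family;
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(resolve("Calibri", "SimSun"));
        System.out.println(resolve("SimHei", "SimSun"));
    }
}
```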

First, the Maven dependencies

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>

		<dependency>
		    <groupId>org.apache.commons</groupId>
		    <artifactId>commons-compress</artifactId>
		    <version>1.19</version>
		</dependency>


        <!--iText and flying saucer-->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-collections4</artifactId>
            <version>4.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.xmlbeans</groupId>
            <artifactId>xmlbeans</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>com.itextpdf</groupId>
            <artifactId>itextpdf</artifactId>
            <version>5.5.13</version>
        </dependency>
        <dependency>
            <groupId>com.itextpdf.tool</groupId>
            <artifactId>xmlworker</artifactId>
            <version>5.5.13</version>
         </dependency>
         <dependency>
             <groupId>com.itextpdf</groupId>
             <artifactId>itext-asian</artifactId>
             <version>5.2.0</version>
         </dependency>
         <dependency>
             <groupId>org.xhtmlrenderer</groupId>
             <artifactId>flying-saucer-pdf</artifactId>
             <version>9.0.3</version>
         </dependency>

        <!--poi-->
        <dependency>
            <groupId>xerces</groupId>
            <artifactId>xercesImpl</artifactId>
            <version>2.11.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.14</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.14</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.14</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.14</version>
        </dependency>


        <!--XWPF-->
        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>xdocreport</artifactId>
            <version>2.0.2</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>fr.opensagres.xdocreport.document</artifactId>
            <version>2.0.2</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.core</artifactId>
            <version>1.0.6</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.pdf</artifactId>
            <version>1.0.6</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
            <version>1.0.6</version>
        </dependency>

Then the main code

/**
     * Converts a docx Word document to HTML
     *
     * @param fileName
     *            path of the docx file
     * @param outPutFile
     *            path of the HTML output file
     * @param imagePath
     *            directory for the extracted images
     * @throws TransformerException
     * @throws IOException
     * @throws ParserConfigurationException
     */
    public static void docx2Html(String fileName, String outPutFile,String imagePath) throws TransformerException, IOException, ParserConfigurationException {
        String fileOutName = outPutFile;
        long startTime = System.currentTimeMillis();
        XWPFDocument document = new XWPFDocument(new FileInputStream(fileName));
        List<XWPFParagraph> paragraphs = document.getParagraphs();
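        //1. Force the run fonts to ones that can be resolved, so Chinese text does not disappear
        for (XWPFParagraph paragraph:paragraphs) {
            List<XWPFRun> runs = paragraph.getRuns();
            for (XWPFRun run:runs) {
                //compare string contents with equals; == only compares references
                if("Calibri".equals(run.getFontFamily()))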
                {
                    run.setFontFamily("SimHei");
                }
                run.setFontFamily("SimSun", XWPFRun.FontCharRange.ascii);
            }
        }
        XHTMLOptions options = XHTMLOptions.create().indent(4);
        // Export images
        File imageFolder = new File(imagePath);
        options.setExtractor(new FileImageExtractor(imageFolder));
        // URI resolver
        options.URIResolver(new FileURIResolver(imageFolder));
        File outFile = new File(fileOutName);
        outFile.getParentFile().mkdirs();
        OutputStreamWriter outputStreamWriter = new OutputStreamWriter(new FileOutputStream(outFile), StandardCharsets.UTF_8);
        XHTMLConverter xhtmlConverter = (XHTMLConverter) XHTMLConverter.getInstance();
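        xhtmlConverter.convert(document, outputStreamWriter, options);
        // Close the writer so the HTML is completely written before it is read back
        outputStreamWriter.close();
        // FileUtil is not shown in this post; presumably a project-local helper class
        String html = FileUtil.readFileToString(fileOutName, "html");
        OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(new File(fileOutName)), StandardCharsets.UTF_8);
        //2. Prepend a standard XHTML doctype to fix garbled Chinese text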
        html="<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"+html;
        int i = html.indexOf("<head>");
        StringBuffer buffer = new StringBuffer(html);
        html=buffer.insert(i+6,"<style type=\"text/css\">\n" +
                "    *\n" +
                "    {\n" +
                "        padding-left: 20pt;\n" +
                "        padding-right: -20pt;\n" +
                "    }\n" +
                "</style>").toString();
        writer.write(html);
        writer.flush();
        writer.close();
        System.out.println("Generate " + fileOutName + " with " + (System.currentTimeMillis() - startTime) + " ms.");
    }
/**
     * Converts HTML to PDF
     *
     * @param html
     *            HTML content as a string
     * @param pdfName
     *            name of the PDF file
     * @param fontDir
     *            path of the directory containing the font files
     * @param pdfDestPath
     * 			  output path of the PDF
     */

public static void html2pdf(String html, String pdfName, String fontDir,String pdfDestPath) {
        try {
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ITextRenderer renderer = new ITextRenderer();
            ITextFontResolver fontResolver = (ITextFontResolver) renderer.getSharedContext().getFontResolver();
            //Register every font file in the directory so Chinese text can be rendered
            File f = new File(fontDir);
            if (f.isDirectory()) {
                File[] files = f.listFiles((dir, name) -> {
                    String lower = name.toLowerCase();
                    return lower.endsWith(".otf") || lower.endsWith(".ttf") || lower.endsWith(".ttc");
                });
                //listFiles returns null if the directory cannot be read
                for (int i = 0; files != null && i < files.length; i++) {
                    fontResolver.addFont(files[i].getAbsolutePath(), BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
                }
            }
            //End of font registration
            renderer.setDocumentFromString(html);
            renderer.layout();
            renderer.createPDF(os);
            renderer.finishPDF();
            byte[] buff = os.toByteArray();
            //Save to disk
            FileUtil.byte2File(buff,pdfDestPath,pdfName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
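
The `docx2Html` method above splices a `<style>` element into the generated HTML by searching for `<head>`. Below is a small sketch of that splice as a standalone helper that fails loudly when no `<head>` is present, instead of inserting at a wrong offset. The class `HtmlStyleInjector` and the CSS in `PAGE_CSS` are assumptions for illustration: Flying Saucer honours `@page` rules and needs the `font-family` in the CSS to match a font registered with the renderer, but whether these particular values remove the horizontal shift reported above has not been verified.

```java
/** Inserts a style element right after the opening head tag (illustrative sketch). */
final class HtmlStyleInjector {

    /** Example CSS: an explicit page box and a body font assumed to be registered with the renderer. */
    static final String PAGE_CSS =
            "@page { size: A4; margin: 20mm; }\n" +
            "body { font-family: SimSun; }";

    /**
     * @param html XHTML produced by the converter, expected to contain a plain "<head>" tag
     * @param css  stylesheet text to embed
     * @return the HTML with a style element inserted directly after "<head>"
     * @throws IllegalArgumentException if no "<head>" tag is found
     */
    static String injectCss(String html, String css) {
        String marker = "<head>";
        int pos = html.indexOf(marker);
        if (pos < 0) {
            throw new IllegalArgumentException("no <head> element found");
        }
        int insertAt = pos + marker.length();
        return html.substring(0, insertAt)
                + "<style type=\"text/css\">" + css + "</style>"
                + html.substring(insertAt);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body>中文</body></html>";
        System.out.println(injectCss(html, PAGE_CSS));
    }
}
```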

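Using Jacob

This is the solution when cross-platform support is not a concern. In terms of both configuration effort and conversion fidelity it is excellent, with almost no problems (in practice it presumably just calls the "Save As" function of the local Office installation).
Drawbacks
1. Not cross-platform
2. Requires Office and the SaveAsPdf add-in on Windows
3. The jacob dll has to be placed in the local jre/bin, which pollutes the JRE to some extent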

Link: https://pan.baidu.com/s/1eBCOnYkem2XdwXI8d_A7_g
Extraction code: 813q
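
Regarding drawback 3: as far as I know, JACOB's library loader also honours the `jacob.dll.path` system property, which would allow the dll to stay in a project directory instead of jre/bin. The sketch below only prepares that property; the class `JacobDllLocator`, the directory name `native` and the dll file names (which follow the naming of the JACOB 1.18 distribution) are assumptions. The property has to be set before the first JACOB class is loaded.

```java
import java.io.File;

/** Points JACOB at a dll outside jre/bin via a system property (illustrative sketch). */
final class JacobDllLocator {

    /**
     * @param osArch value of the "os.arch" system property, e.g. "amd64" or "x86"
     * @return the dll file name matching the JVM bitness (names assumed from the 1.18 distribution)
     */
    static String dllNameFor(String osArch) {
        boolean is64Bit = osArch != null && osArch.contains("64");
        return is64Bit ? "jacob-1.18-x64.dll" : "jacob-1.18-x86.dll";
    }

    /**
     * Sets "jacob.dll.path" to the dll inside the given directory.
     * Must run before any com.jacob class is touched.
     */
    static File configure(File dllDir, String osArch) {
        File dll = new File(dllDir, dllNameFor(osArch));
        System.setProperty("jacob.dll.path", dll.getAbsolutePath());
        return dll;
    }

    public static void main(String[] args) {
        // "native" is a hypothetical directory that ships with the application
        File dll = configure(new File("native"), System.getProperty("os.arch"));
        System.out.println("jacob.dll.path = " + dll.getAbsolutePath());
    }
}
```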

First, the Maven dependency

<dependency>
            <groupId>com.hynnet</groupId>
            <artifactId>jacob</artifactId>
            <version>1.18</version>
        </dependency>

Then the main code

private static final int wdFormatPDF = 17; // PDF format

    public static void word2PDF(String sfileName, String toFileName) {
        System.out.println("Starting Word...");
        long start = System.currentTimeMillis();
        ActiveXComponent app = null;
        Dispatch doc = null;
        try {
            app = new ActiveXComponent("Word.Application");
            app.setProperty("Visible", new Variant(false));
            Dispatch docs = app.getProperty("Documents").toDispatch();
            doc = Dispatch.call(docs, "Open", sfileName).toDispatch();
            System.out.println("Opened document..." + sfileName);
            System.out.println("Converting document to PDF..." + toFileName);
            File tofile = new File(toFileName);
            if (tofile.exists()) {
                tofile.delete();
            }
            Dispatch.call(doc, "SaveAs", toFileName, // FileName
                    wdFormatPDF);
            long end = System.currentTimeMillis();
            System.out.println("Conversion finished, took " + (end - start) + "ms.");
        } catch (Exception e) {
            System.out.println("========Error: conversion failed: " + e.getMessage());
        } finally {
            // doc stays null if Word could not be started or the file could not be opened
            if (doc != null) {
                Dispatch.call(doc, "Close", false);
                System.out.println("Document closed");
            }
            if (app != null)
                app.invoke("Quit", new Variant[]{});
            // Without this call the winword.exe process will not exit
            ComThread.Release();
        }
    }

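Using Docx4j

This is a fairly complete solution: no extra configuration and no plug-ins to install, just a Maven import. It works across platforms and the conversion fidelity is acceptable.

Pitfalls:
1. Depending on the project there may be dependency conflicts; check the Maven dependency tree to resolve them
2. Chinese text is still garbled unless Chinese fonts are made available, and it is best to force the font as well (the "宋体(中文正文)" font from approach 2 also turns into garbled text here)
3. On Linux the Chinese fonts must be installed first, otherwise the font mapping cannot work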

Unresolved problem: spaces are lost and the formatting is slightly off
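
For pitfall 3 it can help to check up front which font files are actually present on the machine. The sketch below walks a directory and lists `.ttf`, `.ttc` and `.otf` files using only the JDK. It is independent of how docx4j discovers fonts internally; the class `FontScanner` and the default directory `/usr/share/fonts` (a common location on Linux distributions) are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Lists font files below a directory (illustrative sketch). */
final class FontScanner {

    /**
     * @param root directory to search recursively
     * @return sorted list of .ttf/.ttc/.otf files, or an empty list if root is not a directory
     */
    static List<Path> findFontFiles(Path root) throws IOException {
        if (!Files.isDirectory(root)) {
            return Collections.emptyList();
        }
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                        return name.endsWith(".ttf") || name.endsWith(".ttc") || name.endsWith(".otf");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Default location is an assumption; pass another directory as the first argument
        Path root = Paths.get(args.length > 0 ? args[0] : "/usr/share/fonts");
        findFontFiles(root).forEach(System.out::println);
    }
}
```

If the list contains no Chinese font such as SimSun, the font mapping in the conversion code has nothing to map to.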

First, the Maven dependencies

<!--docx4j-->
        <dependency>
            <groupId>com.itextpdf</groupId>
            <artifactId>itextpdf</artifactId>
            <version>5.4.3</version>
        </dependency>
        <dependency>
            <groupId>org.docx4j</groupId>
            <artifactId>docx4j</artifactId>
            <version>6.1.2</version>
        </dependency>
        <dependency>
            <groupId>org.docx4j</groupId>
            <artifactId>docx4j-export-fo</artifactId>
            <version>6.0.0</version>
        </dependency>

        <!--poi-->
        <dependency>
            <groupId>xerces</groupId>
            <artifactId>xercesImpl</artifactId>
            <version>2.11.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>4.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>4.0.0</version>
        </dependency>

The main code

/**
     * Converts Word (docx) to PDF
     *
     * @param wordPath path of the docx file
     * @param pdfOutPath output path of the PDF file
     */
    public static void convertDocx2Pdf(String wordPath, String pdfOutPath) throws IOException {
        long startTime = System.currentTimeMillis();
        OutputStream os = null;
        InputStream is = null;
        FileInputStream fis = null;
        FileOutputStream fos = null;
        try {
            fis = new FileInputStream(wordPath);
            XWPFDocument document = new XWPFDocument(fis);
            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph paragraph : paragraphs) {
                List<XWPFRun> runs = paragraph.getRuns();
                for (XWPFRun run : runs) {
                    run.setFontFamily("宋体");
                }
            }
            fos = new FileOutputStream(wordPath);
            document.write(fos);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (fis != null) {
                fis.close();
            }
            if (fos != null) {
                fos.flush();
                fos.close();
            }
        }
        try {
            is = new FileInputStream(new File(wordPath));
            WordprocessingMLPackage mlPackage = WordprocessingMLPackage.load(is);
            Mapper fontMapper = new IdentityPlusMapper();
            fontMapper.put("隶书", PhysicalFonts.get("LiSu"));
            fontMapper.put("宋体", PhysicalFonts.get("SimSun"));
            fontMapper.put("微软雅黑", PhysicalFonts.get("Microsoft Yahei"));
            fontMapper.put("黑体", PhysicalFonts.get("SimHei"));
            fontMapper.put("楷体", PhysicalFonts.get("KaiTi"));
            fontMapper.put("新宋体", PhysicalFonts.get("NSimSun"));
            fontMapper.put("华文行楷", PhysicalFonts.get("STXingkai"));
            fontMapper.put("华文仿宋", PhysicalFonts.get("STFangsong"));
            fontMapper.put("宋体扩展", PhysicalFonts.get("simsun-extB"));
            fontMapper.put("仿宋", PhysicalFonts.get("FangSong"));
            fontMapper.put("仿宋_GB2312", PhysicalFonts.get("FangSong_GB2312"));
            fontMapper.put("幼圆", PhysicalFonts.get("YouYuan"));
            fontMapper.put("华文宋体", PhysicalFonts.get("STSong"));
            fontMapper.put("华文中宋", PhysicalFonts.get("STZhongsong"));
            mlPackage.setFontMapper(fontMapper);
            os = new java.io.FileOutputStream(pdfOutPath);
            //docx4j: docx to pdf
            FOSettings foSettings = Docx4J.createFOSettings();
            foSettings.setWmlPackage(mlPackage);
            Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
            is.close();//close the input stream
            os.close();//close the output stream
            //report the elapsed time
            System.out.println("Conversion finished in " + (System.currentTimeMillis() - startTime) + " ms.");
        } catch (Exception e) {
            e.printStackTrace();
            try {
                if (is != null) {
                    is.close();
                }
                if (os != null) {
                    os.close();
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        } finally {
            //Note: this deletes the source docx, which was already overwritten with the forced font above
            File file = new File(wordPath);
            if (file.isFile()) {
                file.delete();
            }
        }
    }
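
Note that `convertDocx2Pdf` overwrites the input file with the font-forced version and deletes it at the end, which is fine for a generated throwaway document but destructive otherwise. A minimal sketch of doing the same work on a temporary copy is shown below; the class `TempCopy` and the interface `PathAction` are hypothetical helpers, not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Runs an action on a temporary copy so the source file stays untouched (illustrative sketch). */
final class TempCopy {

    /** Work to perform on the temporary copy, e.g. forcing fonts and converting to PDF. */
    interface PathAction {
        void run(Path workingCopy) throws Exception;
    }

    /**
     * Copies the source to a temp file, runs the action on the copy and always deletes the copy.
     *
     * @param source the original docx, never modified or deleted by this method
     * @param action the work to do on the copy
     */
    static void withTempCopy(Path source, PathAction action) throws Exception {
        Path tmp = Files.createTempFile("word2pdf-", ".docx");
        try {
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            action.run(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws Exception {
        Path source = Files.createTempFile("demo-", ".docx");
        try {
            withTempCopy(source, copy -> System.out.println("Working on " + copy));
            System.out.println("Source still exists: " + Files.exists(source));
        } finally {
            Files.deleteIfExists(source);
        }
    }
}
```

With such a helper, the conversion method would receive the path of the copy as `wordPath`, and its own delete step would then only remove the temporary file.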
