最近做个项目,数据库存储了别的项目获取的含有中文的富文本格式字符串,这边要求转成中文进行处理,所以研究了一下,项目实施后也遇到了几个问题,在这里分享一下,本文主要关注如何将富文本中的中文正确解析出来:

主要解决问题:

1 如何解析富文本中的中文;

2 特殊的富文本格式怎么处理;

3 解析出现中文乱码怎么应对(不是编码格式原因造成的);

问题3参考https://www.dazhuanlan.com/2019/08/16/5d55fcd632b0f/


富文本:缩写问rtf,是一种标准。会以特定的符号代表样式。通过富文本组件加载富文本时,会直接显示成相应的样式。

常见富文本格式如下:

{\rtf1\ansi\ansicpg936\deff0{\fonttbl{\f0\fnil\fcharset134 \'cb\'ce\'cc\'e5;}{\f1\fnil\fcharset134 \'ba\'da\'cc\'e5;}}
 {\colortbl ;\red0\green0\blue0;}
 \viewkind4\uc1\pard\cf1\lang2052\ul\f0\fs22 2020\'c4\'ea07\'d4\'c219\'c8\'d506\'ca\'b100\'b7\'d6\'d6\'c1\'c1\'ed\'d3\'d0\'c3\'fc\'c1\'ee\'ca\'b1\'a3\'ac\b\f1\'b3\'c9\'d3\'e5\'cf\'df\'c4\'da\'bd\'ad\'d5\'be\'d6\'c1\'97\'c0\'c4\'be\'d5\'f2\'d5\'be\'bc\'e4220km432m\'d6\'c1230km897m\'b4\'a6\b0\f0\'b0\'b4\'d5\'d5\'a1\'b6\'b3\'c9\'b6\'bc\'be\'d6\'bc\'af\'cd\'c5\'b9\'ab\'cb\'be\'b9\'d8\'d3\'da\'b9\'ab\'b2\'bc\'d1\'b4\'c6\'da\'b7\'c0\'ba\'e9\'cf\'de\'cb\'d9\'cf\'e0\'b9\'d8LKJ\'bb\'f9\'b4\'a1\'ca\'fd\'be\'dd\'b5\'c4\'cd\'a8\'d6\'aa\'a1\'b7\'b3\'c9\'cc\'fa\'bf\'c6\'d0\'c5\'b5\'e7\'a1\'be2020\'a1\'bf231\'ba\'c5\'ce\'c4\'bc\'fe\'d2\'aa\'c7\'f3\'a3\'ac\b\f1\'bf\'cd\'b3\'b5\'cf\'de\'cb\'d960km/h\ulnone\b0\f0\'a1\'a3\'d7\'d42020\'c4\'ea07\'d4\'c219\'c8\'d506\'ca\'b100\'b7\'d6\'c6\'f0\'c7\'b0\'b7\'a22020\'c4\'ea07\'d4\'c215\'c8\'d552089\'ba\'c5\'d4\'cb\'d0\'d0\'bd\'d2\'ca\'be\'c3\'fc\'c1\'ee\'b7\'cf\'d6\'b9\'a1\'a3
 \par 
 \par }

方法一:手动解析

将富文本中的格式去掉,只留文本内容,然后将文本内容转成中文。该方法可以用,但是有两个问题,一个是去掉格式很繁琐,还容易漏,不易维护,另一个时方法不标准,容易因为文本格式变动出一些问题。


方法二:使用swing的富文本自建辅助解析,代码如下:

public static String parseRtf(String rtfJieshi) {
    String result = "";
    try {
        System.out.println(rtfJieshi);
        //替换特殊样式符号
        //rtfJieshi = transSpecialCharactersToGBK(rtfJieshi);
        //替换特殊字节
        //rtfJieshi = transSpecialISO(rtfJieshi);
        byte[] b = rtfJieshi .getBytes();
        DefaultStyledDocument styledDoc = new DefaultStyledDocument();
        new RTFEditorKit().read(new ByteArrayInputStream(b), styledDoc, 0);
        byte[] data = styledDoc.getText(0,styledDoc.getLength()).getBytes("ISO8859_1");
        //data = inverseSpecial(data);
        //注意这边加上GBK
        result = new String(data,"GBK");
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}

该方法使用Java swing自带组件,可以处理绝大部分情况。

中间发现部分字符,swing的富文本组件不识别,于是做了个样式符转换成GBK编码的转换器(其实就转换了四个符号,其他符号预留,转成GBK也是我这边时GBK编码才这样转):

import java.util.Dictionary;
import java.util.Hashtable;

/**
 * rtf转换时,部分Special Characters无法转义成正确的gbk编码,这里在使用RTFReader解析前,先将这些Special Characters转换成相应的gkb编码
 *
 */
public class ZZCTQTransKey {
    static Dictionary<String, String> textKeywords = null;
    static {
        textKeywords = new Hashtable<String, String>();
        //textKeywords.put("\\",         "\\");
        //textKeywords.put("{",          "{");
        //textKeywords.put("}",          "}");
        //textKeywords.put(" ",          "\u00A0");  /* not in the spec... */
        //textKeywords.put("~",          "\u00A0");  /* nonbreaking space */
        //textKeywords.put("_",          "\u2011");  /* nonbreaking hyphen */
        //textKeywords.put("bullet",     "\u2022");
        //textKeywords.put("emdash",     "\u2014");
        //textKeywords.put("emspace",    "\u2003");
        //textKeywords.put("endash",     "\u2013");
        //textKeywords.put("enspace",    "\u2002");
        textKeywords.put("\\\\ldblquote",  "\\\\'a1\\\\'b0");//
        textKeywords.put("\\\\lquote",     "\\\\'a1\\\\'ae");//
        //textKeywords.put("ltrmark",    "\u200E");
        textKeywords.put("\\\\rdblquote",  "\\\\'a1\\\\'b1");//
        textKeywords.put("\\\\rquote",     "\\\\'a1\\\\'af");//
        //textKeywords.put("rtlmark",    "\u200F");
        //textKeywords.put("tab",        "\u0009");
        //textKeywords.put("zwj",        "\u200D");
        //textKeywords.put("zwnj",       "\u200C");
      /* There is no Unicode equivalent to an optional hyphen, as far as
         I can tell. */
        //textKeywords.put("-",          "\u2027");
    }
}

后面使用时,又遇到了乱码,大部分文本是正确的,有时候会是乱码,仔细研究发现,在转成中文之前,byte[]数组的内容,跟开头少了字节。后来发现,是ISO5589-1的原因,ISO5589-1有部分区域是没有编码的,遇到这一部分字节是,会丢弃掉:

https://de.wikipedia.org/wiki/ISO_8859-1,它的编码区域如下图:

deflate文本解密 java java解析富文本_deflate文本解密 java

可以看到,在00-1F 7F-9F这两个区域,它都没有对应的字符,遇到这些字节时,ISO8859-1会将他们丢弃。我的情况是遇到了椑木镇三个字,问题出在椑字上,他的GBK编码是97C0,第一个字节97刚好落在了上面的空白区域,于是转出来以后,椑就只剩下C0了,然后C0又被自动和后面一个字节结合,就出现了一串乱码或者生僻字杂文。

解决办法:遇到00-1F和7F-9F的字节时,进行手工替换,替换成ISO8859-1能够识别的字节,然后在转回中文之前,换回来。

我的具体办法:遇到01的话,我会将01往下移动三行(这个自己定,我看编码图时,觉得下移三行好用),将01变成31,然后将31前面加上fbfcfdfeff的前缀,成为fbfcfdfeff31(这个纯属个人思路,不喜欢可以喷)。转换之后,通过swing组件将内容解析,只剩下文本内容。这时候,fbfcfdfeff31的每个字节都能被ISO8859-1识别出来,会保留下来,将它转回01,就保证了字节的完整性。最后转成中文即可。

全部代码如下:

import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.ByteArrayInputStream;
import java.util.Enumeration;

public class RtfToZh {

    public static String parseRtf(String rtfJieshi) {
        String result = "";
        try {
            System.out.println(rtfJieshi);
            //替换特殊样式符号
            //rtfJieshi = transSpecialCharactersToGBK(rtfJieshi);
            //替换特殊字节
            //rtfJieshi = transSpecialISO(rtfJieshi);
            byte[] b = rtfJieshi .getBytes();
            DefaultStyledDocument styledDoc = new DefaultStyledDocument();
            new RTFEditorKit().read(new ByteArrayInputStream(b), styledDoc, 0);
            byte[] data = styledDoc.getText(0,styledDoc.getLength()).getBytes("ISO8859_1");
            //data = inverseSpecial(data);
            //注意这边加上GBK
            result = new String(data,"GBK");
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

    /**
     * 将Special Characters转换成gbk编码
     * @return
     */
    private static String transSpecialCharactersToGBK(String rtfStr) {
        if (rtfStr != null && rtfStr.length() > 0) {
            for(Enumeration e = TransKey.textKeywords.keys(); e.hasMoreElements();){
                String thisName=e.nextElement().toString();
                String thisValue= TransKey.textKeywords.get(thisName);
                rtfStr = rtfStr.replaceAll(thisName, thisValue);
            }
            return rtfStr;
        } else {
            return "";
        }
    }

    /**
     * 将ISO8859-1没有定义的字节进行转义
     * @return
     */
    private static String transSpecialISO(String rtfStr) {
        return ISO8859ISpecialByte.conversionStr(rtfStr);
    }

    /**
     * 将ISO8859-1没有定义的字节进行反转义
     * @return
     */
    private static byte[] inverseSpecial(byte[] data) {
        return ISO8859ISpecialByte.inverseStr(data);
    }

    public static void main(String[] args) throws Exception{
        String rtfStr = "{\\rtf1\\ansi\\ansicpg936\\deff0{\\fonttbl{\\f0\\fnil\\fcharset134 \\'cb\\'ce\\'cc\\'e5;}{\\f1\\fnil\\fcharset134 \\'ba\\'da\\'cc\\'e5;}}\n" +
                "{\\colortbl ;\\red0\\green0\\blue0;}\n" +
                "\\viewkind4\\uc1\\pard\\cf1\\lang2052\\ul\\f0\\fs22 2020\\'c4\\'ea07\\'d4\\'c219\\'c8\\'d506\\'ca\\'b100\\'b7\\'d6\\'d6\\'c1\\'c1\\'ed\\'d3\\'d0\\'c3\\'fc\\'c1\\'ee\\'ca\\'b1\\'a3\\'ac\\b\\f1\\'b3\\'c9\\'d3\\'e5\\'cf\\'df\\'c4\\'da\\'bd\\'ad\\'d5\\'be\\'d6\\'c1\\'97\\'c0\\'c4\\'be\\'d5\\'f2\\'d5\\'be\\'bc\\'e4220km432m\\'d6\\'c1230km897m\\'b4\\'a6\\b0\\f0\\'b0\\'b4\\'d5\\'d5\\'a1\\'b6\\'b3\\'c9\\'b6\\'bc\\'be\\'d6\\'bc\\'af\\'cd\\'c5\\'b9\\'ab\\'cb\\'be\\'b9\\'d8\\'d3\\'da\\'b9\\'ab\\'b2\\'bc\\'d1\\'b4\\'c6\\'da\\'b7\\'c0\\'ba\\'e9\\'cf\\'de\\'cb\\'d9\\'cf\\'e0\\'b9\\'d8LKJ\\'bb\\'f9\\'b4\\'a1\\'ca\\'fd\\'be\\'dd\\'b5\\'c4\\'cd\\'a8\\'d6\\'aa\\'a1\\'b7\\'b3\\'c9\\'cc\\'fa\\'bf\\'c6\\'d0\\'c5\\'b5\\'e7\\'a1\\'be2020\\'a1\\'bf231\\'ba\\'c5\\'ce\\'c4\\'bc\\'fe\\'d2\\'aa\\'c7\\'f3\\'a3\\'ac\\b\\f1\\'bf\\'cd\\'b3\\'b5\\'cf\\'de\\'cb\\'d960km/h\\ulnone\\b0\\f0\\'a1\\'a3\\'d7\\'d42020\\'c4\\'ea07\\'d4\\'c219\\'c8\\'d506\\'ca\\'b100\\'b7\\'d6\\'c6\\'f0\\'c7\\'b0\\'b7\\'a22020\\'c4\\'ea07\\'d4\\'c215\\'c8\\'d552089\\'ba\\'c5\\'d4\\'cb\\'d0\\'d0\\'bd\\'d2\\'ca\\'be\\'c3\\'fc\\'c1\\'ee\\'b7\\'cf\\'d6\\'b9\\'a1\\'a3\n" +
                "\\par \n" +
                "\\par }";
        System.out.println(parseRtf(rtfStr));
    }
}
import java.util.ArrayList;
import java.util.List;

/**
 * 在将富文本指令解析成中文时,借助了ISO8859-1编码格式,该编码格式是单字节编码格式,不会丢失字节。
 * 后来发现该编码没有对00-FF全覆盖,部分字节没有定义,这就导致转成ISO8859-1之后,部分字节会丢失。
 *
 * 本类处理该问题,在遇到这种字节时,转成特定的内容,然后在转中文之前,再转回来。
 * 具体处理办法:
 * 1 根据ISO8859-1字符集确定特殊字符范围:00-1F,7F-9F,一共65个。
 * 2 遇到这些字节时,先添加前缀,然后将字节+24,形成”前缀+转移后字节“的形式比如,遇到0x01,就转成前缀+0x31。
 * (这个方法一方面为了规范,另一方面,根据ISO8859-1标准,加24之后,刚好等于下移三行,不超FF,也能落到有效字节范围内)
 * 3 转换成中文之前,将”前缀+转移后字节“恢复成原字节。
 */
public class ISO8859ISpecialByte {
    //特殊字符,来自ISO8859-1字符集
    private static String[] specialByteStrsA = {
            "00", "01", "02", "03", "04", "05", "06", "07", "08", "09", "0a", "0b", "0c", "0d", "0e", "0f",
            "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "1a", "1b", "1c", "1d", "1e", "1f",
            "7f",
            "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "8a", "8b", "8c", "8d", "8e", "8f",
            "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "9a", "9b", "9c", "9d", "9e", "9f"
    };
    //字节转换之后的内容,这个在ISO8859-1中都有定义
    private static String[] specialByteStrsB = {
            "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "3a", "3b", "3c", "3d", "3e", "3f",
            "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "4a", "4b", "4c", "4d", "4e", "4f",
            "af",
            "b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8", "b9", "ba", "bb", "bc", "bd", "be", "bf",
            "c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "ca", "cb", "cc", "cd", "ce", "cf"
    };

    //字节前缀,来自rtf标准格式
    private static String sep = "\\\\'";
    //转成iso8859-1之前,添加前缀,搞复杂一些避免后面去除前缀时,误去除
    private static String conversionSuxBeforeToISO = sep + "fb" + sep + "fc" + sep + "fd" + sep + "fe" + sep + "ff";
    //byte内容的前缀,和 conversionSuxBeforeToISO 保持一致
    private static byte[] inverseSuxBeforeToZh = {(byte) 0xfb, (byte) 0xfc, (byte) 0xfd, (byte) 0xfe, (byte) 0xff};

    /**
     *
     * @param rtfStr rtf标准字符串
     * @return 转移后字符串,已经转义了iso8859-1未定义的字节
     */
    public static String conversionStr(String rtfStr) {
        if (rtfStr == null) {
            return null;
        }
        for (int i = 0; i < specialByteStrsA.length; i++) {
            String a = sep + specialByteStrsA[i];
            String b = conversionSuxBeforeToISO + sep + specialByteStrsB[i];
            rtfStr = rtfStr.replaceAll(a, b);
        }
        return rtfStr;
    }

    /**
     *
     * @param data
     * @return
     */
    public static byte[] inverseStr(byte[] data) {
        if (data == null || data.length <= inverseSuxBeforeToZh.length) {
            return data;
        }

        List<Byte> result = new ArrayList<Byte>();
        int i = 0;
        boolean flag = true;
        while (flag) {
            if (i > data.length - 1 - inverseSuxBeforeToZh.length) {
                break;
            }

            if (data[i] == inverseSuxBeforeToZh[0]) {
                //开始匹配conversionStr转义之后的内容
                boolean find = true;
                for (int j = 1; j < inverseSuxBeforeToZh.length - 1; j++) {
                    if (data[i + j] != inverseSuxBeforeToZh[j]) {
                        find = false;
                        break;
                    }
                }
                if (find) {
                    boolean match = false;
                    //匹配成功,准备恢复字节
                    for (int j = 0; j < specialByteStrsA.length; j++) {
                        if (specialByteStrsB[j].equalsIgnoreCase(byteToHexStr(data[i + inverseSuxBeforeToZh.length]))) {
                            int val = Integer.valueOf(specialByteStrsA[j],  16);
                            result.add((byte) val);
                            match = true;
                            break;
                        }
                    }
                    if (match) {
                        i++;
                        i = i + inverseSuxBeforeToZh.length;
                    } else {
                        //匹配上了前缀,但是最后的字节不满足要求,按照没匹配上继续
                        result.add(data[i]);
                        i++;
                    }
                } else {
                    result.add(data[i]);
                    i++;
                }
            } else {
                result.add(data[i]);
                i++;
            }
        }
        //处理最后几个字节
        if (data[data.length - 1 - inverseSuxBeforeToZh.length] != inverseSuxBeforeToZh[0]) {
            for (int j = inverseSuxBeforeToZh.length; j >= 1 ; j--) {
                result.add(data[data.length - j]);
            }
        }
        if (result == null || result.size() == 0) {
            return new byte[]{};
        } else {
            byte[] resultBytes = new byte[result.size()];
            for (int j = 0; j < result.size(); j++) {
                resultBytes[j] = result.get(j);
            }
            return resultBytes;
        }
    }

    public static String byteToHexStr(byte c) {
        int byteValue = c & 0XFF;
        if (c == 0) {
            return "00";
        } else if (c < 16 && c > 0) {
            return "0" + Integer.toHexString(byteValue);
        }
        return Integer.toHexString(byteValue);

    };
}
import java.util.Dictionary;
import java.util.Hashtable;

/**
 * rtf转换时,部分Special Characters无法转义成正确的gbk编码,这里在使用RTFReader解析前,先将这些Special Characters转换成相应的gkb编码
 *
 */
public class TransKey {
    static Dictionary<String, String> textKeywords = null;
    static {
        textKeywords = new Hashtable<String, String>();
        //textKeywords.put("\\",         "\\");
        //textKeywords.put("{",          "{");
        //textKeywords.put("}",          "}");
        //textKeywords.put(" ",          "\u00A0");  /* not in the spec... */
        //textKeywords.put("~",          "\u00A0");  /* nonbreaking space */
        //textKeywords.put("_",          "\u2011");  /* nonbreaking hyphen */
        //textKeywords.put("bullet",     "\u2022");
        //textKeywords.put("emdash",     "\u2014");
        //textKeywords.put("emspace",    "\u2003");
        //textKeywords.put("endash",     "\u2013");
        //textKeywords.put("enspace",    "\u2002");
        textKeywords.put("\\\\ldblquote",  "\\\\'a1\\\\'b0");//
        textKeywords.put("\\\\lquote",     "\\\\'a1\\\\'ae");//
        //textKeywords.put("ltrmark",    "\u200E");
        textKeywords.put("\\\\rdblquote",  "\\\\'a1\\\\'b1");//
        textKeywords.put("\\\\rquote",     "\\\\'a1\\\\'af");//
        //textKeywords.put("rtlmark",    "\u200F");
        //textKeywords.put("tab",        "\u0009");
        //textKeywords.put("zwj",        "\u200D");
        //textKeywords.put("zwnj",       "\u200C");
      /* There is no Unicode equivalent to an optional hyphen, as far as
         I can tell. */
        //textKeywords.put("-",          "\u2027");
    }
}