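I recently worked on a project whose database stores rich-text strings containing Chinese, produced by another project. We needed to convert them to plain Chinese text for processing, so I looked into it, and ran into a few problems after the project went live. I'm sharing them here. This post focuses on how to correctly extract the Chinese text from rich text:
Main problems addressed:
1 How to parse the Chinese text in rich text;
2 How to handle special rich-text formatting;
3 What to do when parsing produces garbled Chinese (not caused by the encoding setting);
For problem 3, see https://www.dazhuanlan.com/2019/08/16/5d55fcd632b0f/
Rich text: abbreviated as RTF, it is a standard that uses specific symbols to represent styles. When a rich-text component loads RTF, it displays the content directly with the corresponding styles.
A typical rich-text string looks like this: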
{\rtf1\ansi\ansicpg936\deff0{\fonttbl{\f0\fnil\fcharset134 \'cb\'ce\'cc\'e5;}{\f1\fnil\fcharset134 \'ba\'da\'cc\'e5;}}
{\colortbl ;\red0\green0\blue0;}
\viewkind4\uc1\pard\cf1\lang2052\ul\f0\fs22 2020\'c4\'ea07\'d4\'c219\'c8\'d506\'ca\'b100\'b7\'d6\'d6\'c1\'c1\'ed\'d3\'d0\'c3\'fc\'c1\'ee\'ca\'b1\'a3\'ac\b\f1\'b3\'c9\'d3\'e5\'cf\'df\'c4\'da\'bd\'ad\'d5\'be\'d6\'c1\'97\'c0\'c4\'be\'d5\'f2\'d5\'be\'bc\'e4220km432m\'d6\'c1230km897m\'b4\'a6\b0\f0\'b0\'b4\'d5\'d5\'a1\'b6\'b3\'c9\'b6\'bc\'be\'d6\'bc\'af\'cd\'c5\'b9\'ab\'cb\'be\'b9\'d8\'d3\'da\'b9\'ab\'b2\'bc\'d1\'b4\'c6\'da\'b7\'c0\'ba\'e9\'cf\'de\'cb\'d9\'cf\'e0\'b9\'d8LKJ\'bb\'f9\'b4\'a1\'ca\'fd\'be\'dd\'b5\'c4\'cd\'a8\'d6\'aa\'a1\'b7\'b3\'c9\'cc\'fa\'bf\'c6\'d0\'c5\'b5\'e7\'a1\'be2020\'a1\'bf231\'ba\'c5\'ce\'c4\'bc\'fe\'d2\'aa\'c7\'f3\'a3\'ac\b\f1\'bf\'cd\'b3\'b5\'cf\'de\'cb\'d960km/h\ulnone\b0\f0\'a1\'a3\'d7\'d42020\'c4\'ea07\'d4\'c219\'c8\'d506\'ca\'b100\'b7\'d6\'c6\'f0\'c7\'b0\'b7\'a22020\'c4\'ea07\'d4\'c215\'c8\'d552089\'ba\'c5\'d4\'cb\'d0\'d0\'bd\'d2\'ca\'be\'c3\'fc\'c1\'ee\'b7\'cf\'d6\'b9\'a1\'a3
\par
\par }
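Method 1: manual parsing
Strip the formatting out of the rich text, keep only the text content, and then convert that content to Chinese. This works, but it has two problems. First, stripping the formatting is tedious, easy to get wrong, and hard to maintain. Second, the approach is non-standard, so any change in the text format can easily cause bugs.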
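To give a feel for the text-conversion half of Method 1, here is a minimal sketch that decodes \'xx hex escapes as GBK. The class name HexEscapeDecoder and its decode method are made up for illustration and are not part of the original project. The sketch assumes the control words and groups have already been stripped, and that any remaining literal characters are plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class HexEscapeDecoder {
    private static final Charset GBK = Charset.forName("GBK");

    // Turns each \'xx escape into the byte xx, copies other (ASCII) characters
    // as single bytes, then decodes the whole byte sequence as GBK.
    static String decode(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\'", i) && i + 4 <= text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                bytes.write(text.charAt(i)); // assumption: ASCII only
                i++;
            }
        }
        return new String(bytes.toByteArray(), GBK);
    }

    public static void main(String[] args) {
        // Fragment taken from the sample RTF above.
        System.out.println(decode("2020\\'c4\\'ea07\\'d4\\'c2"));
    }
}
```

The hard part of Method 1 is not this decoding step but reliably removing every control word, which is why the post moves on to Method 2.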
Method 2: use Swing's rich-text component to help with the parsing. The code is as follows:
public static String parseRtf(String rtfJieshi) {
String result = "";
try {
System.out.println(rtfJieshi);
//Replace special style symbols
//rtfJieshi = transSpecialCharactersToGBK(rtfJieshi);
//Replace special bytes
//rtfJieshi = transSpecialISO(rtfJieshi);
byte[] b = rtfJieshi .getBytes();
DefaultStyledDocument styledDoc = new DefaultStyledDocument();
new RTFEditorKit().read(new ByteArrayInputStream(b), styledDoc, 0);
byte[] data = styledDoc.getText(0,styledDoc.getLength()).getBytes("ISO8859_1");
//data = inverseSpecial(data);
//Note: the bytes must be decoded as GBK here
result = new String(data,"GBK");
} catch (Exception e) {
e.printStackTrace();
}
return result;
}
This method uses components that ship with Java Swing and handles the vast majority of cases.
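The step that matters most in the method above is the ISO8859_1/GBK round trip. It relies on the assumption that the text coming out of the document holds one char per original byte: encoding it as ISO8859-1 recovers the raw bytes, which are then decoded as GBK. Here is a minimal sketch of just that step, without Swing. The class name Latin1ToGbk and the method redecode are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Latin1ToGbk {
    // Assumes each char of the input carries exactly one byte value (0x00-0xFF).
    static String redecode(String oneCharPerByte) {
        byte[] raw = oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // The two chars U+00C4 U+00EA stand for the GBK byte pair C4 EA.
        System.out.println(redecode("\u00c4\u00ea"));
    }
}
```

If a char in the input cannot be represented in ISO8859-1, getBytes substitutes a replacement byte, so the byte sequence no longer matches the original and the GBK decoding goes wrong from that point on.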
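Along the way I found that Swing's rich-text component does not recognize some characters, so I wrote a converter that turns these style symbols into GBK-encoded escapes (it actually converts only four symbols, and the others are reserved for later; I convert to GBK only because my data is GBK-encoded):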
import java.util.Dictionary;
import java.util.Hashtable;
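/**
* During RTF conversion, some Special Characters cannot be turned into the correct GBK encoding, so before parsing with RTFReader these Special Characters are converted to the corresponding GBK escapes
*
*/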
public class ZZCTQTransKey {
static Dictionary<String, String> textKeywords = null;
static {
textKeywords = new Hashtable<String, String>();
//textKeywords.put("\\", "\\");
//textKeywords.put("{", "{");
//textKeywords.put("}", "}");
//textKeywords.put(" ", "\u00A0"); /* not in the spec... */
//textKeywords.put("~", "\u00A0"); /* nonbreaking space */
//textKeywords.put("_", "\u2011"); /* nonbreaking hyphen */
//textKeywords.put("bullet", "\u2022");
//textKeywords.put("emdash", "\u2014");
//textKeywords.put("emspace", "\u2003");
//textKeywords.put("endash", "\u2013");
//textKeywords.put("enspace", "\u2002");
textKeywords.put("\\\\ldblquote", "\\\\'a1\\\\'b0");//
textKeywords.put("\\\\lquote", "\\\\'a1\\\\'ae");//
//textKeywords.put("ltrmark", "\u200E");
textKeywords.put("\\\\rdblquote", "\\\\'a1\\\\'b1");//
textKeywords.put("\\\\rquote", "\\\\'a1\\\\'af");//
//textKeywords.put("rtlmark", "\u200F");
//textKeywords.put("tab", "\u0009");
//textKeywords.put("zwj", "\u200D");
//textKeywords.put("zwnj", "\u200C");
/* There is no Unicode equivalent to an optional hyphen, as far as
I can tell. */
//textKeywords.put("-", "\u2027");
}
}
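The quadruple backslashes in the table above are needed because String.replaceAll treats the key as a regular expression and the value as a replacement pattern. As an alternative sketch, String.replace does literal replacement, so the same four mappings can be written with ordinary Java string escaping. The class QuoteEscaper below is hypothetical and covers only these four control words.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical alternative to the regex-based table, for illustration only.
final class QuoteEscaper {
    private static final Map<String, String> QUOTES = new LinkedHashMap<>();
    static {
        // Control word -> GBK bytes of the quote character, written as \'xx escapes.
        QUOTES.put("\\ldblquote", "\\'a1\\'b0");
        QUOTES.put("\\lquote", "\\'a1\\'ae");
        QUOTES.put("\\rdblquote", "\\'a1\\'b1");
        QUOTES.put("\\rquote", "\\'a1\\'af");
    }

    // Literal replacement: no regex or replacement-pattern escaping involved.
    static String escapeQuotes(String rtf) {
        if (rtf == null || rtf.isEmpty()) {
            return "";
        }
        for (Map.Entry<String, String> e : QUOTES.entrySet()) {
            rtf = rtf.replace(e.getKey(), e.getValue());
        }
        return rtf;
    }

    public static void main(String[] args) {
        System.out.println(escapeQuotes("\\ldblquote hi\\rdblquote"));
    }
}
```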
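Later on, garbled text showed up again. Most of the text was correct, but occasionally it came out garbled. On closer inspection, the byte[] array just before the conversion to Chinese was missing bytes compared with the original. The cause turned out to be ISO8859-1: some ranges of ISO8859-1 have no characters assigned, and bytes in those ranges were discarded:
See https://de.wikipedia.org/wiki/ISO_8859-1 for its code chart.
In that chart, the two ranges 00-1F and 7F-9F have no corresponding characters, and in my case bytes in these ranges were dropped during the ISO8859-1 step. The text that triggered it was the three characters 椑木镇. The problem was 椑, whose GBK encoding is 97C0. Its first byte, 97, falls right into the unassigned range above, so after the conversion only C0 was left of 椑. C0 was then automatically paired with the following byte, which produced a run of garbled or rare characters.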
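The two ranges are easy to express as a predicate. The sketch below follows only the ranges described in this post, not any JDK API, and the class name Latin1Gaps is made up for illustration. It checks that 0x97 (the first byte of 椑) is in an unassigned range while 0xC0 is not, and counts the affected byte values.

```java
// Hypothetical helper, for illustration only.
final class Latin1Gaps {
    // True for byte values in 00-1F or 7F-9F, the ranges named in the post.
    static boolean isUnassigned(int b) {
        int v = b & 0xFF;
        return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
    }

    static int countUnassigned() {
        int count = 0;
        for (int v = 0; v <= 0xFF; v++) {
            if (isUnassigned(v)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("0x97 unassigned: " + isUnassigned(0x97));
        System.out.println("0xC0 unassigned: " + isUnassigned(0xC0));
        System.out.println("total: " + countUnassigned());
    }
}
```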
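Solution: when a byte in 00-1F or 7F-9F is encountered, manually replace it with bytes that ISO8859-1 can represent, and then swap it back before converting to Chinese.
My concrete approach: on encountering 01, I move it down three rows in the code chart, which means adding 0x30 (the offset is arbitrary; looking at the chart, three rows seemed convenient). That turns 01 into 31. I then prepend the prefix fbfcfdfeff, giving fbfcfdfeff31 (this is purely my own idea, so feel free to criticize it). After this conversion, the Swing component parses the content and leaves only the text. At that point every byte of fbfcfdfeff31 is recognized by ISO8859-1 and is preserved. Turning it back into 01 restores the complete byte sequence, and the last step is to convert to Chinese.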
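The escaping scheme can be sketched at the byte level as a round trip. This is a simplified illustration, not the author's code: the post applies the escaping to the \'xx escapes in the RTF text, while the hypothetical ShiftEscapeDemo class below works directly on byte arrays so the +0x30 shift and the fbfcfdfeff prefix are easy to see.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical byte-level illustration of the "prefix + shifted byte" idea.
final class ShiftEscapeDemo {
    private static final byte[] PREFIX =
            {(byte) 0xfb, (byte) 0xfc, (byte) 0xfd, (byte) 0xfe, (byte) 0xff};

    static boolean needsEscape(int v) {
        return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
    }

    // Replaces each byte in 00-1F / 7F-9F with PREFIX followed by (byte + 0x30).
    static byte[] escape(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : in) {
            int v = b & 0xFF;
            if (needsEscape(v)) {
                out.write(PREFIX, 0, PREFIX.length);
                out.write(v + 0x30);
            } else {
                out.write(v);
            }
        }
        return out.toByteArray();
    }

    // Restores PREFIX + shifted byte to the original byte; copies everything else.
    static byte[] unescape(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            if (hasPrefixAt(in, i) && i + PREFIX.length < in.length) {
                int shifted = in[i + PREFIX.length] & 0xFF;
                if (shifted >= 0x30 && needsEscape(shifted - 0x30)) {
                    out.write(shifted - 0x30);
                    i += PREFIX.length + 1;
                    continue;
                }
            }
            out.write(in[i] & 0xFF);
            i++;
        }
        return out.toByteArray();
    }

    private static boolean hasPrefixAt(byte[] in, int at) {
        if (at + PREFIX.length > in.length) {
            return false;
        }
        for (int j = 0; j < PREFIX.length; j++) {
            if (in[at + j] != PREFIX[j]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, the GBK bytes 97 C0 of 椑 become fb fc fd fe ff c7 c0 after escape, and unescape turns them back into 97 C0. The scheme is ambiguous if the original data already contains the prefix followed by a byte in 30-4F or AF-CF, which is presumably why the prefix was made long.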
The full code is as follows:
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.ByteArrayInputStream;
import java.util.Enumeration;
public class RtfToZh {
public static String parseRtf(String rtfJieshi) {
String result = "";
try {
System.out.println(rtfJieshi);
//Replace special style symbols (uncomment to enable)
//rtfJieshi = transSpecialCharactersToGBK(rtfJieshi);
//Replace special bytes (uncomment together with inverseSpecial below)
//rtfJieshi = transSpecialISO(rtfJieshi);
byte[] b = rtfJieshi.getBytes();
DefaultStyledDocument styledDoc = new DefaultStyledDocument();
new RTFEditorKit().read(new ByteArrayInputStream(b), styledDoc, 0);
byte[] data = styledDoc.getText(0,styledDoc.getLength()).getBytes("ISO8859_1");
//data = inverseSpecial(data);
//Note: the bytes must be decoded as GBK here
result = new String(data,"GBK");
} catch (Exception e) {
e.printStackTrace();
}
return result;
}
/**
* Converts Special Characters to GBK escapes
* @return the RTF string with the special control words replaced
*/
private static String transSpecialCharactersToGBK(String rtfStr) {
if (rtfStr != null && rtfStr.length() > 0) {
for(Enumeration e = TransKey.textKeywords.keys(); e.hasMoreElements();){
String thisName=e.nextElement().toString();
String thisValue= TransKey.textKeywords.get(thisName);
rtfStr = rtfStr.replaceAll(thisName, thisValue);
}
return rtfStr;
} else {
return "";
}
}
/**
* Escapes the bytes that ISO8859-1 does not define
* @return the escaped RTF string
*/
private static String transSpecialISO(String rtfStr) {
return ISO8859ISpecialByte.conversionStr(rtfStr);
}
/**
* Reverses the escaping of the bytes that ISO8859-1 does not define
* @return the bytes with the original values restored
*/
private static byte[] inverseSpecial(byte[] data) {
return ISO8859ISpecialByte.inverseStr(data);
}
public static void main(String[] args) throws Exception{
String rtfStr = "{\\rtf1\\ansi\\ansicpg936\\deff0{\\fonttbl{\\f0\\fnil\\fcharset134 \\'cb\\'ce\\'cc\\'e5;}{\\f1\\fnil\\fcharset134 \\'ba\\'da\\'cc\\'e5;}}\n" +
"{\\colortbl ;\\red0\\green0\\blue0;}\n" +
"\\viewkind4\\uc1\\pard\\cf1\\lang2052\\ul\\f0\\fs22 2020\\'c4\\'ea07\\'d4\\'c219\\'c8\\'d506\\'ca\\'b100\\'b7\\'d6\\'d6\\'c1\\'c1\\'ed\\'d3\\'d0\\'c3\\'fc\\'c1\\'ee\\'ca\\'b1\\'a3\\'ac\\b\\f1\\'b3\\'c9\\'d3\\'e5\\'cf\\'df\\'c4\\'da\\'bd\\'ad\\'d5\\'be\\'d6\\'c1\\'97\\'c0\\'c4\\'be\\'d5\\'f2\\'d5\\'be\\'bc\\'e4220km432m\\'d6\\'c1230km897m\\'b4\\'a6\\b0\\f0\\'b0\\'b4\\'d5\\'d5\\'a1\\'b6\\'b3\\'c9\\'b6\\'bc\\'be\\'d6\\'bc\\'af\\'cd\\'c5\\'b9\\'ab\\'cb\\'be\\'b9\\'d8\\'d3\\'da\\'b9\\'ab\\'b2\\'bc\\'d1\\'b4\\'c6\\'da\\'b7\\'c0\\'ba\\'e9\\'cf\\'de\\'cb\\'d9\\'cf\\'e0\\'b9\\'d8LKJ\\'bb\\'f9\\'b4\\'a1\\'ca\\'fd\\'be\\'dd\\'b5\\'c4\\'cd\\'a8\\'d6\\'aa\\'a1\\'b7\\'b3\\'c9\\'cc\\'fa\\'bf\\'c6\\'d0\\'c5\\'b5\\'e7\\'a1\\'be2020\\'a1\\'bf231\\'ba\\'c5\\'ce\\'c4\\'bc\\'fe\\'d2\\'aa\\'c7\\'f3\\'a3\\'ac\\b\\f1\\'bf\\'cd\\'b3\\'b5\\'cf\\'de\\'cb\\'d960km/h\\ulnone\\b0\\f0\\'a1\\'a3\\'d7\\'d42020\\'c4\\'ea07\\'d4\\'c219\\'c8\\'d506\\'ca\\'b100\\'b7\\'d6\\'c6\\'f0\\'c7\\'b0\\'b7\\'a22020\\'c4\\'ea07\\'d4\\'c215\\'c8\\'d552089\\'ba\\'c5\\'d4\\'cb\\'d0\\'d0\\'bd\\'d2\\'ca\\'be\\'c3\\'fc\\'c1\\'ee\\'b7\\'cf\\'d6\\'b9\\'a1\\'a3\n" +
"\\par \n" +
"\\par }";
System.out.println(parseRtf(rtfStr));
}
}
import java.util.ArrayList;
import java.util.List;
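/**
* When converting RTF to Chinese, the ISO8859-1 encoding is used as an intermediate step. It is a single-byte encoding and was expected not to lose bytes.
* It later turned out that this encoding does not cover all of 00-FF: some bytes are undefined, so they were lost after the conversion to ISO8859-1.
*
* This class handles that problem. Such bytes are converted to a specific sequence, and converted back before the final conversion to Chinese.
* Approach:
* 1 Determine the special byte ranges from the ISO8859-1 character set: 00-1F and 7F-9F, 65 bytes in total.
* 2 On encountering one of these bytes, add a prefix and then add 0x30 to the byte, producing "prefix + shifted byte". For example, 0x01 becomes prefix + 0x31.
* (Adding 0x30 keeps the rule regular. In the ISO8859-1 chart it is exactly three rows down, never exceeds FF, and always lands in the defined range.)
* 3 Before converting to Chinese, restore "prefix + shifted byte" to the original byte.
*/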
public class ISO8859ISpecialByte {
//Special bytes, taken from the ISO8859-1 character set
private static String[] specialByteStrsA = {
"00", "01", "02", "03", "04", "05", "06", "07", "08", "09", "0a", "0b", "0c", "0d", "0e", "0f",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "1a", "1b", "1c", "1d", "1e", "1f",
"7f",
"80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "8a", "8b", "8c", "8d", "8e", "8f",
"90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "9a", "9b", "9c", "9d", "9e", "9f"
};
//The shifted values of those bytes; all of these are defined in ISO8859-1
private static String[] specialByteStrsB = {
"30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "3a", "3b", "3c", "3d", "3e", "3f",
"40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "4a", "4b", "4c", "4d", "4e", "4f",
"af",
"b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8", "b9", "ba", "bb", "bc", "bd", "be", "bf",
"c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "ca", "cb", "cc", "cd", "ce", "cf"
};
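//Byte escape prefix, from the RTF format (written so it works as both regex and replacement in replaceAll)
private static String sep = "\\\\'";
//Prefix added before converting to ISO8859-1; deliberately long so that it is not removed by mistake later
private static String conversionSuxBeforeToISO = sep + "fb" + sep + "fc" + sep + "fd" + sep + "fe" + sep + "ff";
//The same prefix as raw bytes; must stay consistent with conversionSuxBeforeToISO
private static byte[] inverseSuxBeforeToZh = {(byte) 0xfb, (byte) 0xfc, (byte) 0xfd, (byte) 0xfe, (byte) 0xff};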
/**
*
* @param rtfStr a standard RTF string
* @return the escaped string, in which the bytes undefined in ISO8859-1 have been escaped
*/
public static String conversionStr(String rtfStr) {
if (rtfStr == null) {
return null;
}
for (int i = 0; i < specialByteStrsA.length; i++) {
String a = sep + specialByteStrsA[i];
String b = conversionSuxBeforeToISO + sep + specialByteStrsB[i];
rtfStr = rtfStr.replaceAll(a, b);
}
return rtfStr;
}
/**
* Restores the bytes escaped by conversionStr.
* @param data the bytes extracted from the parsed document
* @return the bytes with every "prefix + shifted byte" sequence replaced by the original byte
*/
public static byte[] inverseStr(byte[] data) {
if (data == null || data.length <= inverseSuxBeforeToZh.length) {
return data;
}
List<Byte> result = new ArrayList<Byte>();
int prefixLen = inverseSuxBeforeToZh.length;
int i = 0;
while (i < data.length) {
//A full escape sequence needs the whole prefix plus one shifted byte
boolean find = i + prefixLen < data.length;
for (int j = 0; find && j < prefixLen; j++) {
if (data[i + j] != inverseSuxBeforeToZh[j]) {
find = false;
}
}
boolean match = false;
if (find) {
//Prefix matched: look up the shifted byte and restore the original byte
String hex = byteToHexStr(data[i + prefixLen]);
for (int j = 0; j < specialByteStrsB.length; j++) {
if (specialByteStrsB[j].equalsIgnoreCase(hex)) {
result.add((byte) Integer.parseInt(specialByteStrsA[j], 16));
match = true;
break;
}
}
}
if (match) {
//Skip the prefix and the shifted byte
i += prefixLen + 1;
} else {
//No prefix, or the byte after the prefix is not a shifted byte: copy as-is
result.add(data[i]);
i++;
}
}
byte[] resultBytes = new byte[result.size()];
for (int j = 0; j < result.size(); j++) {
resultBytes[j] = result.get(j);
}
return resultBytes;
}
public static String byteToHexStr(byte c) {
int byteValue = c & 0XFF;
if (c == 0) {
return "00";
} else if (c < 16 && c > 0) {
return "0" + Integer.toHexString(byteValue);
}
return Integer.toHexString(byteValue);
};
}
import java.util.Dictionary;
import java.util.Hashtable;
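/**
* During RTF conversion, some Special Characters cannot be turned into the correct GBK encoding, so before parsing with RTFReader these Special Characters are converted to the corresponding GBK escapes
*
*/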
public class TransKey {
static Dictionary<String, String> textKeywords = null;
static {
textKeywords = new Hashtable<String, String>();
//textKeywords.put("\\", "\\");
//textKeywords.put("{", "{");
//textKeywords.put("}", "}");
//textKeywords.put(" ", "\u00A0"); /* not in the spec... */
//textKeywords.put("~", "\u00A0"); /* nonbreaking space */
//textKeywords.put("_", "\u2011"); /* nonbreaking hyphen */
//textKeywords.put("bullet", "\u2022");
//textKeywords.put("emdash", "\u2014");
//textKeywords.put("emspace", "\u2003");
//textKeywords.put("endash", "\u2013");
//textKeywords.put("enspace", "\u2002");
textKeywords.put("\\\\ldblquote", "\\\\'a1\\\\'b0");//
textKeywords.put("\\\\lquote", "\\\\'a1\\\\'ae");//
//textKeywords.put("ltrmark", "\u200E");
textKeywords.put("\\\\rdblquote", "\\\\'a1\\\\'b1");//
textKeywords.put("\\\\rquote", "\\\\'a1\\\\'af");//
//textKeywords.put("rtlmark", "\u200F");
//textKeywords.put("tab", "\u0009");
//textKeywords.put("zwj", "\u200D");
//textKeywords.put("zwnj", "\u200C");
/* There is no Unicode equivalent to an optional hyphen, as far as
I can tell. */
//textKeywords.put("-", "\u2027");
}
}