Detecting Whether a Byte Stream Is UTF-8 Encoded

Reposted

mob60475702a1ff 2016-07-25 17:56:00


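A few days ago I happened to see a forum post asking "how to automatically detect whether the Chinese parameters in a URL are encoded as GB2312 or UTF-8".

I also read wcwtitxu's algorithm, which detects UTF-8 encoding with a very impressive regular expression.

However, a regular expression built from a long chain of alternations does not perform well in practice.

I had a similar requirement in a past project, so I am posting the approach and the tidied-up source code here for reference.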

First, the principle:

The UTF-8 encoding rules are shown in the table below.

[Figure: table of UTF-8 encoding rules]

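It looks complicated, but it boils down to the following:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each.

All other characters are encoded as follows: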

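• In binary, the first byte consists of n ones followed by a single zero (n >= 2). The bits after the zero store the high bits of the actual character code, and n is the total number of bytes in the multi-byte sequence, including the first byte.
• It is followed by n - 1 bytes that each start with 10; the low 6 bits of each store the remaining bits of the character code.
So by scanning the entire byte stream against these rules, we can judge whether it is structurally valid UTF-8.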
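To make the bit layout concrete, here is a small sketch that prints the UTF-8 bytes of the string "A中" in binary, together with the number of leading 1 bits in each byte. The class name `Utf8LayoutSketch` and the helpers `LeadingOnes` and `ToBinary` are made up for this illustration and are not part of the code in this post.

```csharp
using System;
using System.Text;

public static class Utf8LayoutSketch
{
    // Hypothetical helper: counts the consecutive 1 bits at the top of a byte.
    // 0 => ASCII, 1 => continuation byte, n >= 2 => lead byte of an n-byte sequence.
    public static int LeadingOnes(byte b)
    {
        int n = 0;
        while (n < 8 && (b & (0x80 >> n)) != 0)
        {
            n++;
        }
        return n;
    }

    // Hypothetical helper: renders a byte as 8 binary digits.
    public static string ToBinary(byte b)
    {
        return Convert.ToString(b, 2).PadLeft(8, '0');
    }

    public static void Main()
    {
        // 'A' is U+0041 (one byte); '中' is U+4E2D (three bytes: E4 B8 AD).
        foreach (byte b in Encoding.UTF8.GetBytes("A中"))
        {
            Console.WriteLine("{0}  leading ones: {1}", ToBinary(b), LeadingOnes(b));
        }
    }
}
```

For `A` this shows `01000001` with no leading ones. For `中` it shows `11100100` (three leading ones, so a 3-byte sequence) followed by the two continuation bytes `10111000` and `10101101`, each with exactly one leading one.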

Based on these rules, my C# code is as follows:

```csharp
public static class EncodingHelper
{
    /// <summary>
    /// Determines whether the given <paramref name="inputStream"/> contains UTF-8 encoded bytes.
    /// </summary>
    /// <param name="inputStream">
    /// The input stream.
    /// </param>
    /// <returns>
    /// <see langword="true"/> if the given byte stream is UTF-8 encoded; otherwise, <see langword="false"/>.
    /// </returns>
    /// <remarks>
    /// Input that consists only of ASCII characters is not regarded as UTF-8.
    /// </remarks>
    public static bool IsTextUTF8(ref byte[] inputStream)
    {
        // Number of continuation bytes still expected for the current character.
        int encodingBytesCount = 0;
        bool allTextsAreASCIIChars = true;

        for (int i = 0; i < inputStream.Length; i++)
        {
            byte current = inputStream[i];

            if ((current & 0x80) == 0x80)
            {
                allTextsAreASCIIChars = false;
            }

            // First byte
            if (encodingBytesCount == 0)
            {
                if ((current & 0x80) == 0)
                {
                    // ASCII chars, from 0x00-0x7F
                    continue;
                }

                if ((current & 0xC0) == 0xC0)
                {
                    encodingBytesCount = 1;
                    current <<= 2;

                    // More than two bytes may be used to encode a Unicode char.
                    // Count the remaining leading 1 bits to get the real length.
                    while ((current & 0x80) == 0x80)
                    {
                        current <<= 1;
                        encodingBytesCount++;
                    }

                    // RFC 3629 limits a UTF-8 sequence to 4 bytes,
                    // so lead bytes 0xF8-0xFF are rejected.
                    if (encodingBytesCount > 3)
                    {
                        return false;
                    }
                }
                else
                {
                    // Invalid bit structure for the UTF-8 encoding rule.
                    return false;
                }
            }
            else
            {
                // Following bytes must start with 10.
                if ((current & 0xC0) == 0x80)
                {
                    encodingBytesCount--;
                }
                else
                {
                    // Invalid bit structure for the UTF-8 encoding rule.
                    return false;
                }
            }
        }

        if (encodingBytesCount != 0)
        {
            // Invalid bit structure for the UTF-8 encoding rule:
            // the stream ended in the middle of a multi-byte sequence.
            return false;
        }

        // Although UTF-8 can represent ASCII chars, an input stream whose contents
        // are all ASCII is treated as being in the default encoding.
        return !allTextsAreASCIIChars;
    }
}
```

Here is the unit test code as well:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
/// This is a test class for EncodingHelperTest and is intended
/// to contain all EncodingHelperTest Unit Tests
/// </summary>
[TestClass()]
public class EncodingHelperTest
{
    /// <summary>
    /// Normal test for this method.
    /// </summary>
    [TestMethod()]
    public void IsTextUTF8Test()
    {
        for (int i = 0; i < 1000; i++)
        {
            List<char> chars = new List<char>();
            // Guarantees at least one non-ASCII character in every sample.
            chars.Add('中');
            List<UnicodeCategory> temp = new List<UnicodeCategory>();
            Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

            for (int j = 0; j < 255; j++)
            {
                char ch = (char)rd.Next(0xFFFF);
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);

                if (uc == UnicodeCategory.Surrogate ||      // A single surrogate cannot be encoded correctly.
                    uc == UnicodeCategory.PrivateUse ||     // Private use blocks should be excluded.
                    uc == UnicodeCategory.OtherNotAssigned)
                {
                    // Retry this slot with another random character.
                    j--;
                }
                else
                {
                    chars.Add(ch);
                    temp.Add(uc);
                }
            }

            string str = new string(chars.ToArray());

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = true;
            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));

            // Code page 932 is Shift-JIS. On .NET Core / .NET 5+ it is only available after
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.
            inputStream = Encoding.GetEncoding(932).GetBytes(str);
            expected = false;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
        }
    }

    /// <summary>
    /// Check with all ASCII chars.
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";
        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool expected = false;
        bool actual;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    }
}
```
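As an independent cross-check for a hand-written scanner, one option is to let .NET's own strict decoder judge the bytes. The sketch below is a different technique from the one in this post: it builds a `UTF8Encoding` with `throwOnInvalidBytes` set to `true` and treats a `DecoderFallbackException` as "not UTF-8". The names `StrictUtf8Check` and `DecodesAsUtf8` are hypothetical, and the GB2312 sample bytes for "中文" are written out by hand so that no extra code page provider is needed.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // throwOnInvalidBytes: true makes the decoder throw on malformed input
    // instead of silently substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    // Unlike IsTextUTF8 above, pure ASCII input also returns true here.
    public static bool DecodesAsUtf8(byte[] bytes)
    {
        try
        {
            Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("中文");
        byte[] gb2312 = { 0xD6, 0xD0, 0xCE, 0xC4 }; // "中文" encoded as GB2312

        Console.WriteLine(DecodesAsUtf8(utf8));   // True
        Console.WriteLine(DecodesAsUtf8(gb2312)); // False
    }
}
```

This route also rejects overlong forms and encoded surrogates, which a purely structural scan does not look at. The trade-off is that it decodes the whole input and relies on exception handling for the negative case, so it may be slower on input that is frequently not UTF-8.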