Unicode -- How to compute UTF-16 from a code point

Original post by h2appy, 2009-04-01 13:46:11, 51CTO blog. Copyright belongs to the author; contact the author for permission to reprint.

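UTF-16 is what people usually mean when they say "Unicode". Calling UTF-16 "Unicode" is not really appropriate, though, and it tends to cause confusion, because Unicode is a character set, not a concrete storage encoding.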

UTF-16 is a variable-length encoding: each code point is stored as either one or two 16-bit code units.

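For example, take the character whose Unicode code point is 2F92B. Saved as UTF-16 (what Notepad on Windows XP calls "Unicode"), it becomes the bytes 7E D8 2B DD; in big-endian order it would be D8 7E DD 2B. Where do these values come from?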
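
The byte sequences for U+2F92B can be checked with the .NET encoders. This is only a small sketch: the class name Utf16ByteOrderDemo and the helper ToHexBytes are made up for illustration, while char.ConvertFromUtf32, Encoding.Unicode (little endian) and Encoding.BigEndianUnicode are standard .NET APIs.

```csharp
using System;
using System.Text;

static class Utf16ByteOrderDemo
{
    // Hypothetical helper: returns the UTF-16 bytes of one code point
    // as space-separated hex, in the requested byte order.
    public static string ToHexBytes(int codePoint, bool bigEndian)
    {
        string s = char.ConvertFromUtf32(codePoint);   // one or two UTF-16 code units
        Encoding enc = bigEndian ? Encoding.BigEndianUnicode : Encoding.Unicode;
        byte[] bytes = enc.GetBytes(s);                // GetBytes does not emit a byte order mark
        return BitConverter.ToString(bytes).Replace('-', ' ');
    }

    static void Main()
    {
        Console.WriteLine(ToHexBytes(0x2F92B, false)); // 7E D8 2B DD
        Console.WriteLine(ToHexBytes(0x2F92B, true));  // D8 7E DD 2B
    }
}
```

The code units are the same in both cases (D87E DD2B); only the order of the two bytes inside each 16-bit unit changes.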

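For Unicode characters in the range 0-FFFF, UTF-16 uses a single two-byte code unit whose value is the code point itself. For Unicode characters in the range 10000-10FFFF, UTF-16 uses a surrogate pair, that is, two 16-bit code units. The conversion works as follows: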

Below, the Unicode character with code point U+64321 (hexadecimal) is encoded as UTF-16. Because it is greater than U+FFFF, it has to be encoded as a surrogate pair:

v  = 0x64321
v′ = v - 0x10000
   = 0x54321
   = 0101 0100 0011 0010 0001

vh = 0101010000 // higher 10 bits of v′
vl = 1100100001 // lower  10 bits of v′
w1 = 0xD800 // the resulting 1st word is initialized with the high bits
w2 = 0xDC00 // the resulting 2nd word is initialized with the low bits

w1 = w1 | vh
   = 1101 1000 0000 0000 |
            01 0101 0000
   = 1101 1001 0101 0000
   = 0xD950

w2 = w2 | vl
   = 1101 1100 0000 0000 |
            11 0010 0001
   = 1101 1111 0010 0001
   = 0xDF21
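
One way to sanity-check the worked example is to run the steps in reverse and see whether the original code point comes back. The sketch below assumes a hypothetical class SurrogateDecodeDemo with a helper Decode; it compares the result with the built-in char.ConvertToUtf32.

```csharp
using System;

static class SurrogateDecodeDemo
{
    // Hypothetical helper: rebuilds a code point from a surrogate pair
    // by undoing the encoding steps (mask, shift, add 0x10000).
    public static int Decode(int w1, int w2)
    {
        if (w1 < 0xD800 || w1 > 0xDBFF) throw new ArgumentOutOfRangeException(nameof(w1));
        if (w2 < 0xDC00 || w2 > 0xDFFF) throw new ArgumentOutOfRangeException(nameof(w2));
        int vh = w1 & 0x3FF;                 // higher 10 bits of v'
        int vl = w2 & 0x3FF;                 // lower 10 bits of v'
        return ((vh << 10) | vl) + 0x10000;  // v = v' + 0x10000
    }

    static void Main()
    {
        int v = Decode(0xD950, 0xDF21);
        Console.WriteLine(v.ToString("X"));  // 64321
        // Cross-check against the framework's own decoder.
        Console.WriteLine(v == char.ConvertToUtf32((char)0xD950, (char)0xDF21)); // True
    }
}
```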

Detailed description:

The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (BMP).

UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First 10000₁₆ is subtracted from the code point to give a 20-bit value. This is then split into two separate 10-bit values each of which is represented as a surrogate with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00-0xDFFF for the second, least significant surrogate.

For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the last code point available for characters (U+10FFFE and U+10FFFF are noncharacters), becomes the sequence 0xDBFF 0xDFFD. Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character.
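
The two boundary examples and the reserved surrogate range can be reproduced with a few standard .NET calls. This is a rough sketch; the class SurrogateRangeDemo and the helper CodeUnits are illustrative names, not part of any library.

```csharp
using System;

static class SurrogateRangeDemo
{
    // Hypothetical helper: formats the UTF-16 code units of one code point as hex words.
    public static string CodeUnits(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        string[] words = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            words[i] = ((int)s[i]).ToString("X4");
        return string.Join(" ", words);
    }

    static void Main()
    {
        Console.WriteLine(CodeUnits(0x10000));   // D800 DC00
        Console.WriteLine(CodeUnits(0x10FFFD));  // DBFF DFFD
        Console.WriteLine(char.IsHighSurrogate((char)0xDBFF)); // True
        Console.WriteLine(char.IsLowSurrogate((char)0xDC00));  // True

        // A surrogate value on its own is not a valid code point, so this throws.
        try { char.ConvertFromUtf32(0xD800); }
        catch (ArgumentOutOfRangeException) { Console.WriteLine("U+D800 rejected"); }
    }
}
```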

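The calculation above can be done with the scientific mode of the calculator that ships with Windows, or of course you can write a small program yourself :)
To type characters in the range 10000-10FFFF, you can use the Microsoft Pinyin IME, which has a feature for entering characters by their Unicode code.
To display the Chinese characters among them, you can install Unifont; see the 海峰五笔 (Haifeng Wubi) website.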

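For background on encodings, search Google for the article series "Java中的字符集编码入门" (an introduction to character set encodings in Java); it is very well written.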

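One caveat:
On the .NET Framework, suppose a string variable name contains two characters, one in the range 0-FFFF and the other in the range 10000-10FFFF. The length of name will be 3, not 2, because Length counts 16-bit UTF-16 code units and name occupies 6 bytes.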
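
The length behavior is easy to observe directly. The sketch below uses a hypothetical class StringLengthDemo with a helper CountCodePoints that skips the second half of each surrogate pair; the sample string "A" plus U+2F92B is just an assumed test value.

```csharp
using System;
using System.Text;

static class StringLengthDemo
{
    // Hypothetical helper: counts code points instead of UTF-16 code units.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++;  // step over the low surrogate
            count++;
        }
        return count;
    }

    static void Main()
    {
        string name = "A" + char.ConvertFromUtf32(0x2F92B);
        Console.WriteLine(name.Length);                          // 3 code units
        Console.WriteLine(Encoding.Unicode.GetByteCount(name));  // 6 bytes
        Console.WriteLine(CountCodePoints(name));                // 2 code points
    }
}
```

Code that indexes into a string one char at a time should therefore be prepared to meet half of a surrogate pair.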

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

namespace CodePoint2UTF16
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Reads a hexadecimal code point from tbUnicodeCodePoint and
        // writes its UTF-16 code unit(s), in hex, to tbUTF16Code.
        private void btnConvert_Click(object sender, EventArgs e)
        {
            String cp = tbUnicodeCodePoint.Text.Trim();
            try
            {
                int n = Convert.ToInt32(cp, 16);
                if (n < 0 || n > 0x10FFFF)
                {
                    MessageBox.Show(cp + " is not in 0x0 - 0x10FFFF");
                    return;
                }
                if (n < 0x10000)
                {
                    // 0-FFFF: the single code unit equals the code point.
                    tbUTF16Code.Text = Convert.ToString(n, 16);
                    return;
                }
                else
                {
                    n -= 0x10000;       // 20-bit value
                    int h = n >> 10;    // higher 10 bits
                    int l = n & 0x3FF;  // lower 10 bits
                    h |= 0xD800;        // first (high) surrogate
                    l |= 0xDC00;        // second (low) surrogate
                    tbUTF16Code.Text = Convert.ToString(h, 16) + " " + Convert.ToString(l, 16);
                }
            }
            catch (Exception ex)
            {
                // Convert.ToInt32 throws on text that is not valid hexadecimal.
                MessageBox.Show("Invalid text: " + cp + Environment.NewLine + ex.Message);
            }
        }
    }
}
