Notes: Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Cha

Reposted

mb5fe5608dce902 2017-04-19 18:09:00


What is a Unicode string?

The bytes in RAM have the final word. What matters is NOT how a string literal looks in a text editor (VS, Emacs), but the bytes stored in the executable binary (in the .str section) or in a binary data file (such as a cookie cache file in some encoding), which are then loaded into variables and data structures such as std::string.

The form of a string literal only tells the compiler which encoding to emit: narrow bytes (UTF-8, for instance) or UTF-16 ( L / _T() ). The resulting bytes are what end up in the .str section of the executable file image on disk.
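To make the point above concrete, here is a minimal sketch. It uses Python rather than C++ only because the raw bytes are easy to inspect there; the sample text and the helper name `bytes_in_memory` are assumptions made for illustration. The same text becomes a different byte sequence under each encoding, and only those bytes exist at run time.

```python
def bytes_in_memory(text: str, encoding: str) -> bytes:
    """Return the raw bytes that represent `text` in the given encoding."""
    return text.encode(encoding)


sample = "h\u00e9llo"  # "héllo", an illustrative sample (assumption)

utf8 = bytes_in_memory(sample, "utf-8")       # narrow, variable-length bytes
utf16 = bytes_in_memory(sample, "utf-16-le")  # 2-byte units, little-endian

print(utf8.hex(" "))   # bytes of the UTF-8 form
print(utf16.hex(" "))  # bytes of the UTF-16 form
```

The two byte strings differ in both length and content even though the "string" a person reads in the editor is identical.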

Code point: the numeric index assigned to a letter (its position in a code page or, in Unicode, its U+XXXX number).

Code pages: non-ASCII values (values greater than 127) represent international characters. These code pages are used natively in Windows Me, and are also available on Windows NT and later.

0-31: ANSI unprintable

32-127: ANSI printable


128+: OEM charsets -> (codified into ANSI) ANSI code pages (IBM, M$)
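The ranges above can be illustrated with a short sketch. The choice of the three code pages and the helper name `decode_with_each` are assumptions for this example. A byte below 128 decodes to the same character everywhere, while a byte at or above 128 depends entirely on which code page is in effect.

```python
# Three code pages picked for illustration (an assumption, not from the notes):
# cp437 = original IBM PC OEM, cp1251 = Windows Cyrillic, cp1252 = Windows Western.
CODE_PAGES = ["cp437", "cp1251", "cp1252"]


def decode_with_each(raw: bytes) -> dict:
    """Decode the same bytes under every code page in CODE_PAGES."""
    return {cp: raw.decode(cp) for cp in CODE_PAGES}


low = decode_with_each(b"A")      # 0x41, below 128
high = decode_with_each(b"\xe9")  # 0xE9, above 127

print(low)   # the same letter under every code page
print(high)  # a different character under each code page
```

This is why a file of bytes carries no meaning above 127 unless the code page is known.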

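Before Unicode came into use, DBCS was used to handle text that mixes single-byte and double-byte characters. Joel calls this a messy system. The most prominent problem is finding character boundaries: for example, s++ and s-- cannot be used, and Windows' AnsiNext and AnsiPrev have to be used instead.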
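The boundary problem can be sketched as follows, using Shift-JIS (Python's `cp932` codec) as the DBCS example. The lead-byte test is a simplification, and the helper names `is_lead_byte` and `char_starts` are hypothetical. They only mimic the idea behind AnsiNext and are not the Windows API.

```python
def is_lead_byte(b: int) -> bool:
    """Simplified Shift-JIS lead-byte test (an assumption for this sketch)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def char_starts(raw: bytes) -> list:
    """Walk forward from the start, stepping 1 or 2 bytes per character."""
    starts, i = [], 0
    while i < len(raw):
        starts.append(i)
        i += 2 if is_lead_byte(raw[i]) else 1
    return starts


raw = "\u8868A".encode("cp932")  # U+8868 (a kanji) followed by ASCII 'A'
naive_hit = raw.find(b"\\")      # byte-wise search for a backslash (0x5C)
starts = char_starts(raw)

print(raw.hex(" "))
print(naive_hit, starts)
```

The byte-wise search reports a backslash at an offset that is not a character start: it is the trail byte of the kanji. Stepping with s++ or s-- has the same flaw, which is why boundary-aware helpers were needed.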

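Unicode, in its early 2-byte form, uses a fixed 2 bytes per character, which marks character boundaries cleanly.

However, it has the following characteristics: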

(1) Settled through debate: the set of non-ANSI characters in UTF-16. As a result, UTF-16 is in fact not limited to 65,536 possible characters.
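A small sketch of why UTF-16 is not capped at 65,536 characters: a code point above U+FFFF is stored as two 16-bit units (a surrogate pair). The helper name `utf16_units` and the two sample characters are assumptions for illustration.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for `ch`."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp = utf16_units("\u4e2d")          # U+4E2D, fits in one 16-bit unit
astral = utf16_units("\U0001F600")   # U+1F600, needs two units

print([hex(u) for u in bmp])
print([hex(u) for u in astral])
```

So even in UTF-16, "one character equals 2 bytes" does not hold for every character.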

(2) In UTF-16, every char below 128 is expanded to 2 bytes, which is incompatible with the original ANSI encoding: existing code has to be modified.
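The incompatibility in (2) can be sketched like this. The helper `c_strlen` is a hypothetical stand-in for how C-style code measures a `char*` string, and the sample text is an assumption.

```python
def c_strlen(raw: bytes) -> int:
    """Length up to the first 0 byte, the way C-style code scans a char*."""
    end = raw.find(b"\x00")
    return len(raw) if end == -1 else end


ansi = "Hi".encode("ascii")      # 1 byte per character
wide = "Hi".encode("utf-16-le")  # 2 bytes per character, high byte is 0

print(ansi.hex(" "), c_strlen(ansi))
print(wide.hex(" "), c_strlen(wide))
```

Old byte-oriented code sees the UTF-16 text as a one-character string, because the second byte of the first character is already 0.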

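The Windows API adopted UTF-16 starting with NT, which is why many API functions carry an A or W suffix.

(The "A" version handles text based on Windows code pages, while the "W" version handles Unicode text.)

For people in English-speaking countries, ANSI is in fact already enough.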

(3) With 2 bytes per unit, byte order naturally becomes an issue, so a BOM header has to be added to tell little-endian from big-endian.
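A minimal sketch of how the BOM identifies byte order. The function name `detect_utf16_byte_order` is hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs


def detect_utf16_byte_order(raw: bytes) -> str:
    """Inspect the first two bytes for a UTF-16 byte order mark."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    if raw.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    return "unknown"


le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")

print(le.hex(" "), detect_utf16_byte_order(le))
print(be.hex(" "), detect_utf16_byte_order(be))
```

Both byte sequences decode to the same text once the BOM has been used to pick the byte order; without it, the reader has to guess.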

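Because it wastes space, UTF-16 was given the cold shoulder for several years, until an improved design produced UTF-8.

UTF-8 is a variable-length encoding. The ANSI part (0-127) is encoded in 1 byte, so it interoperates seamlessly with ANSI and old code does not need to be changed.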

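Beyond that range, characters are encoded in 2 to 6 bytes (in the original design; UTF-8 as standardized today stops at 4 bytes). What they all have in common is that no byte is 0x0, other than the encoding of U+0000 itself. As a result, old string-processing code that wants to use a single 0 byte as the null-terminator will not blindly truncate strings.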
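The two UTF-8 properties above (variable length with a 1-byte ASCII range, and no embedded 0 bytes) can be checked with a short sketch. The sample characters and variable names are assumptions for illustration.

```python
# Sample characters chosen for illustration (assumptions):
samples = {
    "ascii": "A",            # U+0041
    "latin": "\u00e9",       # U+00E9
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600
}

# Number of UTF-8 bytes each character needs.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_same = "Hello".encode("utf-8") == "Hello".encode("ascii")

# Non-ASCII text never introduces a 0 byte.
encoded = "".join(samples.values()).encode("utf-8")
has_zero = 0 in encoded

print(lengths)
print(ascii_same, has_zero)
```

Because no 0 byte appears inside a multi-byte sequence, a null-terminated buffer holding UTF-8 text is still scanned to its real end by old byte-oriented code.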
