What’s the difference between an Encoding, Code Page, Character Set and Unicode?

Encoding, Code Page and Character Set are often used interchangeably, even when that isn't strictly correct.  There are some distinctions though:

A character is usually thought of as the smallest element of writing that has meaning.  It could be a punctuation mark, spacing character, letter, word, letter modifier or symbol.

A character set is a collection of characters that are useful together, usually for a particular script or scripts.  Sometimes people use "character set" as a synonym for code page.  A character set, however, can be a collection without any method of coding its members.  Similarly, a code page could contain multiple sets of characters.

Code Pages, also called Coded Character Sets, are character sets where each character has been assigned a numerical representation.  This allows characters to be mapped to binary values and back to the same character.  Code pages are often referenced by particular implementations, like Windows code page 1252.
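
As a rough illustration of the "table" idea, here is a minimal Python sketch. It is not from the original post: the helper names `code_page_value` and `code_page_char` are hypothetical, and Python's built-in `cp1252` and `cp437` codecs stand in for the Windows code pages. It shows that the same character gets a different number in different code pages, and that each number maps back to the same character.

```python
# Hypothetical helpers for illustration; Python's codecs stand in for code pages.

def code_page_value(char: str, code_page: str) -> int:
    """Return the number a single-byte code page assigns to a character."""
    encoded = char.encode(code_page)
    if len(encoded) != 1:
        raise ValueError(f"{code_page} does not map {char!r} to a single byte")
    return encoded[0]


def code_page_char(value: int, code_page: str) -> str:
    """Return the character a single-byte code page assigns to a number."""
    return bytes([value]).decode(code_page)


if __name__ == "__main__":
    for cp in ("cp1252", "cp437"):
        value = code_page_value("é", cp)
        # The round trip gives back the character we started with.
        assert code_page_char(value, cp) == "é"
        print(f"{cp}: 'é' <-> 0x{value:02X}")
    # prints:
    # cp1252: 'é' <-> 0xE9
    # cp437: 'é' <-> 0x82
```

The number alone is therefore meaningless without knowing which code page it came from.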

Encodings are a way to express a character set as actual coded data.  The term is often used interchangeably with code page, even within the MSDN documentation.  We often use the term Encodings in the managed .NET classes and code pages in Windows APIs.  Managed Encodings can even accept a code page number (although I'd recommend using the names when possible).  Generally I think of Code Pages as being similar to a table mapping characters to numbers, and Encodings as being how the characters get from character form to encoded byte form.
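
The distinction can be sketched in Python, used here only as a stand-in for the managed .NET classes the paragraph mentions. The helper `same_encoding` and the variable names are hypothetical. The sketch assumes Python's codec registry, where `cp1252` and `windows-1252` are two names for the same encoding, and shows characters going to bytes and back, plus what happens when the bytes are read with the wrong encoding.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """True when two encoding names resolve to the same codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


text = "café"

# Character form -> encoded byte form, under two different encodings.
as_cp1252 = text.encode("windows-1252")  # b'caf\xe9'
as_utf8 = text.encode("utf-8")           # b'caf\xc3\xa9'

# Decoding with the encoding that produced the bytes restores the text.
round_trip = as_utf8.decode("utf-8")

# Decoding with a different encoding silently produces the wrong characters.
misread = as_utf8.decode("windows-1252")  # 'cafÃ©'

if __name__ == "__main__":
    print(same_encoding("cp1252", "windows-1252"))
    print(as_cp1252, as_utf8, round_trip, misread)
```

Referring to the encoding by name makes the intent visible at the call site, which is one reading of why the author prefers names over bare code page numbers.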

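--->> Reposter's reading of this sentence: a code page is a lookup table of characters and their corresponding numbers, and an encoding is the process of converting characters to bytes according to such a table.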

Unicode is best described at www.unicode.org. The Unicode Standard says "The Unicode Standard is the universal character encoding scheme for written characters and text.  It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally..."  Basically it's an enormous character set that encompasses the characters of all the other character sets.  It's encoded in UTF-8, UTF-16 or UTF-32.  All three UTF encodings are representations of the same set of characters.  Windows and .NET (and many other systems) use Unicode natively, and it's the natural preferred encoding for .NET or Windows applications.
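
To make "representations of the same set of characters" concrete, here is a small Python sketch. The helper name `utf_forms` is hypothetical, and the little-endian variants are chosen only so that no byte order mark is emitted. It encodes one code point three ways and checks that every form decodes back to the same character.

```python
UTF_NAMES = ("utf-8", "utf-16-le", "utf-32-le")


def utf_forms(char: str) -> dict:
    """Return the hex bytes of one character in each UTF encoding."""
    return {name: char.encode(name).hex() for name in UTF_NAMES}


if __name__ == "__main__":
    for char in ("€", "😀"):
        forms = utf_forms(char)
        # Every form is a different byte sequence for the same code point.
        for name, hex_bytes in forms.items():
            assert bytes.fromhex(hex_bytes).decode(name) == char
        print(f"U+{ord(char):04X}", forms)
    # prints:
    # U+20AC {'utf-8': 'e282ac', 'utf-16-le': 'ac20', 'utf-32-le': 'ac200000'}
    # U+1F600 {'utf-8': 'f09f9880', 'utf-16-le': '3dd800de', 'utf-32-le': '00f60100'}
```

The second character lies outside the Basic Multilingual Plane, so UTF-16 needs two 16-bit units (a surrogate pair) for it, while UTF-32 always uses four bytes.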
