定义

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.

  • 字符集,就好比一本汉语字典,里面记录了很多文字(Symbol),以及顺序号(页码);
  • 比较方式,是一套规则,告诉我们如何比较字典里的文字。这套规则字典里并没有记录,是根据我们具体需求来确定的,例如:可以根据页码比较,可以根据发音比较,等等。

CHARSET

以latin类字符集为例,我们有latin1、latin2、latin5、latin7四种字符集。

这是什么意思呢?简单理解,latin类比东方语系,latin1、latin2就可以类比为汉语字典、日语字典。

mysql> show character set;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode               | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode            | utf16le_general_ci  |      4 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset       | binary              |      1 |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 |
+----------+-----------------------------+---------------------+--------+
40 rows in set (0.00 sec)

而latin1又有好几种比较方式,latin1_german1_ci、latin1_swedish_ci、latin1_danish_ci、latin1_german2_ci、latin1_bin等等。他们规定了latin1这本字典的不同排序方式。

mysql> SHOW COLLATION LIKE 'latin%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         | Yes      |       1 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         | Yes      |       1 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         | Yes      |       1 |
| latin1_general_cs   | latin1  | 49 |         | Yes      |       1 |
| latin1_spanish_ci   | latin1  | 94 |         | Yes      |       1 |
| latin2_czech_cs     | latin2  |  2 |         | Yes      |       4 |
| latin2_general_ci   | latin2  |  9 | Yes     | Yes      |       1 |
| latin2_hungarian_ci | latin2  | 21 |         | Yes      |       1 |
| latin2_croatian_ci  | latin2  | 27 |         | Yes      |       1 |
| latin2_bin          | latin2  | 77 |         | Yes      |       1 |
| latin5_turkish_ci   | latin5  | 30 | Yes     | Yes      |       1 |
| latin5_bin          | latin5  | 78 |         | Yes      |       1 |
| latin7_estonian_cs  | latin7  | 20 |         | Yes      |       1 |
| latin7_general_ci   | latin7  | 41 | Yes     | Yes      |       1 |
| latin7_general_cs   | latin7  | 42 |         | Yes      |       1 |
| latin7_bin          | latin7  | 79 |         | Yes      |       1 |
+---------------------+---------+----+---------+----------+---------+
19 rows in set (0.00 sec)

由上面两张表可见,Charset和Collation是一对多的关系;Charset和Default Collation是一对一的关系。

为什么要有Default Collation呢?看下面的例子就能理解:

CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

通常,创建Table的时候最多只指定了CHARSET,很少指定COLLATE。这时候系统必须给一个默认的值,否则行为就是未定义了。默认值就是Default Collation。

仅仅有Charset,没有Collation,系统会无法正确工作。就好比给你一本字典,问你“张”和“赵”,谁大?没有答案,但如果增加一个比较规则:“按照百家姓的顺序排序”,那么就可以得出“赵” > “张”了。

按照MySQL命名规则,如果知道了一个Collation,那么对应的Charset就能唯一确定。所以Create table的时候可以只指定COLLATE,CHARSET部分能自动推导出来。

Collation

每个字符串都有collation,可以显式指定:

SELECT _latin1'string' COLLATE latin1_danish_ci;

如果没有指定,则使用character_set_connection,collation_connection指定的值。

For the simple statement SELECT ‘string’, the string has the character set and collation defined by the character_set_connection and collation_connection system variables.

Coercible

两个collation不同的数据比较,应该如何操作?这里面也有一套规则指导我们报错,或者将其中一个collation转换成另外一个。具体的,参考MySQL文档, 它规定了一系列的级别,然后这套规则就根据级别比较来决定如何转换Collation。

  • An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)
  • The concatenation of two strings with different collations has a coercibility of 1.
  • The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
  • A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
  • The collation of a literal has a coercibility of 4.
  • NULL or an expression that is derived from NULL has a coercibility of 5.

这些规则级别被称为COERICIBILITY。MySQL提供了一个函数来显示一个值的规则级别:

mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
        -> 0
mysql> SELECT COERCIBILITY(VERSION());
        -> 3
mysql> SELECT COERCIBILITY('A');
        -> 4

字符集转换

字符集之间一般不能互相转换,但也有例外:一个字符集是例外一个字符集的子集时。例如Unicode和所有其他字符集之间,ASCII和GBK之间,等等。

参考文献

http://dev.mysql.com/doc/refman/5.7/en/charset.html