定义
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
- 字符集,就好比一本汉语字典,里面记录了很多文字(Symbol),以及顺序号(页码);
- 比较方式,是一套规则,告诉我们如何比较字典里的文字。这套规则字典里并没有记录,是根据我们具体需求来确定的,例如:可以根据页码比较,可以根据发音比较,等等。
CHARSET
以latin类字符集为例,我们有latin1、latin2、latin5、latin7四种字符集。
这是什么意思呢?简单理解,latin类比东方语系,latin1、latin2就可以类比为汉语字典、日语字典。
mysql> show character set;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 |
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
| cp866 | DOS Russian | cp866_general_ci | 1 |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 |
| macce | Mac Central European | macce_general_ci | 1 |
| macroman | Mac West European | macroman_general_ci | 1 |
| cp852 | DOS Central European | cp852_general_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
| utf16 | UTF-16 Unicode | utf16_general_ci | 4 |
| utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 |
| cp1256 | Windows Arabic | cp1256_general_ci | 1 |
| cp1257 | Windows Baltic | cp1257_general_ci | 1 |
| utf32 | UTF-32 Unicode | utf32_general_ci | 4 |
| binary | Binary pseudo charset | binary | 1 |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
+----------+-----------------------------+---------------------+--------+
40 rows in set (0.00 sec)
而latin1又有好几种比较方式,latin1_german1_ci、latin1_swedish_ci、latin1_danish_ci、latin1_german2_ci、latin1_bin等等。他们规定了latin1这本字典的不同排序方式。
mysql> SHOW COLLATION LIKE 'latin%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | Yes | 1 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | Yes | 1 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | Yes | 1 |
| latin1_general_cs | latin1 | 49 | | Yes | 1 |
| latin1_spanish_ci | latin1 | 94 | | Yes | 1 |
| latin2_czech_cs | latin2 | 2 | | Yes | 4 |
| latin2_general_ci | latin2 | 9 | Yes | Yes | 1 |
| latin2_hungarian_ci | latin2 | 21 | | Yes | 1 |
| latin2_croatian_ci | latin2 | 27 | | Yes | 1 |
| latin2_bin | latin2 | 77 | | Yes | 1 |
| latin5_turkish_ci | latin5 | 30 | Yes | Yes | 1 |
| latin5_bin | latin5 | 78 | | Yes | 1 |
| latin7_estonian_cs | latin7 | 20 | | Yes | 1 |
| latin7_general_ci | latin7 | 41 | Yes | Yes | 1 |
| latin7_general_cs | latin7 | 42 | | Yes | 1 |
| latin7_bin | latin7 | 79 | | Yes | 1 |
+---------------------+---------+----+---------+----------+---------+
19 rows in set (0.00 sec)
由上面两张表可见,Charset和Collation是一对多的关系;Charset和Default Collation是一对一的关系。
为什么要有Default Collation呢?看下面的例子就能理解:
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
通常,创建Table的时候最多只指定了CHARSET,很少指定COLLATE。这时候系统必须给一个默认的值,否则行为就是未定义了。默认值就是Default Collation。
仅仅有Charset,没有Collation,系统会无法正确工作。就好比给你一本字典,问你“张”和“赵”,谁大?没有答案,但如果增加一个比较规则:“按照百家姓的顺序排序”,那么就可以得出“赵” > “张”了。
按照MySQL命名规则,如果知道了一个Collation,那么对应的Charset就能唯一确定。所以Create table的时候可以只指定COLLATE,CHARSET部分能自动推导出来。
Collation
每个字符串都有collation,可以显式指定:
SELECT _latin1'string' COLLATE latin1_danish_ci;
如果没有指定,则使用character_set_connection,collation_connection指定的值。
For the simple statement SELECT ‘string’, the string has the character set and collation defined by the character_set_connection and collation_connection system variables.
Coercible
两个collation不同的数据比较,应该如何操作?这里面也有一套规则指导我们报错,或者将其中一个collation转换成另外一个。具体的,参考MySQL文档, 它规定了一系列的级别,然后这套规则就根据级别比较来决定如何转换Collation。
- An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)
- The concatenation of two strings with different collations has a coercibility of 1.
- The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
- A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
- The collation of a literal has a coercibility of 4.
- NULL or an expression that is derived from NULL has a coercibility of 5.
这些规则级别被称为COERICIBILITY。MySQL提供了一个函数来显示一个值的规则级别:
mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0
mysql> SELECT COERCIBILITY(VERSION());
-> 3
mysql> SELECT COERCIBILITY('A');
-> 4
字符集转换
字符集之间一般不能互相转换,但也有例外:一个字符集是例外一个字符集的子集时。例如Unicode和所有其他字符集之间,ASCII和GBK之间,等等。
参考文献
http://dev.mysql.com/doc/refman/5.7/en/charset.html