背景
数据库是用来存储和检索数据的,不同的语言和字符集具有不同的存储和检索方式。字符集规定了数据存储和检索的方式。
在讨论字符集时,需要搞清楚下面的概念:
1. 字符集(character set)是字母和符合的表示
2. 编码是某个字符集成员的内部表示
3. 校对(collation)规则是规定了字符如何进行比较
数据库的字符集包括字符集和校对规则两个概念。字符集用来定义字符串的存储方法,校对规则定义了字符串的校对方式。字符集和校对规则是一对多的关系。
MySQL字符集
MySQL 5.6.20支持40种字符集和219种校对规则。
使用SHOW CHARSET可以查看系统支持的字符集和缺省的校对规则,如big5对应的校对规则为big5_chinese_ci。
mysql> show charset;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
......
+----------+-----------------------------+---------------------+--------+
40 rows in set (0.00 sec)
使用SHOW COLLATION可以查看系统支持的校对规则。一种字符集可以支持多种校对规则,如big5字符集支持big_chinese_ci和big5_bin两种校对规则。
mysql> show collation;
+--------------------------+----------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+--------------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci | big5 | 1 | Yes | Yes | 1 |
| big5_bin | big5 | 84 | | Yes | 1 |
......
+--------------------------+----------+-----+---------+----------+---------+
219 rows in set (0.00 sec)
MySQL字符集设置
MySQL支持对服务器级、数据库级、表级及字段级进行设置。
服务器级字符集配置
使用show variables可以查看当前服务器的字符集和校对规则配置信息:
mysql> show variables like "%character%";
+--------------------------+---------------------------------------------------------+
| Variable_name | Value
|
+--------------------------+---------------------------------------------------------+
| character_set_client | utf8
|
| character_set_connection | utf8
|
| character_set_database | utf8
|
| character_set_filesystem | binary
|
| character_set_results | utf8
|
| character_set_server | utf8
|
| character_set_system | utf8
|
| character_sets_dir | C:\Program Files\MySQL\MySQL Server 5.6\share\charsets\ |
+--------------------------+---------------------------------------------------------+
8 rows in set (0.00 sec)
mysql> show variables like "%collation%";
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)
涉及服务器级别的系统配置信息包括character_set_server,collation_server, character_set_database等配置项,修改这些配置项的方式有多种:
1. 通过修改编译选项修改缺省的系统变量配置信息
shell> cmake . -DDEFAULT_CHARSET=latin1 -DDEFAULT_COLLATION=latin1_german1_ci
2. 通过启动参数修改系统变量配置信息
shell> mysqld
shell> mysqld --character-set-server=latin1
shell> mysqld --character-set-server=latin1 --collation-server=latin1_swedish_ci
3. 通过配置文件修改系统变量配置信息
如在Ubuntu16.04环境下,可以在/etc/mysql/my.cnf配置文件中追加下面配置,可以达到影响全局配置的效果:
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
init_connect='SET collation_connection = utf8_unicode_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_unicode_ci
4. 实时修改系统变量配置
mysql> set character_set_server=utf8;
Query OK, 0 rows affected (0.00 sec)
数据库级字符集配置
创建数据库时,指定字符集及校验字符集,如果没有指定,则使用系统变量character_set_database和collation_database的缺省值:
mysql> help create database;
Name: 'CREATE DATABASE'
Description:
Syntax:
CREATE {DATABASE | SCHEMA} [IF NOT EXISTS] db_name
[create_specification] ...
create_specification:
[DEFAULT] CHARACTER SET [=] charset_name
| [DEFAULT] COLLATE [=] collation_name
CREATE DATABASE creates a database with the given name. To use this
statement, you need the CREATE privilege for the database. CREATE
SCHEMA is a synonym for CREATE DATABASE.
URL: http://dev.mysql.com/doc/refman/5.6/en/create-database.html
使用SHOW CREATE DATABASE可以查看当前指定数据库的字符集相关配置信息
mysql> show create database sakila;
+----------+-----------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------+
| sakila | CREATE DATABASE `sakila` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+-----------------------------------------------------------------+
1 row in set (0.00 sec)
MySQL支持对已经创建的数据库的字符集配置进行修改
mysql> help alter database;
Name: 'ALTER DATABASE'
Description:
Syntax:
ALTER {DATABASE | SCHEMA} [db_name]
alter_specification ...
ALTER {DATABASE | SCHEMA} db_name
UPGRADE DATA DIRECTORY NAME
alter_specification:
[DEFAULT] CHARACTER SET [=] charset_name
| [DEFAULT] COLLATE [=] collation_name
表级及字段级字符集的配置
创建表时,可以设置表table_option来配置表的字符集和校对规则,同时也可以对字段级进行相关的配置:
mysql> help create table;
Name: 'CREATE TABLE'
Description:
Syntax:
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] tbl_name
(create_definition,...)
[table_options]
[partition_options]
......
table_option:
ENGINE [=] engine_name
| AUTO_INCREMENT [=] value
| AVG_ROW_LENGTH [=] value
| [DEFAULT] CHARACTER SET [=] charset_name
| CHECKSUM [=] {0 | 1}
| [DEFAULT] COLLATE [=] collation_name
| COMMENT [=] 'string'
| CONNECTION [=] 'connect_string'
......
data_type:
BIT[(length)]
......
| CHAR[(length)]
[CHARACTER SET charset_name] [COLLATE collation_name]
| VARCHAR(length)
[CHARACTER SET charset_name] [COLLATE collation_name]
| BINARY[(length)]
| VARBINARY(length)
| TINYBLOB
| BLOB
| MEDIUMBLOB
| LONGBLOB
| TINYTEXT [BINARY]
[CHARACTER SET charset_name] [COLLATE collation_name]
| TEXT [BINARY]
[CHARACTER SET charset_name] [COLLATE collation_name]
| MEDIUMTEXT [BINARY]
[CHARACTER SET charset_name] [COLLATE collation_name]
| LONGTEXT [BINARY]
[CHARACTER SET charset_name] [COLLATE collation_name]
| ENUM(value1,value2,value3,...)
[CHARACTER SET charset_name] [COLLATE collation_name]
| SET(value1,value2,value3,...)
[CHARACTER SET charset_name] [COLLATE collation_name]
| spatial_type
使用show create table table_name可以来查看相关的配置项。
通过alter可以对指定的表的字符集规则进行修改:
mysql> help alter table;
Name: 'ALTER TABLE'
Description:
Syntax:
ALTER [IGNORE] TABLE tbl_name
[alter_specification [, alter_specification] ...]
[partition_options]
alter_specification:
table_options
......
| CONVERT TO CHARACTER SET charset_name [COLLATE collation_name]
| [DEFAULT] CHARACTER SET [=] charset_name [COLLATE [=] collation_name]
......
如何选择合适的字符集
在满足应用的条件下,应该尽可能的选用较小的字符集。较小的字符集意味着减少了存储空间和减少网络传输的字节数,从而间接的提高了存储的性能。
如果在数据库创建阶段没有考虑选用正确的字符集,则后期更换时代价非常高,而且也是存在一定的风险的。因此,在项目初始阶段需要根据具体应用,选用正确的字符集,从而避免不必要的调整。
常用的中文字符集中,按字库大小升序排列:gb2312,gbk,utf8。由于有些偏僻的字在gb2312中无法存储,因此需要考虑偏僻字出现的概率及其影响,在不明确的情况下最好选用gbk。
由于MySQL支持对服务器级、数据库级、表级及字段级进行设置,因此在开发过程中,应该对具体应用情况进行细分,避免一刀切。
参考资料
3. MySQL必知必会