MySQL创建自定义哈希索引

如果存储引擎不支持哈希索引,则可以模拟像Memory存储引擎一样创建哈希索引,这样可以享受哈希索引的便利,例如只需要很小的索引就可以为超长的键创建索引。

思路很简单:在B-Tree基础上创建一个伪哈希索引。这和真正的哈希索引不是一回事。因为还是使用真正的哈希索引进行查找,但是它使用哈希值而不键本身进行索引查找。你需要做的就是在查询的where子句中手动指定使用哈希函数。 

下面我们来看一个例子:

mysql> CREATE TABLE pseudohash(
id int unsigned not null auto_increment,
url varchar(255) not null,
url_crc int unsigned not null default 0,
primary key(id)
);

Query OK, 0 rows affected

mysql> describe pseudohash;
+---------+------------------+------+-----+---------+----------------+
| Field   | Type             | Null | Key | Default | Extra          |
+---------+------------------+------+-----+---------+----------------+
| id      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| url     | varchar(255)     | NO   |     | NULL    |                |
| url_crc | int(10) unsigned | NO   |     | 0       |                |
+---------+------------------+------+-----+---------+----------------+
3 rows in set

我们可能会存储大量的url,并需要根据url进行搜索查找。如果使用B-Tree来存储URL,存储的内容就会很大,因为URL本身都很长。正常情况下,我们会在url列上建立索引,使用如下的方式来查找特定的url。

 创建url列上的索引

mysql> create unique index url_index on pseudohash (url);
Query OK, 0 rows affected
Records: 0  Duplicates: 0  Warnings: 0

mysql> show index from pseudohash;
+------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pseudohash |          0 | PRIMARY   |            1 | id          | A         |           1 | NULL     | NULL   |      | BTREE      |         |               |
| pseudohash |          0 | url_index |            1 | url         | A         |           1 | NULL     | NULL   |      | BTREE      |         |               |
+------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set

插入一条数据

mysql> insert into pseudohash (url) values('www.cnivi.com');
Query OK, 1 row affected

mysql> select * from pseudohash;
+----+---------------+------------+
| id | url           | url_crc    |
+----+---------------+------------+
|  3 | www.cnivi.com |          0 |
+----+---------------+------------+
2 rows in set

你可能会使用这种方式来查询

mysql> select url from pseudohash where url="www.cnivi.com";
+---------------+
| url           |
+---------------+
| www.cnivi.com |
+---------------+
1 row in set

这种方式可见的坏处就是url本身可能很长,使用B-Tree存储的内容太大

 解决方法就是删除掉url列上的索引,而新增一个被索引的url_crc列

如下面操作:

mysql> show index from pseudohash;
+------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pseudohash |          0 | PRIMARY   |            1 | id          | A         |           2 | NULL     | NULL   |      | BTREE      |         |               |
| pseudohash |          0 | url_index |            1 | url         | A         |           2 | NULL     | NULL   |      | BTREE      |         |               |
+------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set

mysql> drop index url_index on pseudohash;
Query OK, 0 rows affected
Records: 0  Duplicates: 0  Warnings: 0

mysql> show index from pseudohash;
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pseudohash |          0 | PRIMARY  |            1 | id          | A         |           1 | NULL     | NULL   |      | BTREE      |         |               |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set

mysql> create index url_crc_index on pseudohash (url_crc);
Query OK, 0 rows affected
Records: 0  Duplicates: 0  Warnings: 0

mysql> show index from pseudohash;
+------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pseudohash |          0 | PRIMARY       |            1 | id          | A         |           1 | NULL     | NULL   |      | BTREE      |         |               |
| pseudohash |          1 | url_crc_index |            1 | url_crc     | A         |           1 | NULL     | NULL   |      | BTREE      |         |               |
+------------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set

mysql>

使用CRC32做哈希,就可以使用下面这中方式查询:

mysql> insert into pseudohash (url,url_crc) values('www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz',crc32('www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz'));
Query OK, 1 row affected

mysql> select * from pseudohash;
+----+---------------------------------------------------+------------+
| id | url                                               | url_crc    |
+----+---------------------------------------------------+------------+
|  1 | www.ceit.com                                      | 3722524106 |
|  5 | www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz | 2788083859 |
+----+---------------------------------------------------+------------+
2 rows in set

mysql> select * from pseudohash where url = 'www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz' and url_crc = crc32('www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz');
+----+---------------------------------------------------+------------+
| id | url                                               | url_crc    |
+----+---------------------------------------------------+------------+
|  5 | www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz | 2788083859 |
+----+---------------------------------------------------+------------+
1 row in set

mysql>

where查询条件

where url = 'www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz' and url_crc = crc32('www.ceit.com/adasds/sdfsafwe/sadfsadfew/weqfvscxz');

这样做的性能就会很高,因为mysql优化器会使用这个选择性很高而体积很小的基于url_crc列的索引来完成查询。即使有很多个记录有相同的索引值,查找仍然很快,只需要根据哈希值做快速的整数比较就能找到索引条目,然后一一比较返回对应的行。对完整的URL字符串做索引,那就会非常慢。

处理哈希冲突的情况:要避免冲突问题,必须在where条件中带入哈希值和对应列值。

如果采用这种方式,记住不要使用SHA1()和MD5()作为哈希函数。因为这两个哈希函数计算出来的哈希值是非常长的字符串,会浪费大量空间,比较时也会更慢。

 创建触发器,维护哈希值

mysql> delimiter //
mysql> CREATE TRIGGER pseudohash_crc_insert BEFORE INSERT ON pseudohash FOR EACH ROW 
BEGIN 
SET NEW.url_crc = crc32(NEW.url);
END; //
Query OK, 0 rows affected

mysql> CREATE TRIGGER pseudohash_crc_update BEFORE UPDATE ON pseudohash FOR EACH ROW 
BEGIN 
SET NEW.url_crc = crc32(NEW.url);
END;//
Query OK, 0 rows affected

mysql> delimiter ;
mysql>

查看触发器

mysql> SELECT * FROM information_schema.triggers WHERE TRIGGER_NAME='pseudohash_crc_insert' \G
*************************** 1. row ***************************
           TRIGGER_CATALOG: def
            TRIGGER_SCHEMA: test
              TRIGGER_NAME: pseudohash_crc_insert
        EVENT_MANIPULATION: INSERT
      EVENT_OBJECT_CATALOG: def
       EVENT_OBJECT_SCHEMA: test
        EVENT_OBJECT_TABLE: pseudohash
              ACTION_ORDER: 0
          ACTION_CONDITION: NULL
          ACTION_STATEMENT: BEGIN
SET NEW.url_crc = crc32(NEW.url);
END
        ACTION_ORIENTATION: ROW
             ACTION_TIMING: BEFORE
ACTION_REFERENCE_OLD_TABLE: NULL
ACTION_REFERENCE_NEW_TABLE: NULL
  ACTION_REFERENCE_OLD_ROW: OLD
  ACTION_REFERENCE_NEW_ROW: NEW
                   CREATED: NULL
                  SQL_MODE: NO_ENGINE_SUBSTITUTION
                   DEFINER: root@localhost
      CHARACTER_SET_CLIENT: utf8
      COLLATION_CONNECTION: utf8_general_ci
        DATABASE_COLLATION: latin1_swedish_ci
1 row in set (0.09 sec)

mysql>

测试触发器

mysql> insert into pseudohash (url) values('www.12345.com');
Query OK, 1 row affected

mysql> select * from pseudohash where id = 6;
+----+---------------+------------+
| id | url           | url_crc    |
+----+---------------+------------+
|  6 | www.12345.com | 1421191900 |
+----+---------------+------------+
1 row in set

mysql> update pseudohash set url = 'www.111111.com' where id = 6;
Query OK, 1 row affected
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from pseudohash where id = 6;
+----+----------------+------------+
| id | url            | url_crc    |
+----+----------------+------------+
|  6 | www.111111.com | 1825805409 |
+----+----------------+------------+
1 row in set

mysql>

====END====