MySQL 全文索引统计 mysql 全文索引性能

转载

编程思想者 2023-12-03 15:46:09

文章标签 MySQL 全文索引统计全文索引搜索数据 文章分类 MySQL 数据库

MySQL 5.7.6 开始，引入了一个 ngram 全文分析器支持汉语无空格分隔符

事实上，MyISAM 存储引擎对全文索引的支持有很多的限制，例如表级别锁对性能的影响、数据文件的崩溃、崩溃后的恢复等，这使得 MyISAM 的全文索引对于很多的应用场景并不适合。所以，多数情况下的建议是使用别的解决方案，例如 Sphinx、Lucene 等等第三方的插件，亦或是使用 InnoDB 存储引擎的全文索引。

MySQL 5.6 以前的版本，只有 MyISAM 存储引擎支持全文索引；

MySQL 5.6 及以后的版本，MyISAM 和 InnoDB 存储引擎均支持全文索引;

支持的数据类型为 char、varchar、text 。

#创建表时创建全文索引

show VARIABLES like '%autocommit%';
use test;
create table fulltext_test (
id int(11) NOT NULL AUTO_INCREMENT,
content text NOT NULL,
tag varchar(255),
PRIMARY KEY (id),
FULLTEXT KEY content_tag_fulltext(content,tag)
) ENGINE=myisam DEFAULT CHARSET=utf8;
 
insert into fulltext_test(content,tag) values ('a','t1'),('b','t2'),('c','t3');
insert into fulltext_test(content,tag) values ('aa','t4'),('bb','t5'),('cc','t6');
insert into fulltext_test(content,tag) values ('aaa','t7'),('bbb','t8'),('ccc','t9');
insert into fulltext_test(content,tag) values ('aaaa','t10'),('bbbb','t11'),('cccc','t12');
commit;

#在已存在的表上创建全文索引

#创建全文索引时，可以设置并行度，通过参数 innodb_ft_sort_pll_degree 控制。

#create fulltext index content_tag_fulltext on fulltext_test(content,tag);

#或者：alter

#alter table fulltext_test add fulltext index content_tag_fulltext(content,tag);

#删除

#drop index content_tag_fulltext on fulltext_test;

#或者ALTER drop

#alter table fulltext_test drop index content_tag_fulltext;

#查询 match 和 against 关键字，需要完全匹配索引列

select * from fulltext_test where match(content,tag) against('a t1');

select * from fulltext_test where match(content,tag) against('aa t4');

select * from fulltext_test where match(content,tag) against('aaa t7');

select * from fulltext_test where match(content,tag) against('aaaa 10');

#查询包含a

select * from fulltext_test where match(content,tag) against('a*' in boolean mode);

#加or不会走索引

explain select * from fulltext_test where match(content,tag) against('a t1') or id=2;

----

use test;

create table test ( id int(11) unsigned not null auto_increment, content text not null, primary key(id), fulltext key content_index(content) ) engine=innodb default charset=utf8; insert into test (content) values ('a'),('b'),('c'); insert into test (content) values ('aa'),('bb'),('cc'); insert into test (content) values ('aaa'),('bbb'),('ccc'); insert into test (content) values ('aaaa'),('bbbb'),('cccc');

#未使用in则为精确匹配，前2个sql在默认配置中查询不到数据,mysql8中第1行在参数修改后仍未查询到结果

select * from test where match(content) against('a'); select * from test where match(content) against('aa'); select * from test where match(content) against('aaa');

select * from test where match(content) against('aaaa');

#查询包含a,mysql8中无法查询到第一行，默认为in languagemode可省略

select * from test.test where match(content) against('aa' in natural language mode);

select * from test.test where match(content) against('aa*' in boolean mode);

explain select * from test where match(content) against('aaa');

#搜索长度值域范围[最小搜索长度,最大搜索长度]

show variables like '%engine%';

#查询myisam引擎搜索长度值域

show variables like '%ft%word%';

#查询innodb引擎搜索长度值域

show variables like '%ft%token%';

#查询boolean修饰符

show variables like '%ft_boolean%';

#可以看到这两个变量在 MyISAM 和 InnoDB 两种存储引擎下的变量名和默认值

// MyISAM ft_min_word_len = 4; ft_max_word_len = 84; // InnoDB innodb_ft_min_token_size = 3; innodb_ft_max_token_size = 84;

#配置最小搜索长度

#read only variable

vi /etc/my.cnf

[mysqld] innodb_ft_min_token_size = 1 ft_min_word_len = 1

service msyqld restart

#repair修复索引，innodb不需要修复，或者直接删掉重新建立索引

repair table fulltext_test quick;

两种全文索引

1.自然语言的全文索引

默认情况下，使用 in natural language mode 修饰符时，match() 函数对文本集合执行自然语言搜索。

自然语言搜索引擎将计算每一个文档对象和查询的相关度。这里，相关度是基于匹配的关键词的个数，以及关键词在文档中出现的次数。在整个索引中出现次数越少的词语，匹配时的相关度就越高。相反，非常常见的单词将不会被搜索，如果一个词语的在超过 50% 的记录中都出现了，那么自然语言的搜索将不会搜索这类词语。上面提到的，测试表中必须有 4 条以上的记录，就是这个原因。

2.布尔全文索引

在布尔搜索match against in boolena中，通过一些前缀修饰符来定制搜索。

MySQL 内置的修饰符，ft_boolean_syntax + -><()~*:""&|

+ 必须包含该词
- 必须不包含该词
> 提高该词的相关性，查询的结果靠前
< 降低该词的相关性，查询的结果靠后
(*)星号通配符，只能接在词后面

---------------------------------------------------------------------------------------------------------------------------

mysql中文ngram全文索引

停止词(stopword)

一、什么是停止词？

不能用于搜索的词，如敏感词汇：法治、李子、器官移植等；非常常见的无任何具体含议的词汇： and、or、what 、好的、我们、你们、这样等等。

二、停止词相关参数

#innodb_ft_enable_stopword默认已经是ON

show variables like '%innodb%stop%';

innodb_ft_server_stopword_table和innodb_ft_user_stopword_table：指定停止词的 innodb 表information_schema.INNODB_FT_DEFAULT_STOPWORD，这两个参唯一的区别就是 innodb_ft_user_stopword_table 的优先级更高

select * from information_schema.INNODB_FT_DEFAULT_STOPWORD;

三、创建停止词

1、创建保存停止词的表

注：字符串长度设置不能少于 ngram_token_size * 字符长度，字符长度跟字符集有关，如：utf8 是一个中文字符占 3 个字节，如果 ngram_token_size 的值是 2 ，则 2 * 3 = 6，最少不能少于6个字节（个人理解）

create table test.stopword(value varchar(18));

insert into test.stopword values('法治),('李子'),('我们的'),('你们的');

commit;

select * from test.stopword;

四、innodb 全文索引的停止词如何设置？

set global innodb_ft_user_stopword_table='test/stopword';

#设置后未查询到值

show variables like '%innodb_ft_user_stopword_table%';

注：停止词更新后，需要重建全文索引才能生效，重建索引时，stopword 表中的停止词不再创建索引

中文分词

innodb 的中文全分词使用的是 ngram 支持，其算法是二元分词法，可以通过 ngram_token_size 参数设置分词的长度，默认是 2 ，该值越大，索引越大。

#ngram_token_size的值为2 说明以2个字符为一个词，可以设置 innodb_ft_aux_table 参数查看分词结果

show variables like '%ngram_token_size%';

show VARIABLES like '%innodb_ft_aux_table%';

一、查看分词结果

1、设置参数 innodb_ft_aux_table

#test为 schema 名， test6 为具有全文索引的表名

set global innodb_ft_aux_table='test/test6';

#若报错，可创建全文索引INNODB_FT_INDEX_CACHE表中缓存数据后后再尝试############################

[Err] 1231 - Variable 'innodb_ft_aux_table' can't be set to the value of 'test/test6'

2、查看分词结果

#可在建表时指定全文索引
CREATE TABLE `test6` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(90) DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `idx_name` (`name`) /*WITH PARSER `ngram` */
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8
 
insert into test.test6(name) values
('工地城工 有城'),
('adfadsfadsfadsf'),
('产戟 要棵地地要要地大叶黄杨'),
('宸ュ伐鍦板湴鏈夋湁鐨勭殑'),
('人人地工地'),
('霜国工直有直功发顺丰'),
('器官健康很重要'),
('要做就做大买卖'),
('官买官卖是不靠普的');
commit;
insert into test.test6(name) values ('做买卖了呀')；
commit;
 
select * from test6 where id >= 7;
 
#分词后的数据（全文索引）
#select * from information_schema.innodb_ft_index_table;
select * from information_schema.innodb_ft_index_cache;

innodb_ft_index_cache 和 innodb_ft_index_table 表的关系：

创建好全文索引后，所有的数据都保存在 innodb_ft_index_table 表中，当有新的数据 insert 后，这些新数据的全文索引保存在内存表中，即 innodb_ft_index_cache，当做optimize table 操作时，批量把表写入磁盘中，即 innodb_ft_index_table 表中。

innodb 如何创建支持中文的全文索引？

#用 alter table 创建全文索引

#对于中文全文索引必须指定 with parser ngram 关键字，否则innodb 默认的分词算法对中文的支持很差。

alter table test6 add fulltext key idx_fulltext_name(name) with parser ngram;

如何使用全文索引？

一、自然语言模式

select * from test6 ;

#使用全文索引检查

select * from test6 where match(name) against('买卖');

explain select * from test6 where match(name) against('买卖' in NATURAL LANGUAGE MODE);

二、布尔模式

select * from test6 where match(name) against('买卖' in boolean mode );

#如果不想要包含了 “要做”的行，则可以：

select * from test6 where match(name) against('买卖 -要做' in boolean mode);

#注：只有在布尔模式下才可用以上操作符。

布尔模式下支持以下操作符：

“+”表示必须包含

“-”表示必须排除

“>”表示出现该单词时增加相关性

“<”表示出现该单词时降低相关性

“*”表示通配符

“~”允许出现该单词，但是出现时相关性为负

“""”表示短语

no operation表示find word是可选的，如果出现，相关性会更高

#可以通过以下命令查看布尔模式支持的操作符：

#+ -><()~*:""&|

show variables like '%ft_boolean_syntax%';

如何维护全文索引？

一、DML操作对全文索引的影响

1、插入操作

插入操作较为简单，当往表中插入记录时，提交事务时会对全文索引上的列进行分词存储到FTS Index Cache（INNODB_FT_INDEX_CACHE），最后再批量更新到Auxiliary Table（INNODB_FT_INDEX_TABLE）中

2、删除操作

当提交删除数据的事务以后，不会删除Auxiliary Table中的数据，而只会删除FTS Index Cache中的数据。对于Auxiliary Table中被删除的记录，InnoDB存储引擎会记录其FTS Document Id，并将其保存在DELETED Auxiliary Table中。可以通过OPTIMIZE TABLE手动删除索引中的记录。

3、更新操作

4、查找操作

分为两步。第一步：根据检索词搜集符合条件的FTS_DOC_ID，在搜集满足条件的FTS_DOC_ID首先读取delete表中记录的FTS_DOC_ID，这些FTS_DOC_ID随后被用做过滤

第二步：根据FTS_DOC_ID找到对应的记录，找到的记录是根据相关性大小降序返回的

二、innodb 全文索引相关表：

select * from information_schema .INNODB_FT_DELETED;
select * from information_schema .INNODB_FT_INDEX_CACHE;
 
mysql> select table_schema, table_name from information_schema.tables where table_name like 'innodb_ft%';
 
+--------------------+----------------------------+
| table_schema       | table_name                 |
+--------------------+----------------------------+
| information_schema | INNODB_FT_CONFIG           |
| information_schema | INNODB_FT_BEING_DELETED    |
| information_schema | INNODB_FT_DELETED          |
| information_schema | INNODB_FT_DEFAULT_STOPWORD |
| information_schema | INNODB_FT_INDEX_TABLE      |
| information_schema | INNODB_FT_INDEX_CACHE      |
+--------------------+----------------------------+

INNODB_FT_DELETED ：保存的是innodb 表中删除的全文索引的doc_id，避免DML操作时重组全文索引。当 OPTIMIZE TABLE 操作时才更新重组全文索引。所以要时常的对具有全文索引的表进行 OPTIMIZE TABLE 操作，要不然INNODB_FT_DELETED会很大，导至性能问题

INNODB_FT_BEING_DELETED ：只有在作 OPTIMIZE TABLE 操作时才使用，这是一个中间表，使用的时间很短。

INNODB_FT_INDEX_CACHE ：当新插入数据时，保存新的全文索引，避免 insert 操作导致索引重组，当遇到 OPTIMIZE TABLE、关闭服务、超过 innodb_ft_cache_size或 innodb_ft_total_cache_size 的限制时才合并到主索引表中（INNODB_FT_INDEX_TABLE）。

INNODB_FT_CONFIG ：innodb 全文索上的刷新、停止词等信息。

注：要想正常使用INNODB_FT_DELETED、INNODB_FT_INDEX_CACHE等表，必需设置参数 innodb_ft_aux_table，否则看不到任何信息，该参数应该是用于调试使用

三、innodb 全文索引相关参数：

innodb_ft_aux_table：设置调式表

ngram_token_size ：分词长度

innodb_ft_server_stopword_table / innodb_ft_user_stopword_table ：设置停止词表，innodb_ft_user_stopword_table这个表的优先级更高。

innodb_ft_min_token_size / innodb_ft_max_token_size ：如果使用 ngram 全文索引中日韩语言插件，这 2 个参数不再有用。

innodb_ft_sort_pll_degree ：创建全文索引时的并行度。

innodb_ft_cache_size / innodb_ft_total_cache_size ：前一个是定议每个表的全文索引内存大小，后一个设置所有表的全文索引内存大小，如果全文索引大小超过 innodb_ft_total_cache_size 的设置，则强制同步（我想是fulltext index 的cache 大小与全文索引大小强制保持一致，意味着有内存中不能全部加载全文索引）被取消。

innodb_optimize_fulltext_only ：optimize table 操作时，只优化全文索引。set innodb_optimize_fulltext_only = 1;

#MySQL中与全文索引相关的几个变量：

SHOW VARIABLES LIKE 'ft%';

ft_query_expansion_limit 20 #查询括展时取最相关的几个值用作二次查询

ft_stopword_file (built-in) #全文索引的过滤词文件

50%的门坎限制(当查询结果很多，几乎所有记录都有，或者极少的数据，都有可能会返回非所期望的结果)，可用IN BOOLEAN MODE即可以避开50%的限制。

1. 只要使用 ><的总比没用的靠前；

2. 使用 >的一定比 <的排的靠前 (这就符合相关性提高和降低)；

3. 使用同一类的，使用的越早，排的越前。

#查询语法

1. ( )：可以通过括号来使用字条件。

eg: +aaa +(>bbb <ccc) // 找到有aaa和bbb和ccc，aaa和bbb，或者aaa和ccc(因为bbb，ccc前面没有+，所以表示可有可无)，然后 aaa&bbb > aaa&bbb&ccc > aaa&ccc