When a MySQL table holds a lot of data, a LIKE '%...%' query gets slow because a leading wildcard cannot use an ordinary index. MySQL full-text search is an alternative.
Creating the table
CREATE TABLE `training`.`t` (
`id` INT NOT NULL,
`content` LONGTEXT NOT NULL,
PRIMARY KEY (`id`));
INSERT INTO `training`.`t` (`id`, `content`) VALUES
  (1, '我爱'), (2, '我京'), (3, '北京'), (4, '爱北'),
  (5, '我'), (6, '爱'), (7, '北'), (8, '京'),
  (9, '京我'), (10, '我爱北'), (11, '我爱北京');
For English text, the default parser is sufficient:
ALTER TABLE `t` ADD FULLTEXT INDEX `idx`(`content`);
For Chinese text, use the ngram parser:
ALTER TABLE `t` ADD FULLTEXT INDEX `idx`(`content`) WITH PARSER ngram;
The ngram parser defaults to 2-character tokens, so "我爱北京" is tokenized as "我爱", "爱北", "北京", and queries must match these tokens. The token size is controlled by the ngram_token_size variable:
SHOW VARIABLES LIKE 'ngram_token_size';
SET ngram_token_size = 1;
But the variable cannot be changed at runtime; the SET statement fails with:
Error Code: 1238. Variable 'ngram_token_size' is a read only variable
So the variable must be set in the MySQL server's startup configuration instead.
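For example, a minimal option-file fragment (the path, e.g. /etc/my.cnf, depends on your installation; note that FULLTEXT ngram indexes created before the change must be dropped and rebuilt to pick up the new token size):

```ini
# my.cnf — set the ngram token size before the server starts
[mysqld]
ngram_token_size=1
```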
Querying
SELECT * FROM t WHERE MATCH (content) AGAINST ('我');      -- 0 rows
SELECT * FROM t WHERE MATCH (content) AGAINST ('我爱');    -- 3 rows: 1, 10, 11
SELECT * FROM t WHERE MATCH (content) AGAINST ('北京');    -- 2 rows: 3, 11
SELECT * FROM t WHERE MATCH (content) AGAINST ('爱北');    -- 3 rows: 4, 10, 11
SELECT * FROM t WHERE MATCH (content) AGAINST ('我爱北');  -- 2 rows: 10, 11
The results show that matching strictly follows the 2-character tokens: even though row 5 contains exactly "我", a full-text search for "我" returns nothing. For that case you need LIKE:
SELECT * FROM t WHERE content LIKE '%我%';  -- 6 rows: 1, 2, 5, 9, 10, 11
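The token-matching behavior above can be sketched in a few lines of Python. This is a toy model, not MySQL's implementation; it assumes a full-text query matches a row only when every bigram of the search term appears among the row's bigrams, which reproduces the row counts observed above:

```python
def ngram_tokens(text, n=2):
    # Slide an n-character window over the string, like MySQL's ngram parser
    # with ngram_token_size = n.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# The eleven rows inserted above.
rows = {1: "我爱", 2: "我京", 3: "北京", 4: "爱北", 5: "我", 6: "爱",
        7: "北", 8: "京", 9: "京我", 10: "我爱北", 11: "我爱北京"}

def fulltext_match(query, n=2):
    # Toy model: every bigram of the query must appear among the row's bigrams.
    # A query shorter than the token size produces no tokens and matches nothing.
    q = ngram_tokens(query, n)
    return sorted(rid for rid, text in rows.items()
                  if q and q <= ngram_tokens(text, n))

def like_match(substring):
    # Equivalent of: WHERE content LIKE '%substring%'
    return sorted(rid for rid, text in rows.items() if substring in text)

print(fulltext_match("我"))      # []
print(fulltext_match("我爱"))    # [1, 10, 11]
print(fulltext_match("我爱北"))  # [10, 11]
print(like_match("我"))          # [1, 2, 5, 9, 10, 11]
```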
So if single-character Chinese searches are required, ngram_token_size must be set to 1.
References
https://dev.mysql.com/doc/refman/8.3/en/fulltext-search-ngram.html