MySQL Full-Text Search: What MySQL Full-Text Indexes Are Good For

Reposted

mob6454cc798a0c 2023-08-14 17:20:50

Tags: MySQL full-text search, MySQL, search, ngram, index. Category: MySQL, Databases

Using MySQL Full-Text Indexes

Background
Understanding Full-Text Indexes

Creating an Index
Usage

IN NATURAL LANGUAGE MODE
IN BOOLEAN MODE
WITH QUERY EXPANSION

Parameters

Testing

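Background

I recently started work on a local-services project, and product search is one of the modules I own. While designing it I looked at several solutions. Most of them lean toward Elasticsearch plus a tokenizer, but after weighing our available resources and development time, I settled on MySQL full-text indexes. This also reflects a development principle of mine: when the future scale of the business is unclear, build the simplest thing that works, then adjust through trial and error.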

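Understanding Full-Text Indexes

InnoDB has supported full-text indexes since MySQL 5.6 (MyISAM supported them in earlier versions). A full-text index matches the words or tokens of a search string against the indexed text, which is closer to keyword search than to LIKE or regular-expression pattern matching. The index type is FULLTEXT. It is available only for the InnoDB and MyISAM storage engines, and only on CHAR, VARCHAR, and TEXT columns.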

Creating an Index

CREATE TABLE `goods_search` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `key_word` varchar(250) DEFAULT NULL COMMENT 'keyword',
  `goods_id` varchar(20) DEFAULT NULL COMMENT 'goods id',
  PRIMARY KEY (`id`),
  FULLTEXT KEY `full_goods_search` (`key_word`) 
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

or

ALTER TABLE goods_search ADD FULLTEXT INDEX full_goods_search (key_word);

or

CREATE FULLTEXT INDEX full_goods_search ON `goods_search` (`key_word`);

Usage

MATCH (col1,col2,...) AGAINST (expr [search_modifier])

For example:

SELECT * FROM goods_search WHERE MATCH(`key_word`) AGAINST('陈*' IN BOOLEAN MODE);

There are three search modes, selected with search_modifier (query expansion can be written in two ways):

search_modifier:
  {
       IN NATURAL LANGUAGE MODE (default)
     | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
     | IN BOOLEAN MODE
     | WITH QUERY EXPANSION
  }

IN NATURAL LANGUAGE MODE

The official documentation describes this mode as follows:

By default or with the IN NATURAL LANGUAGE MODE modifier, the MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST().

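My understanding is that the search string, for example “陈记顺和” below, is first split into tokens. With the ngram parser and a token size of 2 it becomes {陈记, 记顺, 顺和} (the token size is configurable). The tokens are then matched against the index, and the rows related to any of them are returned.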

[Figure 1]
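As a rough sketch of the natural-language search described above, the query below returns a relevance score alongside each matching row. It assumes the goods_search table from the earlier section, with a full-text index on key_word built with the ngram parser. The alias score is only illustrative.

```sql
-- Natural-language search (the default mode) against the key_word index.
-- Used in the SELECT list, MATCH ... AGAINST returns the relevance score.
SELECT id,
       key_word,
       MATCH(key_word) AGAINST('陈记顺和' IN NATURAL LANGUAGE MODE) AS score
FROM goods_search
WHERE MATCH(key_word) AGAINST('陈记顺和' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;
```

Rows that share more tokens with the search string should generally score higher, which makes the score column a convenient way to check how the tokenization behaves on real data.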

IN BOOLEAN MODE

Again, here is the official definition of this mode:

MySQL can perform boolean full-text searches using the IN BOOLEAN MODE modifier. With this modifier, certain characters have special meaning at the beginning or end of words in the search string.

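In this mode the search string is not split into tokens that are matched independently, as it is in NATURAL LANGUAGE MODE. With the ngram parser, a search term is treated as a phrase, so a row matches only if it contains the input as a whole: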

[Figure 2]

As shown above, searching for “顺和人” in NATURAL LANGUAGE MODE returns results, but the same search in BOOLEAN MODE does not.

[Figure 3]
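The "special meaning" characters mentioned in the quoted definition are the boolean operators. A minimal sketch, assuming the same goods_search table and made-up search terms:

```sql
-- '+' marks a term that must be present, '-' a term that must be absent.
-- Returns rows whose key_word contains 陈记 but not 顺和.
SELECT id, key_word
FROM goods_search
WHERE MATCH(key_word) AGAINST('+陈记 -顺和' IN BOOLEAN MODE);
```

This is where boolean mode is useful for product search: it lets the application require or exclude specific keywords instead of relying only on relevance ranking.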

WITH QUERY EXPANSION

Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION or IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search.

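My understanding of this mode is that it takes the results returned by IN NATURAL LANGUAGE MODE and runs a second tokenized search based on them. As the official documentation notes, this helps when the search phrase is very short: the second pass can return rows that are only implicitly related to it.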

[Figure 4]

As the figure shows, IN NATURAL LANGUAGE MODE returns only the row with id=1, while WITH QUERY EXPANSION returns two rows.
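The comparison can be reproduced with a pair of queries like the following. This is only a sketch: the search term 陈记 is an assumed example, and the rows actually returned depend on the data in goods_search.

```sql
-- First pass only: plain natural-language search.
SELECT id, key_word
FROM goods_search
WHERE MATCH(key_word) AGAINST('陈记' IN NATURAL LANGUAGE MODE);

-- Two passes: the most relevant rows from the first search are appended
-- to the search phrase, and the search is run again.
SELECT id, key_word
FROM goods_search
WHERE MATCH(key_word) AGAINST('陈记' WITH QUERY EXPANSION);
```

Because the second pass widens the search, it can also bring back unrelated rows, so it is best reserved for short search phrases.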

Parameters

The token length limits are controlled mainly by the following parameters.
For the InnoDB engine:

innodb_ft_max_token_size
innodb_ft_min_token_size

For the MyISAM engine:

ft_max_word_len
ft_min_word_len

[Figure 5]
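To see the values currently in effect, the variables can be inspected as sketched below. The InnoDB variable names used here are the documented ones, innodb_ft_min_token_size and innodb_ft_max_token_size. The table and index names are those from the creation example.

```sql
-- Inspect the current token length limits.
SHOW VARIABLES LIKE 'innodb_ft_%_token_size';  -- InnoDB
SHOW VARIABLES LIKE 'ft_%_word_len';           -- MyISAM

-- These limits are not dynamic: change them in the server configuration,
-- restart, then rebuild the index so existing rows are tokenized again.
ALTER TABLE goods_search DROP INDEX full_goods_search;
ALTER TABLE goods_search ADD FULLTEXT INDEX full_goods_search (key_word);
```

Until the index is rebuilt, it still reflects the limits that were in effect when it was created.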

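Note that if the full-text index is created with the ngram parser, the parameters above no longer apply. The token size is set by ngram_token_size instead, whose default value is 2. This parser handles Chinese much better.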

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser that supports Chinese, Japanese, and Korean (CJK). The ngram full-text parser is supported for use with InnoDB and MyISAM.

[Figure 6]
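For completeness, here is a minimal sketch of creating the index with the ngram parser, which the creation examples earlier do not show. It assumes MySQL 5.7.6 or later and that key_word has no FULLTEXT index yet (drop the existing one first otherwise).

```sql
-- Check the n-gram size (default 2). It is set at server startup,
-- for example in the configuration file, and cannot be changed at runtime.
SHOW VARIABLES LIKE 'ngram_token_size';

-- WITH PARSER ngram selects the built-in n-gram parser for this index.
ALTER TABLE goods_search
  ADD FULLTEXT INDEX full_goods_search (key_word) WITH PARSER ngram;
```

With the default size of 2, a keyword such as 陈记顺和 is indexed as the tokens 陈记, 记顺, and 顺和, which is what makes the two-character searches shown earlier match.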