1. Custom analyzers
- When the built-in analyzers do not meet your needs, you can create a custom analyzer that uses an appropriate combination of the following:
| tokenizer | A built-in or custom tokenizer. (Required) |
| char_filter | An optional array of built-in or custom character filters. |
| filter | An optional array of built-in or custom token filters. |
| position_increment_gap | When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value, to ensure that a phrase query does not match two terms from different array elements. Defaults to 100. |
"settings":{
"analysis": { # 自定义分词
"filter": {
"自定义过滤器": {
"type": "edge_ngram", # 过滤器类型
"min_gram": "1", # 最小边界
"max_gram": "6" # 最大边界
}
}, # 过滤器
"char_filter": {}, # 字符过滤器
"tokenizer": {}, # 分词
"analyzer": {
"自定义分词器名称": {
"type": "custom",
"tokenizer": "上述自定义分词名称或自带分词",
"filter": [
"上述自定义过滤器名称或自带过滤器"
],
"char_filter": [
"上述自定义字符过滤器名称或自带字符过滤器"
]
}
} # 分词器
}
}
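To see what the edge_ngram filter in the skeleton above would produce, you can wire it into a test index and call _analyze. A minimal sketch, assuming the hypothetical names autocomplete_filter and autocomplete_analyzer (the expected tokens are inferred from the min_gram/max_gram settings):
PUT test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": "1",
          "max_gram": "6"
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
POST test_index/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "Cats"
}
// expected tokens: [ c, ca, cat, cats ]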
// Setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, like standard or simple.
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "Is this <b>déjà vu</b>?"
}
// [ is, this, deja, vu ]
Here is a more complex example that combines the following:
Character Filter
Mapping Character Filter, configured to replace :) with _happy_ and :( with _sad_
Tokenizer
Pattern Tokenizer, configured to split on punctuation characters
Token Filters
- Lowercase Token Filter
- Stop Token Filter, configured to use the predefined list of English stopwords
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
//Defines the index's custom analyzer my_custom_analyzer. It uses the custom tokenizer, character filter and token filter defined later in the request.
"my_custom_analyzer": {
"type": "custom",
"char_filter": [
"emoticons"
],
"tokenizer": "punctuation",
"filter": [
"lowercase",
"english_stop"
]
}
},
"tokenizer": {
"punctuation": { // 定义自定义标点符号器。
"type": "pattern",
"pattern": "[ .,!?]"
}
},
"char_filter": {
"emoticons": { // 定义自定义表情符号字符过滤器。
"type": "mapping",
"mappings": [
":) => _happy_",
":( => _sad_"
]
}
},
"filter": {
"english_stop": { // 定义自定义english_stop令牌过滤器。
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "I'm a :) person, and you?"
}
// [ i'm, _happy_, person, you ]
1. Test an analyzer against a specific index
POST /discovery-user/_analyze
{
"analyzer": "analyzer_ngram",
"text":"i like cats"
}
2. Test an analyzer available to all indices
POST _analyze
{
"analyzer": "standard", # english,ik_max_word,ik_smart
"text":"i like cats"
}
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
", => "
]
}
},
"filter": {
"my_synonym_filter": {
"type": "synonym",
"expand": true,
"synonyms": [
"lileilei => leileili",
"hanmeimei => meimeihan"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"char_filter": [
"my_char_filter"
],
"filter": [
"my_synonym_filter"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": "\\;"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
----------------------------
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Li,LeiLei"
}
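With this text the mapping char filter strips the comma, the pattern tokenizer finds no ";" to split on, and the synonym rules likely never fire: the chain has no lowercase filter and synonym matching is case-sensitive by default, so the single token LiLeiLei does not match the lowercase rule lileilei. A sketch of an input that should trigger both synonym rules (the expected tokens are inferred from the filter definitions above, not verified output):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "lileilei;hanmeimei"
}
// expected tokens: [ leileili, meimeihan ]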
2. The IK analyzer
ik_max_word segments text at the finest granularity and emits overlapping terms, while ik_smart produces the coarsest split; compare the two modes below.
POST _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国人民大会堂"
}
POST _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国人民大会堂"
}
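With a typical IK dictionary the two modes differ sharply (illustrative output; the exact terms vary with the plugin and dictionary version):
// ik_smart (coarsest split): [ 中华人民共和国, 人民大会堂 ]
// ik_max_word (finest split): [ 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 大会堂, ... ]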
Tokenizers:
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
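Expected output, per the standard tokenizer's grammar-based word-boundary rules:
// [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]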
The letter tokenizer breaks text into terms whenever it encounters a character that is not a letter. It does a reasonable job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
POST _analyze
{
"tokenizer": "letter",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
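Expected output (the number 2 is dropped and dog's splits into dog and s, since only runs of letters survive):
// [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]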
3. The Suggest feature
3.1 suggest fields
- "preserve_separators": false, 这个设置为false,将忽略空格之类的分隔符
- "preserve_position_increments": true,如果建议词第一个词是停用词,我们使用了过滤停用词的分析器,需要将此设置为false
In short, the completion suggester is not meant for fuzzy "contains the keyword"-style matching; it completes keyword prefixes. For Chinese text there is no need for an analyzer such as ik or HanLP: the keyword analyzer, combined with stripping out unwanted stop words, is enough.
For example, suppose you are building autocompletion for railway station names and want both the input "上海" and the input "虹桥" to suggest "上海虹桥火车站". To do this with the completion suggester, the correct approach is to prepare two completion entries for the station name "上海虹桥火车站" (see the indexing sketch below):
"上海虹桥火车站"
"虹桥火车站"
This way the user gets the "上海虹桥火车站" suggestion whether they start typing with "上海" or with "虹桥".
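A sketch of how those two entries might be indexed, assuming a hypothetical index station whose suggestText field is a completion field as in section 3.3:
PUT station/suggest/1
{
  "suggestText": {
    "input": [
      "上海虹桥火车站",
      "虹桥火车站"
    ]
  }
}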
To implement mixed Chinese/pinyin completion suggestions, you therefore need three fields: a Chinese field, a full-pinyin field and a pinyin-initials field, all analyzed with the standard analyzer so that the Chinese text is split into single characters. Each character is then converted separately, which ensures the FST indexes the pinyin of individual characters; this should cover Chinese, English and pinyin suggestions.
Step 1: take the results of Chinese-prefix matching first. Full-pinyin matching would also return results at this point, but when homophones exist, a homophone with a higher weight can displace the intended characters and make the suggestions inaccurate.
Step 2: when there are not enough Chinese-prefix matches, fall back to full-pinyin matching, which also acts as pinyin spelling correction; this field indexes only the full pinyin.
Step 3: normally the first-letter field matches nothing in the steps above, so fall back to matching on pinyin initials; this field indexes only the initials.
Step 4: when the previous steps still do not produce enough suggestions, a fuzzy query can be used as a final fallback.
3.2 Using fuzzy queries
Fuzzy queries match documents based on edit distance, computed between the query term we supply and the indexed text.
The completion suggester supports fuzzy queries. Computing edit distances is CPU-intensive, so the following parameters should be set to limit the performance impact (see the sketch after this list):
1. prefix_length — the number of initial characters that will not be "fuzzified". Most spelling errors occur towards the end of a word, not at the beginning; by setting prefix_length to 3, for example, you can significantly reduce the number of matching terms.
2. min_length — the minimum input length before fuzzy matching kicks in.
3. Enable the fuzzy query only as a fallback, when there are not enough prefix matches.
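A sketch of such a fuzzy completion request (the index and field names follow sections 3.3 and 3.5; the parameter values are illustrative):
POST /ddd/_search
{
  "suggest": {
    "fuzzy_fallback": {
      "text": "shanghaa",
      "completion": {
        "field": "full_pinyin",
        "size": 10,
        "fuzzy": {
          "fuzziness": 1,
          "prefix_length": 3,
          "min_length": 4
        }
      }
    }
  }
}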
3.3 ES settings and mappings
{
"settings": {
"analysis": {
"analyzer": {
"prefix_pinyin_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"prefix_pinyin"
]
},
"full_pinyin_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"full_pinyin"
]
}
},
"filter": {
"_pattern": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": [
"([0-9])",
"([a-z])"
]
},
"prefix_pinyin": {
"type": "pinyin",
"keep_first_letter": true,
"keep_full_pinyin": false,
"none_chinese_pinyin_tokenize": false,
"keep_original": false
},
"full_pinyin": {
"type": "pinyin",
"keep_first_letter": false,
"keep_full_pinyin": true,
"keep_original": false,
"keep_none_chinese_in_first_letter": false
}
}
}
},
"mappings": {
"suggest": {
"properties": {
"id": {
"type": "string"
},
"suggestText": {
"type": "completion",
"analyzer": "standard",
"payloads": true,
"preserve_separators": false,
"preserve_position_increments": true,
"max_input_length": 50
},
"prefix_pinyin": {
"type": "completion",
"analyzer": "prefix_pinyin_analyzer",
"search_analyzer": "standard",
"preserve_separators": false,
"payloads": true
},
"full_pinyin": {
"type": "completion",
"analyzer": "full_pinyin_analyzer",
"search_analyzer": "full_pinyin_analyzer",
"preserve_separators": false,
"payloads": true
}
}
}
}
}
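At index time every document carries the same text in all three completion fields; the pinyin filters derive the full-pinyin and first-letter forms automatically. A sketch with an assumed index name and document values:
PUT my_index/suggest/1
{
  "id": "1",
  "suggestText": "上海虹桥火车站",
  "prefix_pinyin": "上海虹桥火车站",
  "full_pinyin": "上海虹桥火车站"
}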
3.4 Java API reference code
LinkedHashSet<String> returnSet = new LinkedHashSet<>();
Client client = elasticsearchTemplate.getClient();
SuggestRequestBuilder suggestRequestBuilder = client.prepareSuggest(elasticsearchTemplate.getPersistentEntityFor(SuggestEntity.class).getIndexName());
//full-pinyin prefix matching
CompletionSuggestionBuilder fullPinyinSuggest = new CompletionSuggestionBuilder("full_pinyin_suggest")
.field("full_pinyin").text(input).size(10);
//Chinese-character prefix matching
CompletionSuggestionBuilder suggestText = new CompletionSuggestionBuilder("suggestText")
.field("suggestText").text(input).size(size);
//pinyin first-letter prefix matching
CompletionSuggestionBuilder prefixPinyinSuggest = new CompletionSuggestionBuilder("prefix_pinyin_text")
.field("prefix_pinyin").text(input).size(size);
suggestRequestBuilder = suggestRequestBuilder.addSuggestion(fullPinyinSuggest).addSuggestion(suggestText).addSuggestion(prefixPinyinSuggest);
SuggestResponse suggestResponse = suggestRequestBuilder.execute().actionGet();
Suggest.Suggestion prefixPinyinSuggestion = suggestResponse.getSuggest().getSuggestion("prefix_pinyin_text");
Suggest.Suggestion fullPinyinSuggestion = suggestResponse.getSuggest().getSuggestion("full_pinyin_suggest");
Suggest.Suggestion suggestTextsuggestion = suggestResponse.getSuggest().getSuggestion("suggestText");
List<Suggest.Suggestion.Entry> entries = suggestTextsuggestion.getEntries();
//Chinese-character prefix matches first
for (Suggest.Suggestion.Entry entry : entries) {
List<Suggest.Suggestion.Entry.Option> options = entry.getOptions();
for (Suggest.Suggestion.Entry.Option option : options) {
returnSet.add(option.getText().toString());
}
}
//top up with full-pinyin suggestions
if (returnSet.size() < 10) {
List<Suggest.Suggestion.Entry> fullPinyinEntries = fullPinyinSuggestion.getEntries();
for (Suggest.Suggestion.Entry entry : fullPinyinEntries) {
List<Suggest.Suggestion.Entry.Option> options = entry.getOptions();
for (Suggest.Suggestion.Entry.Option option : options) {
if (returnSet.size() < 10) {
returnSet.add(option.getText().toString());
}
}
}
}
//top up with pinyin-initial suggestions
if (returnSet.size() == 0) {
List<Suggest.Suggestion.Entry> prefixPinyinEntries = prefixPinyinSuggestion.getEntries();
for (Suggest.Suggestion.Entry entry : prefixPinyinEntries) {
List<Suggest.Suggestion.Entry.Option> options = entry.getOptions();
for (Suggest.Suggestion.Entry.Option option : options) {
returnSet.add(option.getText().toString());
}
}
}
return new ArrayList<>(returnSet);
3.5 query DSL
GET /ddd/_search
{
  "suggest": {
    "text": "cy",
    "prefix_pinyin": {
      "completion": {
        "field": "prefix_pinyin",
        "size": 10
      }
    },
    "full_pinyin": {
      "completion": {
        "field": "full_pinyin",
        "size": 10
      }
    },
    "suggestText": {
      "completion": {
        "field": "suggestText",
        "size": 10
      }
    }
  }
}