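1.Introduction
When the built-in analyzers do not meet your needs, you can define a custom analyzer. A custom analyzer is made up of three parts, applied in this order: zero or more character filters, exactly one tokenizer, and zero or more token filters.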

2.Character Filters
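(1).Introduction
Character filters process the raw text before it reaches the tokenizer, for example by adding, removing, or replacing characters. Because they change the text, they affect the positions and offsets that the tokenizer reports afterwards. Elasticsearch ships three character filters: html_strip, which removes HTML tags and decodes HTML entities; mapping, which replaces configured strings; and pattern_replace, which replaces regular-expression matches. The last two must be configured before use, typically as named custom filters such as my_mapping and my_pattern.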
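To make the three character filters more concrete, here is a minimal Python sketch of what each one does to a string. It is only an approximation for illustration, not how Elasticsearch implements them: the function names are made up for this example, the tag-stripping regex is far cruder than html_strip, and no offset tracking is done.

```python
import html
import re


def html_strip(text):
    """Rough stand-in for html_strip: drop tags, then decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def mapping(text, mappings):
    """Rough stand-in for mapping: replace literal keys in a single pass.

    Longer keys are tried first, so the longest match at a position wins.
    """
    if not mappings:
        return text
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mappings[match.group(0)], text)


def pattern_replace(text, pattern, replacement):
    """Rough stand-in for pattern_replace: regex substitution."""
    return re.sub(pattern, replacement, text)


print(html_strip("<b>I&apos;m so happy!</b>"))                  # I'm so happy!
print(mapping(":) and :(", {":)": "_happy_", ":(": "_sad_"}))   # _happy_ and _sad_
print(pattern_replace("123-456-789", r"(\d+)-(?=\d)", r"\1_"))  # 123_456_789
```

All three change the length of the text, which is why the offsets in the response below refer back to the original input rather than to the filtered string.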

(2).query

POST /_analyze
{
"tokenizer": "keyword",
"char_filter": ["html_strip"],
"text": "<b>I&apos;m so happy!</b>"
}

Response:
{
"tokens" : [
{
"token" : "I'm so happy!",
"start_offset" : 3,
"end_offset" : 25,
"type" : "word",
"position" : 0
}
]
}

3.Tokenizer
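(1).Introduction
The tokenizer splits the text into individual tokens according to a set of rules. Built-in tokenizers include standard, which splits on word boundaries; letter, which splits on any non-letter character; whitespace, which splits on whitespace; and path_hierarchy, which splits a file path into its hierarchy levels.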
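The splitting rules of three of these tokenizers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Elasticsearch's implementation: the function names are hypothetical, only the token text is produced (no offsets, types, or positions), and the standard tokenizer is left out because its Unicode segmentation rules are more involved.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace and keep everything else."""
    return text.split()


def letter_tokenize(text):
    """Keep runs of letters; every non-letter character is a boundary."""
    return re.findall(r"[^\W\d_]+", text)


def path_hierarchy_tokenize(path, delimiter="/"):
    """Emit one token per level of the path, each including its ancestors."""
    tokens = []
    end = path.find(delimiter, 1)  # skip a leading delimiter
    while end != -1:
        tokens.append(path[:end])
        end = path.find(delimiter, end + 1)
    tokens.append(path)
    return tokens


print(whitespace_tokenize("a Hello,world!"))      # ['a', 'Hello,world!']
print(letter_tokenize("You're the 1st"))          # ['You', 're', 'the', 'st']
print(path_hierarchy_tokenize("/one/two/three"))  # ['/one', '/one/two', '/one/two/three']
```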

(2).query

POST /_analyze
{
"tokenizer": "path_hierarchy",
"text": "/one/two/three"
}

Response:
{
"tokens" : [
{
"token" : "/one",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "/one/two",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 0
},
{
"token" : "/one/two/three",
"start_offset" : 0,
"end_offset" : 14,
"type" : "word",
"position" : 0
}
]
}

4.Token Filters
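(1).Introduction
Token filters post-process the tokens that the tokenizer produces, and they run in the order in which they are listed. Built-in filters include lowercase, which converts every token to lowercase; stop, which removes stop words; and synonym, which adds synonyms.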
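As a rough illustration of how a chain of token filters transforms a token list, the Python sketch below mimics stop, lowercase, and ngram on plain strings. The names are hypothetical, the stop-word set is a small assumed subset rather than the default English list, and positions and offsets are ignored.

```python
# Small illustrative subset; the real default list is longer.
STOP_WORDS = {"a", "an", "and", "is", "the", "this"}


def stop(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set (case-sensitive)."""
    return [token for token in tokens if token not in stop_words]


def lowercase(tokens):
    """Convert every token to lowercase."""
    return [token.lower() for token in tokens]


def ngram(tokens, min_gram, max_gram):
    """Emit every substring of length min_gram..max_gram, grouped by start."""
    grams = []
    for token in tokens:
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    grams.append(token[start:start + size])
    return grams


tokens = ["a", "Hello", "world"]  # what a word tokenizer yields for "a Hello,world!"
for token_filter in (stop, lowercase, lambda t: ngram(t, 3, 4)):
    tokens = token_filter(tokens)
print(tokens)
# ['hel', 'hell', 'ell', 'ello', 'llo', 'wor', 'worl', 'orl', 'orld', 'rld']
```

Order matters: with stop placed before lowercase, a capitalized "A" would survive the stop filter in this sketch and only then be lowercased.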

(2).query

POST /_analyze
{
"tokenizer": "standard",
"text": "a Hello,world!",
"filter": [
"stop",
"lowercase",
{
"type": "ngram",
"min_gram": 3,
"max_gram": 4
}
]
}

Response:
{
"tokens" : [
{
"token" : "hel",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "hell",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "ell",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "ello",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "llo",
"start_offset" : 2,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "wor",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "worl",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "orl",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "orld",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "rld",
"start_offset" : 8,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}

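5.Custom Analyzer
(1).Introduction
A custom analyzer is defined in the index settings, under analysis.analyzer. You only need to specify its three parts: the character filters (char_filter), the tokenizer (tokenizer), and the token filters (filter).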
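The way the three parts fit together can be sketched as a small Python pipeline. This is a conceptual model only, not Elasticsearch code: build_analyzer and the stage functions are hypothetical names, the regex-based tag stripping and word splitting are crude stand-ins for html_strip and standard, and ASCII folding is approximated by removing combining marks after Unicode decomposition.

```python
import re
import unicodedata


def build_analyzer(char_filters, tokenizer, token_filters):
    """Chain the stages: char filters on text, then tokenizer, then token filters."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


def strip_tags(text):
    """Crude stand-in for html_strip: remove anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def word_tokenizer(text):
    """Crude stand-in for standard: runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [token.lower() for token in tokens]


def ascii_fold(tokens):
    """Approximate asciifolding: decompose, then drop combining marks."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


analyze = build_analyzer([strip_tags], word_tokenizer, [lowercase, ascii_fold])
print(analyze("Is this <b>a box</b>"))  # ['is', 'this', 'a', 'box']
```

The settings in the example that follows express the same chain declaratively, with one char_filter list, one tokenizer, and one filter list.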

(2).Example

PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}

POST /my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": "Is this <b>a box</b>"
}

Response:
{
"tokens" : [
{
"token" : "is",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "this",
"start_offset" : 3,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "box",
"start_offset" : 13,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}