Introduction: the concept of analysis (tokenization)
- Environment: Kibana + Elasticsearch
Try a Baidu search for "Java学习路线" (a Java learning roadmap).
The highlighted words in the results are exactly the keywords we typed. The keywords entered in the search box are first analyzed (split into tokens), then matched against the index, and only then are the results displayed.
The difference between ik_smart and ik_max_word
Let's use Kibana, together with the ES IK analyzer plugin, to see the effect of analysis.
Analyzer: ik_smart, also called the search (coarse-grained) analyzer:
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱你我的祖国母亲"
}
Result:
{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
Analyzer: ik_max_word, also called the index (fine-grained) analyzer:
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱你我的祖国母亲"
}
Result:
{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "爱你",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "你我",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "国母",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
The search analyzer ik_smart splits text at the coarsest granularity, while the index analyzer ik_max_word splits it at the finest granularity. As the results show, both analyzers break "祖国母亲" into separate tokens. What if we do not want it split? We can use a custom dictionary so that "祖国母亲" is stored as a single word.
Add the word to a custom dictionary, mydoc.dic (configuration sketched below).
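The original does not show how mydoc.dic is wired up. For the elasticsearch-analysis-ik plugin this is normally done through the plugin's IKAnalyzer.cfg.xml; the exact directory depends on how the plugin was installed (for example plugins/ik/config/ or config/analysis-ik/), so treat the paths below as an assumption, and remember that the node must be restarted before the new dictionary takes effect.

mydoc.dic (one word per line, UTF-8):
祖国母亲

IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- point ext_dict at the custom dictionary file (path assumed relative to this config) -->
  <entry key="ext_dict">mydoc.dic</entry>
  <entry key="ext_stopwords"></entry>
</properties>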
Re-run the analysis:
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱你我的祖国母亲"
}
Result:
{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "祖国母亲",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱你我的祖国母亲"
}
Result:
{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "爱你",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "你我",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "祖国母亲",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "国母",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}
As the results show, ik_smart now follows our custom dictionary and, at its coarse granularity, keeps "祖国母亲" as a single token; ik_max_word also emits it as an extra token alongside the finer-grained ones.
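This is also why ik_max_word is nicknamed the index analyzer and ik_smart the search analyzer: a common pattern is to index a field as fine-grained as possible and to analyze queries coarsely. A minimal mapping sketch (the index and field names here are invented for illustration, not part of the original walkthrough):

PUT /articles
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}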
The difference between text and keyword in ES
- Prepare test data
POST /book/novel/_bulk
{"index": {"_id": 1}}
{"name": "Gone with the Wind", "author": "Margaret Mitchell", "date": "2018-01-01"}
{"index": {"_id": 2}}
{"name": "Robinson Crusoe", "author": "Daniel Defoe", "date": "2018-01-02"}
{"index": {"_id": 3}}
{"name": "Pride and Prejudice", "author": "Jane Austen", "date": "2018-01-01"}
{"index": {"_id": 4}}
{"name": "Jane Eyre", "author": "Charlotte Bronte", "date": "2018-01-02"}
Run this in Kibana, then check the generated mapping (GET book/_mapping):
{
  "book" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "date" : {
          "type" : "date"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}
Both author and name are mapped as text, but each also gets a corresponding keyword sub-field (author.keyword, name.keyword).
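These sub-fields were produced by dynamic mapping. If we created the index explicitly, an equivalent mapping would look roughly like this (a sketch of what dynamic mapping generated, not a request the tutorial actually ran):

PUT /book
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "author": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "date": { "type": "date" }
    }
  }
}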
- Exact lookup: find the document whose name is the string "Gone with the Wind"
GET book/novel/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "Gone with the Wind"
        }
      },
      "boost": 1.2
    }
  }
}
The result is empty!
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Why is that? Because "Gone with the Wind" goes through the analyzer when it is indexed:
GET book/_analyze
{
  "field": "name",
  "text": "Gone with the Wind"
}
so what actually gets stored in ES is:
{
  "tokens" : [
    {
      "token" : "gone",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "with",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "wind",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
So a term lookup for name = "Gone with the Wind" can never match anything: the inverted index for name only contains the lowercased tokens gone, with, the and wind, never the whole string.
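By contrast, a match query analyzes the query string with the same analyzer before looking it up, so it does find the document. A sketch (not part of the original walkthrough); note that match combines the tokens with OR by default, so it would also match other documents containing any single token:

GET book/_search
{
  "query": {
    "match": {
      "name": "Gone with the Wind"
    }
  }
}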
Alternatively, for an exact match on the whole string we can query the keyword sub-field instead:
GET book/novel/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "Gone with the Wind"
        }
      },
      "boost": 1.2
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2,
    "hits" : [
      {
        "_index" : "book",
        "_type" : "novel",
        "_id" : "1",
        "_score" : 1.2,
        "_source" : {
          "name" : "Gone with the Wind",
          "author" : "Margaret Mitchell",
          "date" : "2018-01-01"
        }
      }
    ]
  }
}
and we match exactly one document. The reason is that name.keyword is not analyzed: the whole string is stored in ES as a single term, which is equivalent to:
GET book/_analyze
{
  "field": "name.keyword",
  "text": "Gone with the Wind"
}
Result:
{
  "tokens" : [
    {
      "token" : "Gone with the Wind",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
That is why the term query on name.keyword matches the document in the index directly!
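One caveat worth keeping in mind: keyword terms are stored verbatim, so matching is case- and whitespace-sensitive. A term query with different casing finds nothing (a sketch, not run in the original):

GET book/_search
{
  "query": {
    "term": {
      "name.keyword": "gone with the wind"
    }
  }
}

This returns zero hits, because the stored term is exactly "Gone with the Wind".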