Introduction: the concept of tokenization (分词)

  1. Environment: Kibana + Elasticsearch

Suppose we search Baidu for: Java学习路线 (Java learning path).

[Screenshot: Baidu search results with the matched keywords highlighted]


You can see that the highlighted characters are exactly the ones matched by our search keywords. The keywords we type into the search box are first tokenized, then matched against the index, and only then are the results returned.

The difference between ik_smart and ik_max_word

Let's use Kibana to demonstrate the effect of each analyzer.

We'll use Elasticsearch's IK analyzers:
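ik_smart and ik_max_word are not built into Elasticsearch itself; they come from the IK analysis plugin (elasticsearch-analysis-ik). If the plugin is not installed yet, the install command typically looks like the following (the version in the URL is an assumption and must match your Elasticsearch version; restart the node afterwards):

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.0/elasticsearch-analysis-ik-7.4.0.zip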

Type: ik_smart, known as the search analyzer

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

Type: ik_max_word, known as the index analyzer

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "爱你",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "你我",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "国母",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

The search analyzer ik_smart splits text at the coarsest granularity, while the index analyzer ik_max_word splits it at the finest granularity. As the two results show, both of them break "祖国母亲" (motherland) into smaller terms. What if we don't want it split? We can use a custom dictionary so that "祖国母亲" is stored as a single term.

Custom dictionary: mydoc.dic

[Screenshots: creating the custom dictionary mydoc.dic and configuring the IK plugin to load it]
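The screenshots are not reproduced here. As a rough sketch (the file paths are an assumption and depend on how the IK plugin was installed), the dictionary is a plain UTF-8 text file with one term per line, placed in the IK plugin's config directory and registered in IKAnalyzer.cfg.xml:

mydoc.dic (one term per line, UTF-8):

祖国母亲

IKAnalyzer.cfg.xml (in the IK plugin's config directory):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- load the local extension dictionary -->
    <entry key="ext_dict">mydoc.dic</entry>
</properties>

After editing the dictionary and configuration, restart Elasticsearch (or use IK's remote-dictionary hot reload) so the new terms take effect before re-running the analysis.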

Re-run the analysis:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "祖国母亲",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

And with ik_max_word:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "爱你",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "你我",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "祖国母亲",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "国母",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

As the results show, ik_smart now follows our custom dictionary and keeps "祖国母亲" as a single coarse-grained term; ik_max_word also emits "祖国母亲" as a term, alongside the finer-grained splits 祖国, 国母 and 母亲.
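In a real mapping the two analyzers are usually combined: ik_max_word as the index-time analyzer (hence "index analyzer") and ik_smart as the search-time analyzer (hence "search analyzer"). A minimal sketch, with a made-up index and field name:

PUT /articles
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}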

The difference between text and keyword in Elasticsearch

  1. Prepare the test data
POST /book/novel/_bulk
{"index": {"_id": 1}}
{"name": "Gone with the Wind", "author": "Margaret Mitchell", "date": "2018-01-01"}
{"index": {"_id": 2}}
{"name": "Robinson Crusoe", "author": "Daniel Defoe", "date": "2018-01-02"}
{"index": {"_id": 3}}
{"name": "Pride and Prejudice", "author": "Jane Austen", "date": "2018-01-01"}
{"index": {"_id": 4}}
{"name": "Jane Eyre", "author": "Charlotte Bronte", "date": "2018-01-02"}

The result after running it in Kibana:

[Screenshot: bulk indexing response in Kibana]


Check the mapping (GET book/_mapping):

{
  "book" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "date" : {
          "type" : "date"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

Both author and name are mapped as text, but each also gets a keyword sub-field (author.keyword, name.keyword) from the default dynamic mapping.
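If you prefer to define this mapping explicitly instead of relying on dynamic mapping, a minimal sketch looks like this (it has to be created before any documents are indexed):

PUT /book
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "author": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "date": { "type": "date" }
    }
  }
}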

  1. Exact (term) search for name = Gone with the Wind
GET book/novel/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "Gone with the Wind"
        }
      },
      "boost": 1.2
    }
  }
}

The result is empty!

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Why is that?

The reason is that "Gone with the Wind" is processed by the analyzer when it is indexed:

GET book/_analyze
{
  "field": "name",
  "text": "Gone with the Wind"
}

so what actually gets stored in Elasticsearch is:

{
  "tokens" : [
    {
      "token" : "gone",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "with",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "wind",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

So a term query for name = "Gone with the Wind" can never match anything: the inverted index only contains the lowercase tokens gone, with, the and wind, and a term query does not analyze its input.
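As an aside, a match query on the analyzed name field would still hit this document, because the query string is run through the same analyzer before matching. A quick sketch:

GET book/_search
{
  "query": {
    "match": {
      "name": "Gone with the Wind"
    }
  }
}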

But if we query the keyword sub-field instead:

GET book/novel/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "Gone with the Wind"
        }
      },
      "boost": 1.2
    }
  }
}

we get:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2,
    "hits" : [
      {
        "_index" : "book",
        "_type" : "novel",
        "_id" : "1",
        "_score" : 1.2,
        "_source" : {
          "name" : "Gone with the Wind",
          "author" : "Margaret Mitchell",
          "date" : "2018-01-01"
        }
      }
    ]
  }
}

we match exactly one document.

The reason is that name.keyword is not analyzed; the whole string is stored in Elasticsearch as a single term, which we can verify with:

GET book/_analyze
{
  "field": "name.keyword",
  "text": "Gone with the Wind"
}

Result:

{
  "tokens" : [
    {
      "token" : "Gone with the Wind",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

So the term query against name.keyword matches the document in the index directly!