Elasticsearch in Plain Language 10: A Deep Dive into Search, Using dis_max to Implement the best fields Strategy for Multi-Field Search

Original article

小小工匠 2021-05-31 17:14:44


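© Copyright belongs to the author. This is an original work by 小小工匠, a 51CTO blog author. Please contact the author for permission before reposting; otherwise legal liability will be pursued.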

Contents

Overview
TF/IDF
Links
Example
best fields strategy: dis_max


Overview

Continuing to learn Elasticsearch from instructor 中华石杉; this is part 10.

Course link: https://www.roncoo.com/view/55

TF/IDF

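TF/IDF is Apache Lucene's classic scoring mechanism (since Elasticsearch 5.0 the default similarity is BM25, which still builds on TF and IDF).

TF (Term Frequency): term-based (see term vector below); it expresses how many times a term occurs in a given document.
The higher the term frequency, the higher the document's score.
IDF (Inverse Document Frequency): term-based; it tells the scoring formula how rare the term is.
The higher the inverse document frequency, the rarer the term. The scoring formula uses this factor to boost documents that contain rare terms.

term vector: a term vector is a miniature inverted index for a single document. Each dimension of a term vector is a pair of a term and its frequency, and it can also carry the term's position information. Lucene and ES both disable term vector indexing by default; certain features, such as some forms of highlighting, require enabling it.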
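As a rough illustration of the two factors above, here is a minimal Python sketch that computes a textbook TF and IDF over the five title values indexed later in this article. The formulas (raw count for TF, ln(N / df) for IDF), the naive tokenizer, and the helper names are assumptions chosen for clarity; Lucene's real similarity applies additional smoothing and length normalization, so the numbers will not match Elasticsearch scores.

```python
import math

# Titles of the five example documents indexed later in this article.
titles = [
    "this is java and elasticsearch blog",
    "this is java blog",
    "this is elasticsearch blog",
    "this is java, elasticsearch, hadoop blog",
    "this is spark blog",
]


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, drop commas, split on whitespace.
    return text.lower().replace(",", " ").split()


docs = [tokenize(t) for t in titles]


def tf(term, doc):
    # Term frequency: how many times the term occurs in one document.
    return doc.count(term)


def idf(term, docs):
    # Inverse document frequency: the fewer documents contain the term,
    # the larger the value. Returns 0.0 for a term that occurs nowhere.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


for term in ("blog", "java", "spark"):
    print(term, tf(term, docs[1]), round(idf(term, docs), 3))
```

In this toy corpus "blog" occurs in every title, so its IDF is zero, while "spark" occurs in only one title and gets the largest IDF of the three.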

Links

Official guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/_tuning_best_fields_queries.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.2/query-dsl-dis-max-query.html

On dis_max seeming not to take effect when the data set is small: https://stackoverflow.com/questions/38065692/dis-max-query-isnt-looking-for-the-best-matching-clause

A related article by another blogger:
https://blog.csdn.net/dm_vincent/article/details/41820537

Example

Elasticsearch version: 6.4.1

To make the effect easy to see, we delete the forum index used earlier and rebuild it with the following DSL:


DELETE /forum

PUT /forum
{ "settings" : { "number_of_shards" : 1 }}

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag":["java","hadoop"]}}
{"update":{"_id":"2"}}
{"doc":{"tag":["java"]}}
{"update":{"_id":"3"}}
{"doc":{"tag":["hadoop"]}}
{"update":{"_id":"4"}}
{"doc":{"tag":["java","elasticsearch"]}}


POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag_cnt":2}}
{"update":{"_id":"2"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"3"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"4"}}
{"doc":{"tag_cnt":2}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"view_cnt":30}}
{"update":{"_id":"2"}}
{"doc":{"view_cnt":50}}
{"update":{"_id":"3"}}
{"doc":{"view_cnt":100}}
{"update":{"_id":"4"}}
{"doc":{"view_cnt":80}}






POST /forum/article/_bulk
{"index":{"_id":5}}
{"articleID":"DHJK-B-1395-#Ky5","userID":3,"hidden":false,"postDate":"2019-06-01","tag":["elasticsearch"],"tag_cnt":1,"view_cnt":10}

POST /forum/article/_bulk
{"update":{"_id":"5"}}
{"doc":{"postDate":"2019-05-01"}}



POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"title":"this is java and elasticsearch blog"}}
{"update":{"_id":"2"}}
{"doc":{"title":"this is java blog"}}
{"update":{"_id":"3"}}
{"doc":{"title":"this is elasticsearch blog"}}
{"update":{"_id":"4"}}
{"doc":{"title":"this is java, elasticsearch, hadoop blog"}}
{"update":{"_id":"5"}}
{"doc":{"title":"this is spark blog"}}


POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"content":"i like to write best elasticsearch article"}}
{"update":{"_id":"2"}}
{"doc":{"content":"i think java is the best programming language"}}
{"update":{"_id":"3"}}
{"doc":{"content":"i am only an elasticsearch beginner"}}
{"update":{"_id":"4"}}
{"doc":{"content":"elasticsearch and hadoop are all very good solution, i am a beginner"}}
{"update":{"_id":"5"}}
{"doc":{"content":"spark is best big data solution based on scala ,an programming language similar to java"}}

The data is now in place. Before looking at how dis_max works, list all the documents:

GET /forum/article/_search

The data looks like this:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 1,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "3",
        "_score": 1,
        "_source": {
          "articleID": "JODL-X-1937-#pV7",
          "userID": 2,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "hadoop"
          ],
          "tag_cnt": 1,
          "view_cnt": 100,
          "title": "this is elasticsearch blog",
          "content": "i am only an elasticsearch beginner"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_score": 1,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": [
            "java",
            "elasticsearch"
          ],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2019-05-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      }
    ]
  }
}

Regular query

First, look at an ordinary bool query:

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "java solution"
          }
        },
        {
          "match": {
            "content": "java solution"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.5179626,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 1.5179626,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.4233948,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2019-05-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_score": 1.2832261,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": [
            "java",
            "elasticsearch"
          ],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.4889865,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}

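Let's analyze the results.

The course explains the relevance score of a bool query with should clauses as follows: add up the scores of the matching queries, multiply by the number of matching queries, and divide by the total number of queries. (This describes the coordination factor of the classic TF/IDF similarity. The 6.4.1 responses in this article suggest no such factor was applied here: doc5 and doc1 each match only one clause, yet their bool scores equal their dis_max scores further below. The numbers in this walkthrough are therefore illustrative only.)

Score of doc2:

{ "match": { "title": "java solution" }} produces a score for doc2.
{ "match": { "content": "java solution" }} also produces a score for doc2.

Suppose the two scores are 1.1 and 1.2, so their sum is 1.1 + 1.2 = 2.3.
Number of matched queries = 2
Total number of queries = 2

2.3 * 2 / 2 = 2.3

Score of doc5:

{ "match": { "title": "java solution" }} produces no score for doc5.
{ "match": { "content": "java solution" }} produces a score for doc5.

So only one query contributes a score, say 2.3.
Number of matched queries = 1
Total number of queries = 2

2.3 * 1 / 2 = 1.15

Score of doc5 = 1.15 < score of doc2 = 2.3

The document with id=2 ranks first, but we would rather see id=5 first, since its content field contains both java and solution. Let's see what dis_max does.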
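The arithmetic in this walkthrough can be sketched in a few lines of Python. This is only a model of the formula as the course states it (sum of matching scores, times matched clauses, divided by total clauses); the function name is hypothetical and the input scores are the made-up values from the walkthrough, not real Elasticsearch output.

```python
def bool_should_score(clause_scores):
    """Score a bool/should query the way the walkthrough describes.

    clause_scores holds one entry per should clause;
    None means that clause did not match the document.
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    # Sum of matching clause scores * matched clauses / total clauses.
    return sum(matched) * len(matched) / len(clause_scores)


# Hypothetical per-clause scores taken from the walkthrough.
doc2 = bool_should_score([1.1, 1.2])   # title and content both match
doc5 = bool_should_score([None, 2.3])  # only content matches

print(doc2, doc5)
```

Under this model a document that matches weakly in two fields can outrank one that matches strongly in a single field, which is exactly the ranking problem the best fields strategy is meant to address.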

dis_max query


GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "java solution"
          }
        },
        {
          "match": {
            "content": "java solution"
          }
        }
      ]
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.4233948,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.4233948,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2019-05-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.93952733,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_score": 0.79423964,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": [
            "java",
            "elasticsearch"
          ],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.4889865,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}
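With dis_max, id=5 now ranks first. The reason is that dis_max scores a document by its single best-matching clause instead of adding the clauses together. The Python sketch below models that rule, including the optional tie_breaker parameter of the dis_max query, and replays it on the scores from the two responses in this article. The function name is hypothetical, and the per-clause scores are derived under the assumption that each bool score above is the plain sum of its matching clause scores.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    # Best clause score, plus tie_breaker times the other matching clause scores.
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# (bool score, dis_max score) per document id, copied from the two responses.
observed = {
    "2": (1.5179626, 0.93952733),
    "5": (1.4233948, 1.4233948),
    "4": (1.2832261, 0.79423964),
    "1": (0.4889865, 0.4889865),
}

# Assumption: bool score = sum of clause scores, so when the two scores differ
# the weaker clause is the difference between them.
clauses = {}
for doc_id, (bool_score, best) in observed.items():
    if bool_score > best:
        clauses[doc_id] = [best, bool_score - best]
    else:
        clauses[doc_id] = [best]

ranking = sorted(clauses, key=lambda d: dis_max_score(clauses[d]), reverse=True)
print(ranking)  # ['5', '2', '4', '1'], the order in the dis_max response
```

Setting tie_breaker to 1.0 in this sketch adds the weaker clause back in full, which reproduces the bool ordering with id=2 first. Values between 0 and 1 let the other fields contribute without overriding the best field.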