November is here!

  • Custom analyzer
PUT user
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer":{
          "tokenizer":"my_piniyin"
        }
      },
      "tokenizer": {
        "my_piniyin":{
          "type":"pinyin",
          "keep_full_pinyin":true,
          "keep_original":true,
          "limit_first_letter_length":16,
          "lowercase":true,
          "remove_duplicated_term":true,
          "keep_separate_first_letter":false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword",
        "fields": {
          "my_pinyin":{
            "type":"text",
            "analyzer":"pinyin_analyzer"
          }
        }
      }
    }
  }
}

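First we create an index with the settings above. The settings define a custom analyzer named pinyin_analyzer, backed by a pinyin tokenizer named my_piniyin, and the mapping exposes that analyzer on the text sub-field name.my_pinyin. Of the tokenizer options, the one that matters most to me is keep_full_pinyin: true, which emits the full pinyin of every Chinese character. Next, let's run the analyzer on the name 刘德华.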
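The request that produced the response below is not shown here. A minimal sketch of what it could look like, assuming the standard _analyze API on the user index and a cluster at localhost:9200 (the address and the helper name build_analyze_request are my assumptions). The snippet only builds the request; it does not send it.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster, adjust as needed


def build_analyze_request(index: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST <index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("user", "pinyin_analyzer", "刘德华")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the result to urllib.request.urlopen would actually send it, provided the pinyin plugin is installed on the cluster.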

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}

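As you can see, the pinyin analyzer has split 刘德华 into fairly detailed tokens: the pinyin of each character (liu, de, hua), the original text, and the first-letter abbreviation ldh. A term query on any of these tokens finds the document through the inverted index, which is quite handy.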
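To make the "term query against the inverted index" idea concrete, here is a toy inverted index in Python. It is only an illustration of the lookup, not how Elasticsearch stores data. The tokens for document "1" are copied from the response above; document "2" and its tokens are hypothetical.

```python
from collections import defaultdict

# Doc "1": tokens copied from the _analyze response above.
# Doc "2": hypothetical, added only so the lookup has something to exclude.
analyzed_docs = {
    "1": ["liu", "刘德华", "ldh", "de", "hua"],
    "2": ["zhang", "张学友", "zxy", "xue", "you"],
}

# token -> set of document ids that contain it
inverted_index = defaultdict(set)
for doc_id, tokens in analyzed_docs.items():
    for token in tokens:
        inverted_index[token].add(doc_id)


def term_lookup(term: str) -> list:
    """Exact lookup of an unanalyzed term, which is what a term query does."""
    return sorted(inverted_index.get(term, set()))


print(term_lookup("ldh"))
print(term_lookup("hua"))
print(term_lookup("LDH"))  # term queries are not analyzed, so case matters
```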

  • alias: index aliases
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies",
        "alias": "myindex2",
        "filter": {
          "range": {
            "year": {
              "gte": 1
            }
          }
        }
      }
    }
  ]
}

When you add an alias to an index you can attach a filter to it. Searches that go through the alias only see the documents that match the filter.
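The effect of a filtered alias can be sketched in a few lines of Python. This is a simulation of the behaviour, not Elasticsearch code; the second document and the helper names are hypothetical.

```python
def range_gte(field: str, bound):
    """Return a predicate equivalent to {"range": {field: {"gte": bound}}}."""
    return lambda doc: field in doc and doc[field] >= bound


def search_through_alias(docs: list, alias_filter) -> list:
    """A filtered alias behaves like a search with the filter always applied."""
    return [doc for doc in docs if alias_filter(doc)]


movies = [
    {"title": "Old Dogs", "year": 2009},
    {"title": "Unknown Year", "year": 0},  # hypothetical doc that the filter hides
]

myindex2 = range_gte("year", 1)
print(search_through_alias(movies, myindex2))
```

Searching movies directly would return both documents, while searching through myindex2 returns only the first.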

  • Compound queries
  1. Multiply the query score by a value derived from a field to boost it
POST movies/_search
{
  "explain": true,
  "size": 2,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Old",
          "fields": ["title","genre.keyword"]
        }
      },
      "field_value_factor": {
        "field":"year",
        "modifier": "log2p",    // applies a function to the field value: _score * log10(2 + factor * year)
        "factor": 0.01          // scales the year down so the boost stays small
      }
    }
  }
}

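The query above matches "Old" against title and genre.keyword, scores the matching documents for relevance, multiplies each score by a function of the year field (log2p of year * factor, not the raw year), and sorts by the result.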

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 47,
      "relation" : "eq"
    },
    "max_score" : 9.856819,
    "hits" : [
      {
        "_shard" : "[movies][0]",
        "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "72696",
        "_score" : 9.856819,
        "_source" : {
          "year" : 2009,
          "genre" : [
            "Comedy"
          ],
          "@version" : "1",
          "id" : "72696",
          "title" : "Old Dogs"
        },
        "_explanation" : {
          "value" : 9.856819,
          "description" : "function score, product of:",
          "details" : [
            {
              "value" : 7.3328753,
              "description" : "max of:",
              "details" : [
                {
                  "value" : 7.3328753,
                  "description" : "weight(title:old in 14201) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 7.3328753,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.3534727,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 47,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 27287,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.5246147,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.9695094,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3441957,
              "description" : "min of:",
              "details" : [
                {
                  "value" : 1.3441957,
                  "description" : "field value function: log2p(doc['year'].value * factor=0.01)",
                  "details" : [ ]
                },
                {
                  "value" : 3.4028235E38,
                  "description" : "maxBoost",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[movies][0]",
        "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "50259",
        "_score" : 9.852491,
        "_source" : {
          "year" : 2006,
          "genre" : [
            "Drama"
          ],
          "@version" : "1",
          "id" : "50259",
          "title" : "Old Joy"
        },
        "_explanation" : {
          "value" : 9.852491,
          "description" : "function score, product of:",
          "details" : [
            {
              "value" : 7.3328753,
              "description" : "max of:",
              "details" : [
                {
                  "value" : 7.3328753,
                  "description" : "weight(title:old in 11233) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 7.3328753,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.3534727,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 47,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 27287,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.5246147,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.9695094,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3436055,
              "description" : "min of:",
              "details" : [
                {
                  "value" : 1.3436055,
                  "description" : "field value function: log2p(doc['year'].value * factor=0.01)",
                  "details" : [ ]
                },
                {
                  "value" : 3.4028235E38,
                  "description" : "maxBoost",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

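Looking at the score explanation, the final score is _score * log10(2 + factor * year). For "Old Dogs" that is 7.3328753 * 1.3441957 = 9.856819.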
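The numbers in the explanation can be reproduced by hand. The sketch below recomputes the "Old Dogs" score from the formulas printed in the explain output (idf, tf, and log2p); the function names are mine, and small differences in the last digits are expected because Elasticsearch works with 32-bit floats.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural log."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def log2p(value: float, factor: float) -> float:
    """field_value_factor with modifier log2p: log10(2 + factor * value)."""
    return math.log10(2 + factor * value)


# All inputs are taken from the explain output for "Old Dogs" (year 2009).
boost = 2.2
idf = bm25_idf(n=47, N=27287)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=2.0, avgdl=2.9695094)
query_score = boost * idf * tf
function_value = log2p(2009, factor=0.01)
final_score = query_score * function_value  # default boost_mode is multiply

print(f"idf={idf:.7f} tf={tf:.7f}")
print(f"query_score={query_score:.7f} function_value={function_value:.7f}")
print(f"final_score={final_score:.6f}")
```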

Update, Nov 4

  • Boosting the score: boost_mode
POST movies/_search
{
  "explain": true, 
  "size": 2, 
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Old",
          "fields": ["title","genre.keyword"]
        }
      },
      "field_value_factor": {
        "field": "year"
      }, 
      "boost_mode": "sum"
    }
  }
}

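boost_mode controls how the query score and the function value are combined. It supports the following modes:

  • multiply: multiply the query's relevance score by the value from field_value_factor and sort by the product (the default)
  • sum: add the relevance score and the field value factor
  • avg: take the average of the relevance score and the field value factor
  • min/max: take the smaller/larger of the relevance score and the field value factor
  • replace: ignore the relevance score and use the field value factor alone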
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 47,
      "relation" : "eq"
    },
    "max_score" : 2020.3269,
    "hits" : [
      {
        "_shard" : "[movies][0]",
        "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "114250",
        "_score" : 2020.3269,
        "_source" : {
          "year" : 2014,
          "genre" : [
            "Comedy",
            "Drama"
          ],
          "@version" : "1",
          "id" : "114250",
          "title" : "My Old Lady"
        },
        "_explanation" : {
          "value" : 2020.3269,
          "description" : "sum of",
          "details" : [
            {
              "value" : 6.3268967,
              "description" : "max of:",
              "details" : [
                {
                  "value" : 6.3268967,
                  "description" : "weight(title:old in 23775) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 6.3268967,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.3534727,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 47,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 27287,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.4526441,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 3.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.9695094,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 2014.0,
              "description" : "min of:",
              "details" : [
                {
                  "value" : 2014.0,
                  "description" : "field value function: none(doc['year'].value * factor=1.0)",
                  "details" : [ ]
                },
                {
                  "value" : 3.4028235E38,
                  "description" : "maxBoost",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

From the explanation, the relevance score is 6.3268967 and the field value factor is 2014, so with boost_mode sum the total is 2020.3269.
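A minimal sketch of how the boost_mode values combine the two numbers. This is my own restatement of the documented modes, not Elasticsearch code, checked against the "My Old Lady" hit above.

```python
def combine(query_score: float, function_value: float, boost_mode: str = "multiply") -> float:
    """Combine the relevance score and the function value the way boost_mode describes."""
    if boost_mode == "multiply":
        return query_score * function_value
    if boost_mode == "sum":
        return query_score + function_value
    if boost_mode == "avg":
        return (query_score + function_value) / 2
    if boost_mode == "min":
        return min(query_score, function_value)
    if boost_mode == "max":
        return max(query_score, function_value)
    if boost_mode == "replace":
        return function_value
    raise ValueError(f"unknown boost_mode: {boost_mode}")


# "My Old Lady": relevance 6.3268967, year 2014, no modifier and factor 1.0.
for mode in ("multiply", "sum", "avg", "min", "max", "replace"):
    print(mode, combine(6.3268967, 2014.0, mode))
```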

  • max_boost: the upper limit on the boost. It caps the value the function can contribute, so the field value factor never exceeds this limit.
POST movies/_search
{
  "explain": true, 
  "size": 1, 
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Old",
          "fields": ["title","genre.keyword"]
        }
      },
      "field_value_factor": {
        "field": "year"
      }, 
      "boost_mode": "sum",
      "max_boost": 10
    }
  }
}

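In the query above, the field_value_factor value is capped at 10 (max_boost). Because boost_mode is sum, the final score is the query's relevance score plus that capped value, so at most the relevance score + 10.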
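The cap also changes the ranking. The sketch below applies min(field value, max_boost), which is what the "min of:" node in the explain output shows, to two hits from the earlier responses. The helper names are mine; the relevance scores and years come from the explain output.

```python
FLOAT_MAX = 3.4028235e38  # default maxBoost, as printed in the explain output


def sum_score(query_score: float, field_value: float, max_boost: float = FLOAT_MAX) -> float:
    """boost_mode=sum with the function value capped at max_boost."""
    return query_score + min(field_value, max_boost)


# (title, relevance score, year), copied from the responses above.
docs = [
    ("Old Dogs", 7.3328753, 2009.0),
    ("My Old Lady", 6.3268967, 2014.0),
]


def ranking(max_boost: float = FLOAT_MAX) -> list:
    scored = [(title, sum_score(score, year, max_boost)) for title, score, year in docs]
    return [title for title, _ in sorted(scored, key=lambda item: item[1], reverse=True)]


print(ranking())              # year dominates: the newer movie wins
print(ranking(max_boost=10))  # every year is capped at 10: relevance decides
```

With the cap at 10 every document receives the same +10, so the order falls back to the plain relevance score.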

  • random_score: consistent random scoring
GET movies/_search
{
  "explain": true, 
  "size": 1, 
  "query": {
    "function_score": {
      "query": {
        "term": {
          "title": {
            "value": "love"
          }
        }
      },
      "random_score": {
        "seed": 314159265359,
        "field":"_seq_no"
      }
    }
  }
}

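From 7.0 on, random_score needs a field when a seed is set, otherwise it reports an error. The function derives the random value from the seed together with the value of that field (here _seq_no), so as long as the seed stays the same, each document keeps the same random score across requests.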
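The idea of a seeded, repeatable "random" score can be sketched as a hash of the seed and the per-document field value. The hash used here (SHA-256) is my choice for illustration only; it is not the function Elasticsearch uses internally.

```python
import hashlib


def consistent_random_score(seed: int, field_value: int) -> float:
    """Map (seed, field value) to a repeatable pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{field_value}".encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2 ** 48


seed = 314159265359
for seq_no in (0, 1, 2):
    first = consistent_random_score(seed, seq_no)
    again = consistent_random_score(seed, seq_no)
    print(seq_no, first, first == again)
```

Changing the seed reshuffles the documents, while reusing it reproduces the same order, which is useful for giving each user a stable random ordering.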

  • suggest: the suggester module. It breaks the input text into tokens and looks up similar terms in the index's term dictionary.
GET movies/_search
{
  "size": 1, 
  "query": {
    "term": {
      "title": {
        "value": "lover"
      }
    }
  },
  "suggest": {
    "my_suggest": {
      "text": "lover",
      "term": {
        "field": "title",
        "suggest_mode":"popular"
      }
    }
  }
}

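suggest_mode has a few commonly used values:

  • missing: only suggest for terms that are not in the index; if the term (here lover) already exists, no suggestions are returned
  • popular: only suggest terms that occur more frequently than the original term
  • always: suggest regardless of whether the term exists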
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 12,
      "relation" : "eq"
    },
    "max_score" : 8.87367,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "2586",
        "_score" : 8.87367,
        "_source" : {
          "year" : 1999,
          "genre" : [
            "Comedy",
            "Crime",
            "Thriller"
          ],
          "@version" : "1",
          "id" : "2586",
          "title" : "Goodbye Lover"
        }
      }
    ]
  },
  "suggest" : {
    "my_suggest" : [
      {
        "text" : "lover",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lovers",
            "score" : 0.8,
            "freq" : 25
          },
          {
            "text" : "loved",
            "score" : 0.8,
            "freq" : 14
          },
          {
            "text" : "love",
            "score" : 0.75,
            "freq" : 355
          },
          {
            "text" : "lives",
            "score" : 0.6,
            "freq" : 40
          },
          {
            "text" : "live",
            "score" : 0.5,
            "freq" : 72
          }
        ]
      }
    ]
  }
}

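The suggestions come back in an array under the name we chose (my_suggest), each with a score and a frequency, so you can pick whichever you need.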
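A simplified sketch of the idea: compute the edit distance between the input and each term in the dictionary, keep the close ones, and rank them. The scoring formula 1 - distance / min(len) is my assumption; it happens to reproduce the scores in the response above but is not taken from the Elasticsearch source. The frequencies for the five suggestions come from the response; "old" (47) comes from the earlier explain output.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def suggest(text: str, term_freqs: dict, max_edits: int = 2) -> list:
    """Return dictionary terms within max_edits of text, best first."""
    options = []
    for term, freq in term_freqs.items():
        if term == text:
            continue
        distance = levenshtein(text, term)
        if distance <= max_edits:
            score = 1 - distance / min(len(text), len(term))  # assumed formula
            options.append({"text": term, "score": round(score, 2), "freq": freq})
    return sorted(options, key=lambda o: (-o["score"], -o["freq"]))


title_terms = {"lovers": 25, "loved": 14, "love": 355, "lives": 40, "live": 72, "old": 47}
for option in suggest("lover", title_terms):
    print(option)
```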

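A quick detour for a problem I just ran into: ES in production returned an error for a query reaching past 10,000 results.

  • First, the setting involved is index.max_result_window. It can be applied to all indices or to a single index, and it defaults to 10,000. The limit applies to from + size.
  • What triggered the error? A script that fetches data in a while loop, 20 documents per request, with no exit condition. Normally this script never causes an ES error, because the data volume is far lower than during Double Eleven. With the big promotion these past few days the volume kept climbing, until the query went past the configured limit.
  • How do we fix it? There are a few options. First, since this is a script and not a real-time front-end query, some latency is acceptable, so we can use the ES scroll API. Scroll is not meant for real-time requests: it issues a series of requests against a snapshot of the index, tracked by a scroll_id, and we specify how long that search context is kept alive between requests.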
curl -XGET 'localhost:9200/index/type/_search?scroll=1m' -H 'Content-Type: application/json' -d '
{
    "query": {
        "match_phrase" : {
            "title" : "elasticsearch"
        }
    }
}'

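We set scroll=1m, meaning the search context stays alive for at most 1 minute until the next request; after that it expires. Besides the data, the first response also returns a scroll_id to use in the next request, which looks like this: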

curl -XGET 'localhost:9200/_search/scroll' -H 'Content-Type: application/json' -d '
{
    "scroll" : "1m",
    "scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}'

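Keep sending scroll requests until a page comes back with no hits or the search context times out; at that point stop. Using scroll on its own has a cost, though: the results are sorted, and in the worst case that is a global sort.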

  • Sometimes, with deep pagination, we only want the data and do not care about the order. On old versions we could add the scan search type:
GET /old_index/_search?search_type=scan&scroll=1m 
{
"query": { "match_all": {}},
"size": 1000
}

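As above, adding search_type=scan disables sorting and so avoids the global sort. Note that scan was deprecated in Elasticsearch 2.1 and removed in 5.0, so current versions reject it. The alternative is to sort by _doc, which is the most efficient order to scroll in. The results come back unsorted, which suits the case where you simply want to fetch all the data. For example: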

GET /old_index/_search?scroll=1m
{
  "query": { "match_all": {} },
  "size": 1000,
  "sort": [
    "_doc"
  ]
}
  • Another optimization: when using a scroll cursor, clear the scroll as soon as you are done with it, which reduces the load on ES.
DELETE /_search/scroll
{
    "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAdsMqFmVkZTBJalJWUmp5UmI3V0FYc2lQbVEAAAAAAHbDKRZlZGUwSWpSVlJqeVJiN1dBWHNpUG1RAAAAAABpX2sWclBEekhiRVpSRktHWXFudnVaQ3dIQQAAAAAAaV9qFnJQRHpIYkVaUkZLR1lxbnZ1WkN3SEEAAAAAAGlfaRZyUER6SGJFWlJGS0dZcW52dVpDd0hB"
}
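Putting the pieces together, a scroll loop needs an explicit exit condition (the thing the original script lacked) and should clear the scroll when it finishes. Below is a minimal sketch. The three callables stand for the initial search, the follow-up scroll request, and the DELETE call. FakeScrollBackend is a hypothetical in-memory stand-in so the loop can run without a cluster; it only mimics the _scroll_id and hits.hits fields of a real response.

```python
from typing import Callable, Iterator


def scroll_all(start: Callable[[], dict],
               next_page: Callable[[str], dict],
               clear: Callable[[str], None],
               max_pages: int = 10_000) -> Iterator[dict]:
    """Yield every hit; stop on the first empty page or after max_pages."""
    scroll_id = None
    try:
        page = start()
        for _ in range(max_pages):        # hard upper bound, never loop forever
            scroll_id = page["_scroll_id"]
            hits = page["hits"]["hits"]
            if not hits:                  # explicit exit condition
                break
            yield from hits
            page = next_page(scroll_id)
    finally:
        if scroll_id is not None:
            clear(scroll_id)              # release the search context


class FakeScrollBackend:
    """Hypothetical in-memory stand-in for Elasticsearch, for illustration only."""

    def __init__(self, docs: list, size: int):
        self.docs, self.size, self.offset, self.cleared = docs, size, 0, []

    def _page(self) -> dict:
        hits = self.docs[self.offset:self.offset + self.size]
        self.offset += self.size
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    def start(self) -> dict:
        return self._page()

    def next_page(self, scroll_id: str) -> dict:
        return self._page()

    def clear(self, scroll_id: str) -> None:
        self.cleared.append(scroll_id)


backend = FakeScrollBackend([{"_id": str(i)} for i in range(45)], size=20)
fetched = list(scroll_all(backend.start, backend.next_page, backend.clear))
print(len(fetched), backend.cleared)
```

Against a real cluster, the three callables would wrap the search with ?scroll=1m, the _search/scroll request, and the DELETE shown above.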

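Now back to our ES study. The above was just a small detour; once the promotion is over, I'll come back and work on optimizations for the problem that came up today.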