It's November!
- Custom analyzers
PUT user
{
"settings": {
"analysis": {
"analyzer": {
"pinyin_analyzer":{
"tokenizer":"my_pinyin"
}
},
"tokenizer": {
"my_pinyin":{
"type":"pinyin",
"keep_full_pinyin":true,
"keep_original":true,
"limit_first_letter_length":16,
"lowercase":true,
"remove_duplicated_term":true,
"keep_separate_first_letter":false
}
}
}
},
"mappings": {
"properties": {
"name":{
"type": "keyword",
"fields": {
"my_pinyin":{
"type":"text",
"analyzer":"pinyin_analyzer"
}
}
}
}
}
}
We first create an index with the settings above: under settings we define a custom analyzer named pinyin_analyzer, backed by a custom pinyin tokenizer, and configure the tokenizer's options. The one that matters most is arguably keep_full_pinyin: true, which converts the Chinese text to its full pinyin. Next, let's run the analyzer.
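The token output below was produced by an _analyze call; a sketch of the request (the input text 刘德华 is inferred from the tokens shown):
GET user/_analyze
{
"analyzer": "pinyin_analyzer",
"text": "刘德华"
}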
{
"tokens" : [
{
"token" : "liu",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "刘德华",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "ldh",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 0
},
{
"token" : "de",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 1
},
{
"token" : "hua",
"start_offset" : 0,
"end_offset" : 0,
"type" : "word",
"position" : 2
}
]
}
As you can see, the pinyin analyzer has tokenized 刘德华 in quite some detail; a term query against the inverted index matches these tokens directly. Pretty handy.
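To search against the pinyin subfield, a query along these lines should work (a sketch; ldh is the first-letter form seen in the tokens above):
GET user/_search
{
"query": {
"match": {
"name.my_pinyin": "ldh"
}
}
}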
- alias: index aliases
POST _aliases
{
"actions": [
{
"add": {
"index": "movies",
"alias": "myindex2",
"filter": {
"range": {
"year": {
"gte": 1
}
}
}
}
}
]
}
When adding an alias to an index you can attach a filter; queries through the new alias will only see the docs that match the filter.
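For instance, searching through the alias behaves just like searching the index, except that only the filtered docs are visible (a minimal sketch):
GET myindex2/_search
{
"query": {
"match_all": {}
}
}
Only movies whose year is >= 1 will come back.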
- Compound queries
- Boost relevance: multiply the query score by a field's value
POST movies/_search
{
"explain": true,
"size": 2,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "Old",
"fields": ["title","genre.keyword"]
}
},
"field_value_factor": {
"field":"year",
"modifier": "log2p", //appends a function to the score: _score * log10(2 + factor * year)
"factor": 0.01 //factor scales the field value so the boost stays tame
}
}
}
}
The query above searches for documents whose title or genre contains old, computes a relevance score, multiplies that score by log10(2 + factor * year), and sorts by the result.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 47,
"relation" : "eq"
},
"max_score" : 9.856819,
"hits" : [
{
"_shard" : "[movies][0]",
"_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
"_index" : "movies",
"_type" : "_doc",
"_id" : "72696",
"_score" : 9.856819,
"_source" : {
"year" : 2009,
"genre" : [
"Comedy"
],
"@version" : "1",
"id" : "72696",
"title" : "Old Dogs"
},
"_explanation" : {
"value" : 9.856819,
"description" : "function score, product of:",
"details" : [
{
"value" : 7.3328753,
"description" : "max of:",
"details" : [
{
"value" : 7.3328753,
"description" : "weight(title:old in 14201) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 7.3328753,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 6.3534727,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 47,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 27287,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.5246147,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.9695094,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 1.3441957,
"description" : "min of:",
"details" : [
{
"value" : 1.3441957,
"description" : "field value function: log2p(doc['year'].value * factor=0.01)",
"details" : [ ]
},
{
"value" : 3.4028235E38,
"description" : "maxBoost",
"details" : [ ]
}
]
}
]
}
},
{
"_shard" : "[movies][0]",
"_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
"_index" : "movies",
"_type" : "_doc",
"_id" : "50259",
"_score" : 9.852491,
"_source" : {
"year" : 2006,
"genre" : [
"Drama"
],
"@version" : "1",
"id" : "50259",
"title" : "Old Joy"
},
"_explanation" : {
"value" : 9.852491,
"description" : "function score, product of:",
"details" : [
{
"value" : 7.3328753,
"description" : "max of:",
"details" : [
{
"value" : 7.3328753,
"description" : "weight(title:old in 11233) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 7.3328753,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 6.3534727,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 47,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 27287,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.5246147,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.9695094,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 1.3436055,
"description" : "min of:",
"details" : [
{
"value" : 1.3436055,
"description" : "field value function: log2p(doc['year'].value * factor=0.01)",
"details" : [ ]
},
{
"value" : 3.4028235E38,
"description" : "maxBoost",
"details" : [ ]
}
]
}
]
}
}
]
}
}
Looking at the explain details, the final score is _score * log10(2 + factor * year); for the top hit that is 7.3328753 * log10(2 + 0.01 * 2009) = 7.3328753 * 1.3441957 ≈ 9.856819.
Nov 4 update
- Boosting the score: boost_mode
POST movies/_search
{
"explain": true,
"size": 2,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "Old",
"fields": ["title","genre.keyword"]
}
},
"field_value_factor": {
"field": "year"
},
"boost_mode": "sum"
}
}
}
boost_mode supports several modes:
- multiply: multiply the value from field_value_factor with the query's relevance score, then sort
- sum: the sum of the score and the field value factor
- min/max: take the min/max of the score and the field value factor as the relevance score
- replace: replace the score with the field value factor
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 47,
"relation" : "eq"
},
"max_score" : 2020.3269,
"hits" : [
{
"_shard" : "[movies][0]",
"_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
"_index" : "movies",
"_type" : "_doc",
"_id" : "114250",
"_score" : 2020.3269,
"_source" : {
"year" : 2014,
"genre" : [
"Comedy",
"Drama"
],
"@version" : "1",
"id" : "114250",
"title" : "My Old Lady"
},
"_explanation" : {
"value" : 2020.3269,
"description" : "sum of",
"details" : [
{
"value" : 6.3268967,
"description" : "max of:",
"details" : [
{
"value" : 6.3268967,
"description" : "weight(title:old in 23775) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 6.3268967,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 6.3534727,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 47,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 27287,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.4526441,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.9695094,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 2014.0,
"description" : "min of:",
"details" : [
{
"value" : 2014.0,
"description" : "field value function: none(doc['year'].value * factor=1.0)",
"details" : [ ]
},
{
"value" : 3.4028235E38,
"description" : "maxBoost",
"details" : [ ]
}
]
}
]
}
}
]
}
}
From the explain output, the relevance score is 6.3268967 and the field value factor is 2014, so the total is 6.3268967 + 2014 ≈ 2020.3269.
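The other modes behave analogously; for example, a sketch with replace, which discards the relevance score and ranks purely by the field value:
POST movies/_search
{
"size": 2,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "Old",
"fields": ["title","genre.keyword"]
}
},
"field_value_factor": {
"field": "year"
},
"boost_mode": "replace"
}
}
}
Here every hit's _score would simply be its year.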
- max_boost: an upper bound on the boost; this parameter caps the field value factor, clamping its contribution to this limit
POST movies/_search
{
"explain": true,
"size": 1,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "Old",
"fields": ["title","genre.keyword"]
}
},
"field_value_factor": {
"field": "year"
},
"boost_mode": "sum",
"max_boost": 10
}
}
}
In the query above, the field_value_factor value is capped at 10 (max_boost); since boost_mode is sum, the final score is the query's relevance score plus this capped field value factor.
- random_score: a consistent random function
GET movies/_search
{
"explain": true,
"size": 1,
"query": {
"function_score": {
"query": {
"term": {
"title": {
"value": "love"
}
}
},
"random_score": {
"seed": 314159265359,
"field":"_seq_no"
}
}
}
}
Since 7.0, random_score requires a field to be set, otherwise it errors. The consistent random function randomizes based on the seed: the same seed always yields the same random ordering.
- suggest: the suggester module; it breaks the query into tokens and looks up similar terms in the index's term dictionary to return as suggestions
GET movies/_search
{
"size": 1,
"query": {
"term": {
"title": {
"value": "lover"
}
}
},
"suggest": {
"my_suggest": {
"text": "lover",
"term": {
"field": "title",
"suggest_mode":"popular"
}
}
}
}
Common values of suggest_mode include:
- missing: if the term (here, lover) already exists in the index, provide no suggestions
- popular: only suggest terms that occur more frequently than the input
- always: provide suggestions whether or not the term exists
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 12,
"relation" : "eq"
},
"max_score" : 8.87367,
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "2586",
"_score" : 8.87367,
"_source" : {
"year" : 1999,
"genre" : [
"Comedy",
"Crime",
"Thriller"
],
"@version" : "1",
"id" : "2586",
"title" : "Goodbye Lover"
}
}
]
},
"suggest" : {
"my_suggest" : [
{
"text" : "lover",
"offset" : 0,
"length" : 5,
"options" : [
{
"text" : "lovers",
"score" : 0.8,
"freq" : 25
},
{
"text" : "loved",
"score" : 0.8,
"freq" : 14
},
{
"text" : "love",
"score" : 0.75,
"freq" : 355
},
{
"text" : "lives",
"score" : 0.6,
"freq" : 40
},
{
"text" : "live",
"score" : 0.5,
"freq" : 72
}
]
}
]
}
}
The suggestions come back in an array under the name we gave the suggester, each with a score and a frequency; pick whichever fits.
A quick interlude on a problem I just hit: a production ES query errored out for paging past 10,000 documents.
- First, let's look at the setting index.max_result_window: it can be set globally or per index, and defaults to 10,000.
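If you really need deeper pages, the window can be raised per index, at the cost of more heap on deep requests; a sketch, assuming an index named movies:
PUT movies/_settings
{
"index.max_result_window": 20000
}
Raising the limit only postpones the problem, though; for bulk reads a scroll query is the better fix.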
- What triggered the error? A script fetching results in a while loop, 20 at a time, with no exit condition. Normally it stays under the limit, but data volume has kept climbing during the Double 11 sale these past few days, so it finally exceeded the configured window.
- How to fix it? A few ideas. First, since this is a script rather than a real-time front-end query, some latency is acceptable, so we can use an ES scroll query. Scroll is not meant for real-time search: it issues multiple requests against a snapshot, carrying a scroll_id between them, and we can specify how long to keep the scroll context alive.
curl -XGET 'localhost:9200/index/type/_search?scroll=1m' -d '
{
"query": {
"match_phrase" : {
"title" : "elasticsearch"
}
}
}'
We specified scroll=1m, i.e. at most one minute between consecutive requests before the context is dropped. Besides the data, the first request returns a scroll_id to use in the next request, which then looks like this:
curl -XGET 'localhost:9200/_search/scroll' -d'
{
"scroll" : "1m",
"scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}'
Scroll keeps walking through the result set until it has returned the matching data, runs out of results, or the context times out. But scroll alone is not free: it sorts the results, in the worst case globally.
- So sometimes, when paging deeply, we only want the data and not the sorting; we can add the scan search type
GET /old_index/_search?search_type=scan&scroll=1m
{
"query": { "match_all": {}},
"size": 1000
}
As above, just adding search_type=scan disables sorting and avoids the global sort (note that search_type=scan was removed in later ES versions). Another option is to sort by _doc, which executes fastest; the results come back in no meaningful order, which suits the fetch-everything use case. For example:
GET /old_index/_search?scroll=1m
{
"query": { "match_all": {}},
"size": 1000,
"sort": [
"_doc"
]
}
- One more optimization: when you are done with a scroll query, clear the scroll explicitly whenever possible to lighten the load on ES
DELETE 127.0.0.1:9200/_search/scroll
{
"scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAdsMqFmVkZTBJalJWUmp5UmI3V0FYc2lQbVEAAAAAAHbDKRZlZGUwSWpSVlJqeVJiN1dBWHNpUG1RAAAAAABpX2sWclBEekhiRVpSRktHWXFudnVaQ3dIQQAAAAAAaV9qFnJQRHpIYkVaUkZLR1lxbnZ1WkN3SEEAAAAAAGlfaRZyUER6SGJFWlJGS0dZcW52dVpDd0hB"
}