Elasticsearch scroll

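1. Background

I recently had an Elasticsearch synchronization requirement and looked into how ES handles pagination. This post covers pagination with the scroll API.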

2. Translation of the official documentation

2.1 Introduction

While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.



Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

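Client support for scrolling and reindexing: some official clients ship helpers for this, for example the Perl client's Search::Elasticsearch::Client::5_0::Bulk and Search::Elasticsearch::Client::5_0::Scroll modules.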

Note: The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

My reading of this: a scroll works on a snapshot. Elasticsearch keeps the state needed for that snapshot, and changes made after the initial request are not visible to the scroll, so the results are not real-time.

2.2 Usage

In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive (see Keeping the search context alive), e.g. ?scroll=1m.

2.2.1 Step 1

Request

POST /bank/account/_search?scroll=1m
{
"query": {
"match_all": {}
},
"size":1
}

Response

{
"_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAClFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAAphYxZHI1SUVNbFFGZXF3czFRbkVsWENBAAAAAAAAAKcWMWRyNUlFTWxRRmVxd3MxUW5FbFhDQQAAAAAAAACoFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAAqRYxZHI1SUVNbFFGZXF3czFRbkVsWENB",
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : 1.0,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "44",
"_score" : 1.0,
"_source" : {
"account_number" : 44,
"balance" : 34487,
"firstname" : "Aurelia",
"lastname" : "Harding",
"age" : 37,
"gender" : "M",
"address" : "502 Baycliff Terrace",
"employer" : "Orbalix",
"email" : "aureliaharding@orbalix.com",
"city" : "Yardville",
"state" : "DE"
}
}
]
}
}

2.2.2 Step 2: _scroll_id

The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.

Request

1> GET or POST can be used, and the URL should not include the index name; this is specified in the original search request instead.

2> The scroll parameter tells Elasticsearch to keep the search context open for another 1m.

3> Pass the _scroll_id from the most recent response. In this example the ID returned in the response below is identical to the one that was sent, but that is not guaranteed.

POST /_search/scroll //1
{
"scroll" : "1m", //2
"scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAADUFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAA1RYxZHI1SUVNbFFGZXF3czFRbkVsWENBAAAAAAAAANYWMWRyNUlFTWxRRmVxd3MxUW5FbFhDQQAAAAAAAADXFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAA2BYxZHI1SUVNbFFGZXF3czFRbkVsWENB"
}

Response

{
"_scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAADUFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAA1RYxZHI1SUVNbFFGZXF3czFRbkVsWENBAAAAAAAAANYWMWRyNUlFTWxRRmVxd3MxUW5FbFhDQQAAAAAAAADXFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAA2BYxZHI1SUVNbFFGZXF3czFRbkVsWENB",
"took" : 6,
"timed_out" : false,
"terminated_early" : true,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : 1.0,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "99",
"_score" : 1.0,
"_source" : {
"account_number" : 99,
"balance" : 47159,
"firstname" : "Ratliff",
"lastname" : "Heath",
"age" : 39,
"gender" : "F",
"address" : "806 Rockwell Place",
"employer" : "Zappix",
"email" : "ratliffheath@zappix.com",
"city" : "Shaft",
"state" : "ND"
}
}
]
}
}

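The request for the next page looks like this.

Again, send the most recently received _scroll_id (here it has not changed). The ID is only usable while the search context is alive: the scroll parameter (1m here) sets how long the context is kept alive after each request, and once that time passes without a new request the context expires.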

POST /_search/scroll 
{
"scroll" : "1m",
"scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAADUFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAA1RYxZHI1SUVNbFFGZXF3czFRbkVsWENBAAAAAAAAANYWMWRyNUlFTWxRRmVxd3MxUW5FbFhDQQAAAAAAAADXFjFkcjVJRU1sUUZlcXdzMVFuRWxYQ0EAAAAAAAAA2BYxZHI1SUVNbFFGZXF3czFRbkVsWENB"
}
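
As a small illustration of the request above, the follow-up call can be built from whatever response was received last. This is only a sketch: the helper name next_scroll_request and the scroll ID "abc123" are made up for the example, and the function only builds the request parts without sending anything.

```python
from typing import Any, Dict, Tuple


def next_scroll_request(previous_response: Dict[str, Any],
                        keep_alive: str = "1m") -> Tuple[str, str, Dict[str, str]]:
    """Build (method, path, body) for the next scroll call.

    The path carries no index name, and the scroll ID is taken from the
    most recently received response.
    """
    scroll_id = previous_response["_scroll_id"]
    body = {"scroll": keep_alive, "scroll_id": scroll_id}
    return "POST", "/_search/scroll", body


# Abbreviated stand-in for a real response; "abc123" is a made-up scroll ID.
previous = {"_scroll_id": "abc123", "hits": {"total": 1000, "hits": []}}
method, path, body = next_scroll_request(previous)
print(method, path, body)
```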

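If the search context has expired, the request fails with the error below. To avoid this, make sure the time between receiving one batch and requesting the next (including your own processing time) stays within the keep-alive window, one minute in this example.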

{
"error" : {
"root_cause" : [
{
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [222]"
},
{
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [223]"
},
{
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [224]"
},
{
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [225]"
},
{
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [226]"
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : -1,
"index" : null,
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [222]"
}
},
{
"shard" : -1,
"index" : null,
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [223]"
}
},
{
"shard" : -1,
"index" : null,
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [224]"
}
},
{
"shard" : -1,
"index" : null,
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [225]"
}
},
{
"shard" : -1,
"index" : null,
"reason" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [226]"
}
}
],
"caused_by" : {
"type" : "search_context_missing_exception",
"reason" : "No search context found for id [226]"
}
},
"status" : 404
}
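
A client may want to tell an expired scroll apart from other failures. The following is a minimal sketch that inspects an already parsed error body of the shape shown above; the function name is_scroll_expired is an assumption for this example, and the sample dictionary is a trimmed copy of the error (one root cause instead of five).

```python
from typing import Any, Dict


def is_scroll_expired(status: int, body: Dict[str, Any]) -> bool:
    """Return True if the error response says the search context is gone."""
    if status != 404:
        return False
    error = body.get("error")
    if not isinstance(error, dict):
        return False
    causes = error.get("root_cause", [])
    return any(cause.get("type") == "search_context_missing_exception"
               for cause in causes)


# Trimmed copy of the error response, for demonstration only.
expired = {
    "error": {
        "root_cause": [
            {"type": "search_context_missing_exception",
             "reason": "No search context found for id [222]"}
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
    },
    "status": 404,
}
print(is_scroll_expired(expired["status"], expired))
```

When this returns True, the scroll cannot be resumed; the usual remedy is to start a new search with a fresh scroll parameter.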

2.2.3 size

The size parameter allows you to configure the maximum number of hits to be returned with each batch of results. Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.
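
The "keep calling until the hits array is empty" rule can be sketched as a generator. This is an illustration under stated assumptions, not an official client: send is a hypothetical transport callable taking (method, path, body) and returning the parsed JSON response, and make_fake_send is an in-memory stand-in so the example runs without a cluster. The search path omits the document type used in the request earlier in this post.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical transport signature: (method, path, body) -> parsed JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_hits(send: Send, index: str, query: Dict[str, Any],
                size: int = 100, keep_alive: str = "1m") -> Iterator[Dict[str, Any]]:
    """Yield every hit of a scrolled search, one batch at a time."""
    response = send("POST", f"/{index}/_search?scroll={keep_alive}",
                    {"query": query, "size": size})
    while response["hits"]["hits"]:  # an empty batch ends the scroll
        yield from response["hits"]["hits"]
        response = send("POST", "/_search/scroll",
                        {"scroll": keep_alive,
                         # always send the most recently received ID
                         "scroll_id": response["_scroll_id"]})


def make_fake_send(docs: List[Dict[str, Any]]) -> Send:
    """In-memory stand-in for Elasticsearch, for demonstration only."""
    state = {"pos": 0, "size": 1}

    def send(method: str, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        if "scroll_id" not in body:  # initial search request
            state["pos"], state["size"] = 0, body["size"]
        batch = docs[state["pos"]:state["pos"] + state["size"]]
        state["pos"] += len(batch)
        return {"_scroll_id": "fake-id",
                "hits": {"total": len(docs), "hits": batch}}

    return send


docs = [{"_id": str(i)} for i in range(5)]
ids = [hit["_id"] for hit in
       scroll_hits(make_fake_send(docs), "bank", {"match_all": {}}, size=2)]
print(ids)
```

With size=2 and five documents, the fake transport returns batches of 2, 2, 1 and finally 0 hits, at which point the generator stops.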

2.2.4 Important

The initial search request and each subsequent scroll request each return a _scroll_id. While the _scroll_id may change between requests, it doesn’t always change. In any case, only the most recently received _scroll_id should be used.



If the request specifies aggregations, only the initial search response will contain the aggregations results.



Keeping older segments alive means that more disk space and file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See File Descriptors.



Additionally, if a segment contains deleted or updated documents then the search context must keep track of whether each document in the segment was live at the time of the initial search request. Ensure that your nodes have sufficient heap space if you have many open scrolls on an index that is subject to ongoing deletes or updates.



To prevent issues caused by having too many scrolls open, the user is not allowed to open scrolls past a certain limit. By default, the maximum number of open scrolls is 500. This limit can be updated with the search.max_open_scroll_context cluster setting.
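
As a sketch of how that limit might be changed, the helper below builds a cluster update settings request (PUT /_cluster/settings). The function name max_open_scroll_request and the value 1024 are assumptions chosen for illustration; the helper only builds the request and does not contact a cluster.

```python
import json
from typing import Any, Dict, Tuple


def max_open_scroll_request(limit: int,
                            persistent: bool = True) -> Tuple[str, str, Dict[str, Any]]:
    """Build (method, path, body) for changing search.max_open_scroll_context."""
    if limit < 0:
        raise ValueError("limit must not be negative")
    # "persistent" survives a full cluster restart, "transient" does not.
    scope = "persistent" if persistent else "transient"
    body = {scope: {"search.max_open_scroll_context": limit}}
    return "PUT", "/_cluster/settings", body


# 1024 is an arbitrary example value, not a recommendation.
method, path, body = max_open_scroll_request(1024)
print(method, path)
print(json.dumps(body, indent=2))
```

Raising the limit trades safety for capacity, so it is usually worth checking first whether scrolls are being left open unnecessarily.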

2.3 Check how many search contexts are open

The nodes stats API reports how many search contexts are currently open:

GET /_nodes/stats/indices/search

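Each scroll opened on this index increases scroll_current by 5, because a search context is kept per shard and the bank index has 5 primary shards (number_of_shards defaulted to 5 before Elasticsearch 7.0; see the index settings after the stats output).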

{
"_nodes" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"cluster_name" : "elasticsearch",
"nodes" : {
"1dr5IEMlQFeqws1QnElXCA" : {
"timestamp" : 1572621835583,
"name" : "1dr5IEM",
"transport_address" : "10.65.136.12:9300",
"host" : "10.65.136.12",
"ip" : "10.65.136.12:9300",
"roles" : [
"master",
"data",
"ingest"
],
"indices" : {
"search" : {
"open_contexts" : 5,
"query_total" : 586,
"query_time_in_millis" : 571,
"query_current" : 0,
"fetch_total" : 92,
"fetch_time_in_millis" : 92,
"fetch_current" : 0,
"scroll_total" : 100,
"scroll_time_in_millis" : 9001895,
"scroll_current" : 5,
"suggest_total" : 0,
"suggest_time_in_millis" : 0,
"suggest_current" : 0
}
}
}
}
}

{
"settings": {
"index": {
"number_of_shards": "5",
"blocks": {
"read_only_allow_delete": "true"
},
"provided_name": "bank",
"creation_date": "1541430925075",
"number_of_replicas": "1",
"uuid": "vPF31jJwTgONtdumP13YvQ",
"version": {
"created": "6040299",
"upgraded": "6080299"
}
}
},
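
To read the counters from the node stats response programmatically, the per-node values can be summed. This is a minimal sketch assuming the response has already been parsed into a dictionary; the function name open_scroll_summary is made up for the example, and the sample data is a trimmed copy of the stats output shown earlier in this section.

```python
from typing import Any, Dict


def open_scroll_summary(stats: Dict[str, Any]) -> Dict[str, int]:
    """Sum open_contexts and scroll_current over all nodes in the response."""
    totals = {"open_contexts": 0, "scroll_current": 0}
    for node in stats.get("nodes", {}).values():
        search = node.get("indices", {}).get("search", {})
        for key in totals:
            totals[key] += search.get(key, 0)
    return totals


# Trimmed copy of the node stats response, for demonstration only.
stats = {
    "nodes": {
        "1dr5IEMlQFeqws1QnElXCA": {
            "indices": {
                "search": {
                    "open_contexts": 5,
                    "scroll_total": 100,
                    "scroll_current": 5,
                }
            }
        }
    }
}
summary = open_scroll_summary(stats)
print(summary)
```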

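2.4 Clear scroll API

Search contexts are removed automatically when the scroll timeout has been exceeded. However, as discussed in the previous section, keeping scrolls open has a cost, so a scroll should be cleared explicitly as soon as it is no longer needed. Either a single scroll ID or an array of IDs can be passed: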

DELETE /_search/scroll
{
"scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}

DELETE /_search/scroll
{
"scroll_id" : [
"DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==",
"DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB"
]
}

All search contexts can be cleared with the _all parameter:

DELETE /_search/scroll/_all
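
The three variants above (one ID, several IDs, or _all) can be wrapped in one small helper. This is only a sketch: the name clear_scroll_request and the convention that None means "clear everything" are assumptions for this example, and the IDs used are made up.

```python
from typing import Any, Dict, List, Optional, Tuple, Union


def clear_scroll_request(
    scroll_ids: Union[str, List[str], None] = None,
) -> Tuple[str, str, Optional[Dict[str, Any]]]:
    """Build (method, path, body) for the clear scroll API.

    Passing None targets all search contexts via the _all path.
    """
    if scroll_ids is None:
        return "DELETE", "/_search/scroll/_all", None
    if isinstance(scroll_ids, str):
        scroll_ids = [scroll_ids]
    if not scroll_ids:
        raise ValueError("at least one scroll ID is required")
    # A single ID is sent as a string, several IDs as an array.
    value: Union[str, List[str]] = (
        scroll_ids[0] if len(scroll_ids) == 1 else list(scroll_ids)
    )
    return "DELETE", "/_search/scroll", {"scroll_id": value}


# "id-1" and "id-2" are made-up scroll IDs.
print(clear_scroll_request("id-1"))
print(clear_scroll_request(["id-1", "id-2"]))
print(clear_scroll_request())
```

In a processing loop, issuing this request in a finally block is one way to make sure the search context is released even when processing fails midway.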