1. Check health at the cluster, index, and shard level: overall status, node counts, shard statistics, and so on
curl -XGET localhost:9200/_cluster/health
curl -XGET localhost:9200/_cluster/health?level=indices
curl -XGET localhost:9200/_cluster/health?level=shards
curl -XGET localhost:9200/_cluster/health/index_name?level=indices
curl -XGET localhost:9200/_cluster/health/index_name?level=shards
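For scripted checks, the status field can be pulled out of the health response. A minimal sketch; the JSON string below is a hypothetical sample of a _cluster/health response, standing in for the live output of curl -s localhost:9200/_cluster/health:

```shell
# Hypothetical sample of a _cluster/health response (in practice:
#   sample=$(curl -s localhost:9200/_cluster/health) ).
sample='{"cluster_name":"my-cluster","status":"yellow","number_of_nodes":3}'

# Extract the "status" field without depending on jq.
status=$(printf '%s' "$sample" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
echo "cluster status: $status"

# Anything other than green deserves a look at index/shard level health.
if [ "$status" != "green" ]; then
  echo "WARNING: cluster is $status" >&2
fi
```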
2. Query cluster-wide statistics
curl localhost:9200/_cluster/stats
3. List indices in red status
curl -s 'localhost:9200/_cat/indices?v'|grep '^red'
4. List indices in yellow status
curl -s 'localhost:9200/_cat/indices?v'|grep '^yellow'
5. Query all cluster settings, including defaults
curl 'http://127.0.0.1:9200/_cluster/settings?pretty&include_defaults'
6. Query task information on each node
curl -XGET localhost:9200/_cat/tasks?v=true
7. Cluster in red or yellow status; unassigned shards exist in the cluster
If an index has unassigned primary shards, that index is red.
If an index has unassigned replica shards, that index is yellow.
First query why the shards are unassigned. There are many possible reasons; record them first, restore service, and then investigate the root cause in detail.
curl -s -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
## Per-shard detail: curl localhost:9200/_cluster/allocation/explain
Run reroute to retry allocation of the unassigned shards:
curl -XPOST 'http://127.0.0.1:9200/_cluster/reroute?retry_failed=true'
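The triage step above (list unassigned shards, record the reasons) can be scripted; the lines below are a hypothetical sample of _cat/shards output in the h=index,shard,prirep,state,unassigned.reason column format:

```shell
# Hypothetical sample of _cat/shards output (in practice:
#   sample=$(curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason') ).
sample='logs-2022.08.12 0 p UNASSIGNED NODE_LEFT
logs-2022.08.12 0 r UNASSIGNED NODE_LEFT
app-index 1 r UNASSIGNED ALLOCATION_FAILED
app-index 0 p STARTED'

# Count unassigned shards per reason; record this before running reroute.
printf '%s\n' "$sample" | awk '$4 == "UNASSIGNED" { count[$5]++ }
  END { for (r in count) print r, count[r] }'
```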
8. Task backlog in the cluster
Task priorities:
IMMEDIATE > URGENT > HIGH > NORMAL > LOW > LANGUID
Query the backlog details. If the queued tasks all look normal, check the current master node's resource usage and read the master logs.
curl http://127.0.0.1:9200/_cat/pending_tasks
curl http://127.0.0.1:9200/_cluster/pending_tasks    # shows whether a task is executing, how long it has waited in the queue, etc.
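Queue pressure can be summarized per priority; the lines below are a hypothetical sample in the _cat/pending_tasks column format (insertOrder timeInQueue priority source):

```shell
# Hypothetical sample of _cat/pending_tasks output (in practice:
#   sample=$(curl -s localhost:9200/_cat/pending_tasks) ).
sample='1685 855ms HIGH update-mapping [foo][t]
1686 843ms HIGH update-mapping [foo][t]
1687 753ms NORMAL refresh-mapping [foo][[t]]'

# Count queued tasks per priority; a growing URGENT/HIGH backlog points
# at a struggling master node.
printf '%s\n' "$sample" | awk '{ count[$3]++ } END { for (p in count) print p, count[p] }'
```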
9. Data inserts into the cluster fail, and node logs show circuit-breaker related exceptions
Find which node is tripping its circuit breakers:
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"
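A non-zero tripped counter in that response means a breaker has fired; a quick way to scan for it without jq (the JSON string below is a hypothetical, heavily trimmed sample of the _nodes/stats/breaker response shape):

```shell
# Hypothetical trimmed sample of a _nodes/stats/breaker response (in practice:
#   sample=$(curl -s localhost:9200/_nodes/stats/breaker) ).
sample='{"nodes":{"abc123":{"name":"node-1","breakers":{"parent":{"tripped":3},"fielddata":{"tripped":0}}}}}'

# Any "tripped" value greater than 0 means that breaker has fired.
printf '%s' "$sample" | grep -o '"tripped":[0-9]*'
```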
10. Lost shards
Sometimes node failures, disk failures, or an ill-chosen replica count leave some shards lost beyond recovery, which puts the whole index in red status. To preserve the data in the remaining shards and bring the index back to green, we can allocate an empty primary for each lost shard.
curl -XPOST "localhost:9200/_cluster/reroute?pretty" -H 'Content-Type: application/json' -d'
{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "douaikan_index_back",
        "shard" : 0,
        "node" : "ee-elasticsearch023",
        "accept_data_loss" : "true"
      }
    }
  ]
}
'
(index: the index name; shard: the shard number; node: the node to allocate the empty primary to; accept_data_loss must be true and acknowledges that the data in this shard is permanently lost. Note the request body must be valid JSON, with no inline comments.)
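To fill in the index/shard/node arguments, the unassigned primaries can be listed first; a sketch, with a hypothetical _cat/shards sample in the h=index,shard,prirep,state column format:

```shell
# Hypothetical sample of _cat/shards output (in practice:
#   sample=$(curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state') ).
sample='douaikan_index_back 0 p UNASSIGNED
douaikan_index_back 0 r UNASSIGNED
other_index 0 p STARTED'

# Only primary (p) shards are candidates for allocate_empty_primary.
printf '%s\n' "$sample" | awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2 }'
```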
11. Query abnormal ES indices
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
12. A node in the ES cluster went down and the cluster immediately turned red. After the node was restarted and a long wait, a few shards still would not recover. Run the following command:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "twitter",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-11-06T06:11:15.562Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "CxXWE8BiQbS4ThB9AvvGQA",
      "node_name" : "node-1",
      "transport_address" : "10.142.0.2:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "gxegPAMyQa21MH5NxQEACw"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-11-06T06:11:15.562Z], failed_attempts[5], delayed=false, details[failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
The cause: recovery of the shard on that node was attempted 5 times without success, after which allocation gave up, leaving the shard unassigned.
Solution:
POST /_cluster/reroute?retry_failed=true
This retries the failed shard allocations, and the cluster soon returns to green.
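The [5] in the max_retry explanation above comes from the index setting index.allocation.max_retries (default 5). Besides retry_failed, raising that setting also lets allocation try again; a sketch, with 10 as an arbitrary example value:

```json
PUT /twitter/_settings
{
  "index": {
    "allocation": {
      "max_retries": 10
    }
  }
}
```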
Notes:
GET _search
{
"query": {
"match_all": {}
}
}
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
GET _cat/indices
GET /_cluster/allocation/explain
nginx-access-2022.08.12
POST /_cluster/reroute?retry_failed=true
POST /_cluster/reroute
GET _cat/indices?v&health=yellow
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
POST /_cache/clear
POST /nginx-access-2022.08.12/_close
GET /_cat/indices?v&h=index,fielddata.memory_size&s=fielddata.memory_size:desc
GET /_nodes/stats/breaker
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
GET _cat/indices
GET /_cluster/allocation/explain?pretty
nginx-access-2022.08.12
POST /_cluster/reroute?retry_failed=true
GET _cat/indices?v&health=yellow
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
POST /_cache/clear
POST /nginx-access-2022.08.12/_open
GET _cluster/health?level=indices
GET _cluster/health?level=shards
GET /_cat/indices?v&h=index,fielddata.memory_size&s=fielddata.memory_size:desc
GET _cluster/health
GET _cat/health
GET /_nodes/stats/breaker
GET /nginx-access-2022.08.12/_shard_stores
GET _cluster/state
GET _cluster/allocation/explain
PUT nginx-access-2022.08.12/_settings
{
"index":{
"number_of_replicas": "0"
}
}
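If number_of_replicas was dropped to 0 as an emergency measure, restore it once the cluster is green again (the value 1 below is an assumed original replica count; use whatever the index had before):

```json
PUT nginx-access-2022.08.12/_settings
{
  "index": {
    "number_of_replicas": "1"
  }
}
```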