1. Check health at the cluster, index, and shard level: overall status, node counts, shard statistics, and so on
curl -XGET localhost:9200/_cluster/health
curl -XGET localhost:9200/_cluster/health?level=indices
curl -XGET localhost:9200/_cluster/health?level=shards
curl -XGET localhost:9200/_cluster/health/index_name?level=indices
curl -XGET localhost:9200/_cluster/health/index_name?level=shards
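For scripted checks, the status field can be pulled out of the health response. A minimal sketch; the JSON string below is a hypothetical sample of a _cluster/health response, standing in for the live output of curl -s localhost:9200/_cluster/health:

```shell
# Hypothetical sample of a _cluster/health response (in practice:
#   sample=$(curl -s localhost:9200/_cluster/health) ).
sample='{"cluster_name":"my-cluster","status":"yellow","number_of_nodes":3}'

# Extract the "status" field without depending on jq.
status=$(printf '%s' "$sample" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
echo "cluster status: $status"

# Anything other than green deserves a look at index/shard level health.
if [ "$status" != "green" ]; then
  echo "WARNING: cluster is $status" >&2
fi
```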
2. Query cluster-wide statistics
curl localhost:9200/_cluster/stats
3. List indices in red status
curl -s 'localhost:9200/_cat/indices?v'|grep '^red'
4. List indices in yellow status
curl -s 'localhost:9200/_cat/indices?v'|grep '^yellow'
5. Query all cluster settings, including defaults
curl 'http://127.0.0.1:9200/_cluster/settings?pretty&include_defaults'
6. Query task information on each node
curl -XGET localhost:9200/_cat/tasks?v=true
7. Cluster in red or yellow status; unassigned shards exist in the cluster
If an index has unassigned primary shards, that index is red.
If an index has unassigned replica shards, that index is yellow.
First query why the shards are unassigned. There are many possible reasons; record them first, restore service, and then investigate the root cause in detail.
curl -s -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
## Per-shard detail: curl localhost:9200/_cluster/allocation/explain
Run reroute to retry allocation of the unassigned shards:
curl -XPOST 'http://127.0.0.1:9200/_cluster/reroute?retry_failed=true'
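The triage step above (list unassigned shards, record the reasons) can be scripted; the lines below are a hypothetical sample of _cat/shards output in the h=index,shard,prirep,state,unassigned.reason column format:

```shell
# Hypothetical sample of _cat/shards output (in practice:
#   sample=$(curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason') ).
sample='logs-2022.08.12 0 p UNASSIGNED NODE_LEFT
logs-2022.08.12 0 r UNASSIGNED NODE_LEFT
app-index 1 r UNASSIGNED ALLOCATION_FAILED
app-index 0 p STARTED'

# Count unassigned shards per reason; record this before running reroute.
printf '%s\n' "$sample" | awk '$4 == "UNASSIGNED" { count[$5]++ }
  END { for (r in count) print r, count[r] }'
```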
8. Task backlog in the cluster
Task priorities:
IMMEDIATE > URGENT > HIGH > NORMAL > LOW > LANGUID
Query the backlog details. If the queued tasks all look normal, check the current master node's resource usage and read the master logs.
curl http://127.0.0.1:9200/_cat/pending_tasks
curl http://127.0.0.1:9200/_cluster/pending_tasks    # shows whether a task is executing, how long it has waited in the queue, etc.
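Queue pressure can be summarized per priority; the lines below are a hypothetical sample in the _cat/pending_tasks column format (insertOrder timeInQueue priority source):

```shell
# Hypothetical sample of _cat/pending_tasks output (in practice:
#   sample=$(curl -s localhost:9200/_cat/pending_tasks) ).
sample='1685 855ms HIGH update-mapping [foo][t]
1686 843ms HIGH update-mapping [foo][t]
1687 753ms NORMAL refresh-mapping [foo][[t]]'

# Count queued tasks per priority; a growing URGENT/HIGH backlog points
# at a struggling master node.
printf '%s\n' "$sample" | awk '{ count[$3]++ } END { for (p in count) print p, count[p] }'
```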
9. Data inserts into the cluster fail, and node logs show circuit-breaker related exceptions
Find which node is tripping its circuit breakers:
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"
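A non-zero tripped counter in that response means a breaker has fired; a quick way to scan for it without jq (the JSON string below is a hypothetical, heavily trimmed sample of the _nodes/stats/breaker response shape):

```shell
# Hypothetical trimmed sample of a _nodes/stats/breaker response (in practice:
#   sample=$(curl -s localhost:9200/_nodes/stats/breaker) ).
sample='{"nodes":{"abc123":{"name":"node-1","breakers":{"parent":{"tripped":3},"fielddata":{"tripped":0}}}}}'

# Any "tripped" value greater than 0 means that breaker has fired.
printf '%s' "$sample" | grep -o '"tripped":[0-9]*'
```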
10. Lost shards
Sometimes node failures, disk failures, or an ill-chosen replica count leave some shards lost beyond recovery, which puts the whole index in red status. To preserve the data in the remaining shards and bring the index back to green, we can allocate an empty primary for each lost shard.
curl -XPOST "localhost:9200/_cluster/reroute?pretty" -H 'Content-Type: application/json' -d'
{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "douaikan_index_back",
        "shard" : 0,
        "node" : "ee-elasticsearch023",
        "accept_data_loss" : "true"
      }
    }
  ]
}
'
(index: the index name; shard: the shard number; node: the node to allocate the empty primary to; accept_data_loss must be true and acknowledges that the data in this shard is permanently lost. Note the request body must be valid JSON, with no inline comments.)
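To fill in the index/shard/node arguments, the unassigned primaries can be listed first; a sketch, with a hypothetical _cat/shards sample in the h=index,shard,prirep,state column format:

```shell
# Hypothetical sample of _cat/shards output (in practice:
#   sample=$(curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state') ).
sample='douaikan_index_back 0 p UNASSIGNED
douaikan_index_back 0 r UNASSIGNED
other_index 0 p STARTED'

# Only primary (p) shards are candidates for allocate_empty_primary.
printf '%s\n' "$sample" | awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2 }'
```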
11. Query abnormal ES indices
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
12. A node in the ES cluster went down and the cluster immediately turned red. After the node was restarted and a long wait, a few shards still would not recover. Run the following command:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "twitter",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-11-06T06:11:15.562Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "CxXWE8BiQbS4ThB9AvvGQA",
      "node_name" : "node-1",
      "transport_address" : "10.142.0.2:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "gxegPAMyQa21MH5NxQEACw"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-11-06T06:11:15.562Z], failed_attempts[5], delayed=false, details[failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
The cause: recovery of the shard on that node was attempted 5 times without success, after which allocation gave up, leaving the shard unassigned.
Solution:
POST /_cluster/reroute?retry_failed=true
This retries the failed shard allocations, and the cluster soon returns to green.
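The [5] in the max_retry explanation above comes from the index setting index.allocation.max_retries (default 5). Besides retry_failed, raising that setting also lets allocation try again; a sketch, with 10 as an arbitrary example value:

```json
PUT /twitter/_settings
{
  "index": {
    "allocation": {
      "max_retries": 10
    }
  }
}
```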
Notes:
GET _search
{
"query": {
"match_all": {}
}
}
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
GET _cat/indices
GET /_cluster/allocation/explain
nginx-access-2022.08.12
POST /_cluster/reroute?retry_failed=true
POST /_cluster/reroute
GET _cat/indices?v&health=yellow
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
POST /_cache/clear
POST /nginx-access-2022.08.12/_close
GET /_cat/indices?v&h=index,fielddata.memory_size&s=fielddata.memory_size:desc
GET /_nodes/stats/breaker
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
GET _cat/indices
GET /_cluster/allocation/explain?pretty
nginx-access-2022.08.12
POST /_cluster/reroute?retry_failed=true
GET _cat/indices?v&health=yellow
GET /_cat/shards?v&h=n,index,shard,prirep,state,sto,sc,unassigned.reason,unassigned.details
POST /_cache/clear
POST /nginx-access-2022.08.12/_open
GET _cluster/health?level=indices
GET _cluster/health?level=shards
GET /_cat/indices?v&h=index,fielddata.memory_size&s=fielddata.memory_size:desc
GET _cluster/health
GET _cat/health
GET /_nodes/stats/breaker
GET /nginx-access-2022.08.12/_shard_stores
GET _cluster/state
GET _cluster/allocation/explain
PUT nginx-access-2022.08.12/_settings
{
"index":{
"number_of_replicas": "0"
}
}
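If number_of_replicas was dropped to 0 as an emergency measure, restore it once the cluster is green again (the value 1 below is an assumed original replica count; use whatever the index had before):

```json
PUT nginx-access-2022.08.12/_settings
{
  "index": {
    "number_of_replicas": "1"
  }
}
```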