Recently, our production Elasticsearch cluster started showing a pattern where heap memory filled up within seconds, GC ran constantly, and the cluster stopped responding. Looking at the cluster log and the slow query log, everything in them turned out to be a consequence of the frequent GC and unresponsiveness: errors thrown after the cluster stopped responding, high-frequency GC warnings, and queries that are normally fast showing up in the slow log.
The Kibana monitoring showed the heap filling up in just a few seconds; once it was full, the lack of free memory triggered back-to-back full GCs, CPU stayed pegged, and the cluster was essentially unresponsive. When this happens, only a restart fixes it; the cluster cannot recover on its own. Even though the heap was full, there were no OOM exceptions in the cluster log, and the problem occurred fairly randomly.
Since I had never run into this before, I wanted to see whether Prometheus monitoring would reveal any abnormal metrics when the cluster misbehaved. Unexpectedly, once Prometheus was deployed the problem stopped occurring.
I asked my colleagues what changes had been made:
- The ES `index.max_result_window` setting was lowered from 100000 back to 10000 (a sketch of this change follows the list)
- Some restrictions were added to the application's dynamic ES queries
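For reference, this is roughly how that setting change would be applied. A minimal sketch using the elasticsearch-py 7.x client; the connection URL and the index name `my_index` are placeholders, not the actual production values:

```python
from elasticsearch import Elasticsearch

# Placeholder connection; point this at your own cluster.
es = Elasticsearch(["http://localhost:9200"])

# index.max_result_window is a dynamic index setting, so lowering it back to
# the 10000 default takes effect immediately, without reopening the index.
es.indices.put_settings(
    index="my_index",
    body={"index": {"max_result_window": 10000}},
)
```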
However, production is a 3-node cluster, each node with 16 cores, 32 GB of host memory, and a 16 GB ES heap, so even fairly complex and unreasonable dynamic queries normally performed well, including paginated queries with 10+ wildcard clauses plus sorting whose total match count was only a few thousand. So the dynamic queries themselves didn't look like the cause.
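To make that workload concrete, here is a rough sketch of what such a dynamic query might look like; the field names and values are invented, only the shape (a dozen wildcard clauses plus sorting and from/size pagination) mirrors the description above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder connection

query = {
    "query": {
        "bool": {
            "must": [
                {"wildcard": {"order_no": "*2022*"}},
                {"wildcard": {"customer_name": "*smith*"}},
                {"wildcard": {"remark": "*urgent*"}},
                # ... in production there were 10+ wildcard clauses like these
            ]
        }
    },
    "sort": [{"create_time": {"order": "desc"}}],
    "from": 0,
    "size": 20,
}
resp = es.search(index="my_index", body=query)
print(resp["hits"]["total"])
```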
Could it instead be that a single query pulled in too large a volume of data at once, flooding the heap before the query could complete, so the memory could neither be released nor the request answered, ending in a full heap and constant GC?
I tried to reproduce this scenario on a development cluster: 2 nodes, 8 cores each, 16 GB of host memory, and a 4 GB ES heap, with roughly 2 GB of heap free when idle. Because individual documents in dev are small, fetching about 1,000,000 documents (300 MB+ of stored data) in a single query reproduced the symptoms: the heap filled up in seconds, GC ran constantly, and the cluster stopped responding. The one difference is that the dev cluster's slow log recorded the offending query body and the related fetch-phase entries, while production did not; I don't know why yet, so I can't be sure this is exactly what happened in production.
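The reproduction was essentially one from/size request asking for the whole result set in a single response. A minimal sketch, assuming `index.max_result_window` has already been raised high enough to allow it (the index name and the exact size are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder connection

# Ask for ~1,000,000 hits in one response. All of those documents have to be
# collected and serialized on the ES side, which is what drove the 4 GB dev
# heap to fill up in seconds and GC to run continuously.
resp = es.search(
    index="my_index",
    body={
        "query": {"match_all": {}},
        "size": 1_000_000,
    },
    request_timeout=300,
)
```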
Below is part of the high-frequency GC output from the cluster log:
[2022-07-04T11:12:17,141][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][431][8] duration [6.9s], collections [1]/[7.8s], total [6.9s]/[25.8s], memory [3.5gb]->[3.7gb]/[3.9gb], all_pools {[young] [1mb]->[79.7mb]/[266.2mb]}{[survivor] [994.2kb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:17,141][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][431] overhead, spent [7s] collecting in the last [7.8s]
[2022-07-04T11:12:23,186][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][432][9] duration [5.9s], collections [1]/[6s], total [5.9s]/[31.7s], memory [3.7gb]->[3.6gb]/[3.9gb], all_pools {[young] [79.7mb]->[34.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:23,187][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][432] overhead, spent [5.9s] collecting in the last [6s]
[2022-07-04T11:12:31,575][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][433][10] duration [8.3s], collections [1]/[8.3s], total [8.3s]/[40.1s], memory [3.6gb]->[3.6gb]/[3.9gb], all_pools {[young] [34.4mb]->[13.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:31,575][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][433] overhead, spent [8.3s] collecting in the last [8.3s]
[2022-07-04T11:12:37,633][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][434][11] duration [5.9s], collections [1]/[6s], total [5.9s]/[46s], memory [3.6gb]->[3.6gb]/[3.9gb], all_pools {[young] [13.5mb]->[21mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:37,633][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][434] overhead, spent [5.9s] collecting in the last [6s]
[2022-07-04T11:12:45,879][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][435][12] duration [8.1s], collections [1]/[8.2s], total [8.1s]/[54.2s], memory [3.6gb]->[3.7gb]/[3.9gb], all_pools {[young] [21mb]->[36.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:45,879][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][435] overhead, spent [8.1s] collecting in the last [8.2s]
[2022-07-04T11:12:51,901][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][436][13] duration [5.9s], collections [1]/[6s], total [5.9s]/[1m], memory [3.7gb]->[3.6gb]/[3.9gb], all_pools {[young] [36.5mb]->[5.8mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:12:51,901][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][436] overhead, spent [5.9s] collecting in the last [6s]
[2022-07-04T11:12:51,911][WARN ][o.e.c.InternalClusterInfoService] [CqSCISw] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
[2022-07-04T11:13:00,316][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][437][14] duration [8.3s], collections [1]/[8.4s], total [8.3s]/[1.1m], memory [3.6gb]->[3.6gb]/[3.9gb], all_pools {[young] [5.8mb]->[10.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:13:00,337][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][437] overhead, spent [8.3s] collecting in the last [8.4s]
[2022-07-04T11:13:06,533][INFO ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][old][438][15] duration [6s], collections [1]/[6.2s], total [6s]/[1.2m], memory [3.6gb]->[3.7gb]/[3.9gb], all_pools {[young] [10.1mb]->[41.1mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.6gb]->[3.6gb]/[3.6gb]}
[2022-07-04T11:13:06,534][WARN ][o.e.m.j.JvmGcMonitorService] [CqSCISw] [gc][438] overhead, spent [6s] collecting in the last [6.2s]
So unless it is really necessary, it is best not to change `index.max_result_window` from its default of 10000; if you genuinely need to, test the change carefully before putting it into production.
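If a workload genuinely needs to read through more than 10000 hits, batched reads via the scroll API (or `search_after`) keep per-request memory bounded instead of raising the window. A minimal sketch using `helpers.scan` from elasticsearch-py; the index name and the handler are placeholders:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder connection

# helpers.scan wraps the scroll API and streams hits in fixed-size batches,
# so memory per request stays bounded no matter how many documents match.
for hit in helpers.scan(
    es,
    index="my_index",
    query={"query": {"match_all": {}}},
    size=1000,  # documents per scroll batch
):
    handle(hit["_source"])  # placeholder handler
```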