生产中使用webhook对历史告警进行统计时发现,有些resolved消息没有对应的firing消息、有些的firing消息没有对应的resolved消息、有些resolved消息发送了多次、有些firing消息没有按照repeat_interval间隔重复且短时间内发送了多次。这些问题主要由group_wait和group_interval两个参数引起。

Alertmanager处理告警消息流程

文档中对这个参数的解释如下:

# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

流程参考:https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go

当alertmanager接收到一条新的alert时,会先根据group_by为其确定一个聚合组group,然后等待group_wait时间,如果在此期间接收到同一group的其他alert,则这些alert会被合并,然后再发送(alertmanager发送消息单位是group)。此参数的作用是防止短时间内出现大量告警的情况下,接收者被告警淹没。

在该组的alert第一次被发送后,该组会进入睡眠/唤醒周期,睡眠周期将持续group_interval时间,在睡眠状态下该group不会进行任何发送告警的操作(但会插入/更新(根据fingerprint)group中的内容),睡眠结束后进入唤醒状态,然后检查是否需要发送新的alert或者重复已发送的alert(resolved类型的alert在发送完后会从group中剔除)。这就是group_interval的作用。

聚合组在每次唤醒才会检查上一次发送alert是否已经超过repeat_interval时间,如果超过则再次发送该告警。因此repeat_interval并不代表告警的实际重复间隔,因为在第一次发送告警的repeat_interval时间后,聚合组可能还处在睡眠状态,所以实际的告警间隔应该大于repeat_interval且小于repeat_interval+group_interval。因此实际生产中group_interval值不可设得太大。

出现四种情况的原因

有些resolved alert没有对应的firing alert?

现在回头来解释一下为什么有些resolved alert没有对应的firing alert,因为这些firing alert发送给alertmanager时其所在的group恰好处在睡眠状态下,而其对应的resolved消息也在同一睡眠周期内被发送给alertmanager,接收到resolved消息后,group将其对应的firing消息覆盖,因此在唤醒时就只接收到了resolved消息。

有些的firing alert没有对应的resolved alert?

同理,为什么有些的firing alert没有对应的resolved alert呢?假设该firing消息发生在第n个睡眠周期,而在第n+1个睡眠周期内,该alert发生了resolved-firing-resolved...这样的状态变化,则其对应的resolved消息被n+1周期内的第二个resolved消息覆盖,因此表现为该firing alert没有对应的resolved消息。

收到多条重复的resolved alert?

为什么有些resolved消息接收到了多条?这个问题又涉及到prometheus rule组件的一个特性,当一个alert由firing变成resolved后,该resolved alert不会只发送给alertmanager一次,而是会先保存在内存中15分钟,并且重复多次发送给alertmanager,参看如下代码段(https://github.com/prometheus/prometheus/blob/main/rules/alerting.go):

// resolvedRetention is the duration for which a resolved alert instance
// is kept in memory state and consequently repeatedly sent to the AlertManager.
const resolvedRetention = 15 * time.Minute

发送多条resolved的情况为:在第n个睡眠周期内,alertmanager接收到第一条resolved alert并将其更新进group,紧接着在唤醒时发送该group并将resolved alert从group中剔除。但在第n+1个睡眠周期内,prometheus仍然在向alertmanager发送该resolved alert,因此下次唤醒时发送的group中又带有这条resolved alert。

firing alert短时间发送了多次?

这个容易理解,如上所述,alertmanager发送消息的单位是group,在该group被发送的下一个睡眠周期中,又有新的alert被insert到该group中,因此下一次唤醒时又发送了一次该group,表现为同一条firing alert短时间内发送了多次。

推荐配置

综合上面的分析,给出如下最佳配置,能尽可能避免这些情况的出现。

group_wait: 1m
  group_interval: 15m
  repeat_interval: 4h