Alertmanager alerts, multiple inhibit_rules and source_match, and duplicate alerts from an Alertmanager cluster

Reposted

colddawn 2024-03-28 23:41:34


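While using a webhook to collect statistics on historical alerts in production, we found four anomalies: some resolved messages had no matching firing message, some firing messages had no matching resolved message, some resolved messages were sent several times, and some firing messages were sent several times within a short period instead of being repeated at repeat_interval. These problems are caused mainly by two parameters: group_wait and group_interval.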

How Alertmanager processes alerts

The documentation explains the relevant parameters as follows:

# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

For the flow, see: https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go

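When Alertmanager receives a new alert, it first uses group_by to determine which aggregation group the alert belongs to, then waits for group_wait. Any other alerts for the same group that arrive during this wait are merged with it and sent together (Alertmanager sends notifications per group, not per alert). The purpose of this parameter is to keep receivers from being flooded when many alerts fire within a short time.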
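The grouping step can be illustrated with a minimal sketch. It is not Alertmanager's implementation: `group_key` and `batch_alerts` are hypothetical helper names, and the sample labels reuse the cluster=A / alertname=LatencyHigh example from the documentation quoted above.

```python
from collections import defaultdict

def group_key(labels, group_by):
    """Build the aggregation key from the values of the group_by labels."""
    return tuple((name, labels.get(name, "")) for name in group_by)

def batch_alerts(alerts, group_by):
    """Collect alerts that share the same group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)

# Hypothetical alerts received within one group_wait window.
alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node1"},
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node2"},
    {"alertname": "LatencyHigh", "cluster": "B", "instance": "node3"},
]
groups = batch_alerts(alerts, ["cluster", "alertname"])
for key, members in groups.items():
    print(key, len(members))
```

Here the two cluster=A alerts end up in one group and would go out as a single notification, while the cluster=B alert forms its own group.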

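After the group's alerts have been sent for the first time, the group enters a sleep/wake cycle. Each sleep period lasts group_interval. While asleep, the group sends nothing, although alerts are still inserted into it or updated in it (matched by fingerprint). When the sleep period ends, the group wakes up and checks whether it needs to send new alerts or repeat alerts it has already sent. Resolved alerts are removed from the group once they have been sent. This is what group_interval controls.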
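As a rough illustration of this cycle, the sketch below computes the wake-up times of one group under the simplified model described here: the first send happens group_wait after the first alert, then the group wakes every group_interval. The function names and the numeric values are assumptions made for the example.

```python
def flush_times(first_alert_at, group_wait, group_interval, horizon):
    """Wake-up times (seconds) of one group, up to and including horizon."""
    times = []
    t = first_alert_at + group_wait
    while t <= horizon:
        times.append(t)
        t += group_interval
    return times

def next_flush(arrival, schedule):
    """First wake-up at or after an alert's arrival, or None if there is none."""
    for t in schedule:
        if t >= arrival:
            return t
    return None

# Defaults from the documentation: group_wait=30s, group_interval=5m.
schedule = flush_times(first_alert_at=0, group_wait=30, group_interval=300, horizon=1000)
print(schedule)                   # [30, 330, 630, 930]
print(next_flush(100, schedule))  # 330
```

An alert that arrives at t=100 lands in a sleeping group and cannot be sent before t=330, which is the root of the effects discussed in the rest of the article.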

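Only when it wakes up does a group check whether more than repeat_interval has passed since the alert was last sent. If so, the alert is sent again. repeat_interval is therefore not the actual interval between repeats: when repeat_interval has elapsed since the first notification, the group may still be asleep, and the repeat has to wait for the next wake-up. The actual interval falls between repeat_interval and repeat_interval + group_interval. For this reason, group_interval should not be set too large in production.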
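The bound can be checked with a little arithmetic. The sketch below assumes whole-second durations, that the previous notification was sent exactly at a wake-up, and that a repeat is due once at least repeat_interval has elapsed. `actual_repeat_delay` is a hypothetical helper, not an Alertmanager function.

```python
def actual_repeat_delay(repeat_interval, group_interval):
    """Seconds between a notification and its repeat when repeat_interval
    is only checked at wake-ups spaced group_interval apart."""
    # Integer ceiling division: number of wake-ups needed to cover repeat_interval.
    wakeups_needed = (repeat_interval + group_interval - 1) // group_interval
    return wakeups_needed * group_interval

# repeat_interval = 4h; group_interval = 5m divides it evenly, 7m does not.
print(actual_repeat_delay(4 * 3600, 5 * 60))  # 14400
print(actual_repeat_delay(4 * 3600, 7 * 60))  # 14700
```

With a 7-minute group_interval the repeat arrives 300 seconds later than configured, still below repeat_interval + group_interval (14820 seconds).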

Causes of the four cases

Why do some resolved alerts have no matching firing alert?

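We can now explain why some resolved alerts have no matching firing alert. The firing alert reached Alertmanager while its group happened to be asleep, and the corresponding resolved message reached Alertmanager within the same sleep period. When the resolved message arrived, it overwrote the firing message in the group, so at wake-up the receiver got only the resolved message.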
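The overwrite can be reproduced with a small simulation of the simplified model used in this article: the group keeps only the latest status per fingerprint, sends what changed at each wake-up, and drops resolved alerts after sending them. The `simulate` function, the fingerprints "disk" and "cpu", and the timestamps are all made up for illustration. The real notification pipeline has more stages than this.

```python
def simulate(events, flush_at):
    """events: (time, fingerprint, status) tuples; flush_at: sorted wake-up times.
    Returns the (flush_time, fingerprint, status) tuples a receiver would see."""
    group = {}       # fingerprint -> latest status
    changed = set()  # fingerprints updated since the last wake-up
    sent = []
    queue = sorted(events)
    for t in flush_at:
        while queue and queue[0][0] <= t:
            _, fp, status = queue.pop(0)
            group[fp] = status  # a later update overwrites the earlier one
            changed.add(fp)
        for fp in sorted(changed):
            sent.append((t, fp, group[fp]))
            if group[fp] == "resolved":
                del group[fp]   # resolved alerts leave the group once sent
        changed.clear()
    return sent

events = [
    (0, "disk", "firing"),     # creates the group; sent at t=30
    (100, "cpu", "firing"),    # arrives while the group sleeps
    (200, "cpu", "resolved"),  # same sleep period: overwrites the firing state
]
sent = simulate(events, flush_at=[30, 330])
print(sent)  # [(30, 'disk', 'firing'), (330, 'cpu', 'resolved')]
```

The receiver never sees a firing notification for "cpu", only the resolved one at t=330.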

Why do some firing alerts have no matching resolved alert?

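The same reasoning explains why some firing alerts have no matching resolved alert. Suppose the firing message arrives in sleep period n, and during sleep period n+1 the alert goes through a sequence of state changes such as resolved, firing, resolved, and so on. The resolved message that belongs to the original firing message is then overwritten by the second resolved message of period n+1, so the firing alert appears to have no matching resolved message.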
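A small sketch of this scenario follows. It assumes that the statistics pair firing and resolved messages by the alert's start time. That pairing rule is an assumption made for the example and is not stated in the original analysis. The helper name and timestamps are hypothetical as well.

```python
def last_state_per_window(updates, window_ends):
    """updates: (time, status, starts_at) tuples for ONE fingerprint.
    At each wake-up only the last update received in that window is sent."""
    sent = []
    prev_end = float("-inf")
    for end in window_ends:
        in_window = [u for u in updates if prev_end < u[0] <= end]
        if in_window:
            _, status, starts_at = max(in_window, key=lambda u: u[0])
            sent.append((end, status, starts_at))
        prev_end = end
    return sent

updates = [
    (100, "firing", 100),    # first episode starts, sleep period n
    (400, "resolved", 100),  # first episode ends, sleep period n+1
    (450, "firing", 450),    # second episode starts
    (500, "resolved", 450),  # second episode ends and overwrites the first resolved
]
sent = last_state_per_window(updates, window_ends=[330, 630])
print(sent)  # [(330, 'firing', 100), (630, 'resolved', 450)]

firing_starts = {s for _, status, s in sent if status == "firing"}
resolved_starts = {s for _, status, s in sent if status == "resolved"}
print(firing_starts - resolved_starts)  # {100}
```

Under this pairing rule, the firing message that started at t=100 never gets its own resolved message.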

Why are duplicate resolved alerts received?

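Why are some resolved messages received several times? This involves a behavior of the Prometheus rule component: after an alert goes from firing to resolved, the resolved alert is not sent to Alertmanager just once. It is kept in memory for 15 minutes and sent to Alertmanager repeatedly during that time. See the following snippet (https://github.com/prometheus/prometheus/blob/main/rules/alerting.go):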

// resolvedRetention is the duration for which a resolved alert instance
// is kept in memory state and consequently repeatedly sent to the AlertManager.
const resolvedRetention = 15 * time.Minute

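Multiple resolved messages are sent as follows. During sleep period n, Alertmanager receives the first resolved alert and updates the group with it. At the next wake-up it sends the group and removes the resolved alert from it. During sleep period n+1, however, Prometheus is still sending the same resolved alert to Alertmanager, so the group sent at the following wake-up contains this resolved alert again.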
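To get a feel for how many duplicates this can produce, the sketch below counts the wake-ups whose group contains the resolved alert. It follows the simplified model of this article (the alert is re-inserted on every resend and dropped after every send). The resend period of 60 seconds and the wake-up times are assumptions for the example. Only the 15-minute retention comes from the snippet above.

```python
RESOLVED_RETENTION = 15 * 60  # seconds, mirrors resolvedRetention in the snippet above

def resolved_send_count(resolved_at, resend_every, flush_at):
    """Number of wake-ups at which the group holds the resolved alert."""
    # Times at which the resolved alert reaches Alertmanager (assumed inclusive).
    arrivals = range(resolved_at, resolved_at + RESOLVED_RETENTION + 1, resend_every)
    count = 0
    prev = float("-inf")
    for t in flush_at:
        if any(prev < a <= t for a in arrivals):
            count += 1  # the alert was (re)inserted since the last wake-up
        prev = t
    return count

# Resolved at t=100, resent every 60s (assumed), group wakes every 5 minutes.
print(resolved_send_count(100, 60, [330, 630, 930, 1230]))  # 4
```

In this model a single resolved alert shows up in four consecutive notifications before Prometheus stops resending it.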

Why is a firing alert sent several times within a short period?

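This one is easy to understand. As described above, Alertmanager sends notifications per group. If a new alert is inserted into the group during the sleep period that follows a send, the whole group is sent again at the next wake-up, so the same firing alert appears to be sent several times within a short period.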
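The effect is easy to reproduce in a toy model in which every wake-up that saw a new arrival sends the whole group. The function name, the fingerprints "node1" and "node2", and the timestamps are hypothetical.

```python
def group_notifications(arrivals, flush_at):
    """arrivals: (time, fingerprint) tuples of firing alerts in one group.
    At each wake-up, if anything new arrived, the WHOLE group is sent."""
    group = []
    sent = []
    queue = sorted(arrivals)
    for t in flush_at:
        has_new = False
        while queue and queue[0][0] <= t:
            group.append(queue.pop(0)[1])
            has_new = True
        if has_new:
            sent.append((t, list(group)))  # snapshot of every alert in the group
    return sent

sent = group_notifications([(0, "node1"), (100, "node2")], flush_at=[30, 330])
print(sent)  # [(30, ['node1']), (330, ['node1', 'node2'])]

times_node1_sent = sum("node1" in members for _, members in sent)
print(times_node1_sent)  # 2
```

The alert "node1" is delivered at t=30 and again at t=330, long before repeat_interval has elapsed, only because "node2" joined the same group.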
