The Spring Boot Admin console already exposes many monitoring metrics, as shown in the screenshot:
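When a service instance misbehaves and we start troubleshooting, heap memory, GC, and thread count are all things we need to look at. We would therefore like to be alerted when JVM heap usage reaches a threshold, with the notification carrying the current heap size, the used size, the current total thread count, and other metrics useful for reference.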
However, SBA2 does not provide this kind of alert event, so we implement a simple version ourselves. Here is the final Feishu notification:
Without further ado, here is the code:
AlarmMessage
public interface AlarmMessage {
    /**
     * Send a text alert
     * @param content the alert text to send
     */
    void sendData(String content);
}
The notification itself is sent by reusing the FeiShuNotifier class.
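The FeiShuNotifier class is not shown in this post, so here is a minimal sketch of what an AlarmMessage implementation could look like. Everything in it is an assumption for illustration: the class name WebhookAlarmMessage, the helper names, the use of the JDK's java.net.http.HttpClient (Java 11+), and the webhook URL passed in by the caller. The payload follows the text-message shape used by Feishu custom bot webhooks, as far as I know; check it against the Feishu documentation before relying on it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Repeated from the article so that this sketch compiles on its own.
interface AlarmMessage {
    void sendData(String content);
}

// Hypothetical implementation; the article's FeiShuNotifier may work differently.
class WebhookAlarmMessage implements AlarmMessage {
    private final HttpClient client = HttpClient.newHttpClient();
    private final URI webhook;

    WebhookAlarmMessage(String webhookUrl) {
        this.webhook = URI.create(webhookUrl);
    }

    // Builds a text-message body: {"msg_type":"text","content":{"text":"..."}}
    static String buildPayload(String content) {
        return "{\"msg_type\":\"text\",\"content\":{\"text\":\"" + escapeJson(content) + "\"}}";
    }

    // Escapes quotes, backslashes, and common whitespace only;
    // a JSON library is the safer choice for arbitrary input.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    @Override
    public void sendData(String content) {
        HttpRequest request = HttpRequest.newBuilder(webhook)
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(buildPayload(content), StandardCharsets.UTF_8))
                .build();
        try {
            client.send(request, HttpResponse.BodyHandlers.discarding());
        } catch (IOException e) {
            // A failed delivery should not break the periodic check.
            System.err.println("failed to send alarm: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Keeping buildPayload separate from sendData makes the message format easy to check without touching the network.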
NotifierAutoConfiguration.jvmAlarm
@Bean(initMethod = "start", destroyMethod = "stop")
@ConditionalOnProperty(prefix = "spring.boot.admin.notify.jvm", name = "enabled", havingValue = "true")
@ConfigurationProperties("spring.boot.admin.notify.jvm")
@Lazy(false)
public JvmAlarm jvmAlarm(InstanceRepository repository, AlarmMessage alarmMessage) {
    return new JvmAlarm(repository, alarmMessage);
}
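This defines the JVM alarm bean. The alarm is enabled when spring.boot.admin.notify.jvm.enabled=true, and the other properties under spring.boot.admin.notify.jvm are bound to the bean through its setters.
start: initialization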
JvmAlarm.start
private void start() {
    this.scheduler = Schedulers.newSingle("jvm-check");
    this.subscription = Flux.interval(Duration.ofSeconds(this.interval))
            .subscribeOn(this.scheduler)
            .subscribe(this::checkFn);
}
start creates a scheduled task that checks the JVM metrics at a fixed interval.
stop: the scheduled task is disposed when the bean is destroyed
private void stop() {
    if (this.subscription != null) {
        this.subscription.dispose();
        this.subscription = null;
    }
    if (this.scheduler != null) {
        this.scheduler.dispose();
        this.scheduler = null;
    }
}
JvmAlarm
public class JvmAlarm {
    // Note: `log` used below is assumed to be an SLF4J logger, for example one generated by Lombok's @Slf4j.
    private final RestTemplate restTemplate = new RestTemplate();
    private Scheduler scheduler;
    private Disposable subscription;
    private InstanceRepository repository;
    private AlarmMessage alarmMessage;
    /**
     * Heap usage threshold (used / max)
     */
    private double threshold = 0.95;
    /**
     * Accumulated over-threshold count required before alerting
     */
    private int alarmCountThreshold = 3;
    /**
     * Check interval, in seconds
     */
    private long interval = 10;
    /**
     * Format for memory sizes, in MB
     */
    private final DecimalFormat df = new DecimalFormat("0.00M");
    /**
     * Instances excluded from the check
     */
    private String excludeInstances = "";
    /**
     * On/off switch
     */
    private boolean enabled = true;
    /**
     * Alert message template
     */
    private final String ALARM_TPL = "Service instance [%s]: memory usage exceeded threshold [%s], accumulated [%s] times, current max memory [%s], used [%s], current thread count [%s]";
    /**
     * Over-threshold count per instance
     */
    private final Map<String, Integer> instanceCount = new HashMap<>();

    public JvmAlarm(InstanceRepository repository, AlarmMessage alarmMessage) {
        this.repository = repository;
        this.alarmMessage = alarmMessage;
    }

    private void checkFn(Long aLong) {
        if (!enabled) {
            return;
        }
        log.debug("check jvm for all instances");
        repository.findAll()
                .filter(instance -> !excludeInstances.contains(instance.getRegistration().getName()))
                .map(instance -> {
                    String instanceName = instance.getRegistration().getName();
                    // Maximum heap size, in MB
                    double jvmMax = getJvmValue(instance, "jvm.memory.max?tag=area:heap") / (1024 * 1024d);
                    // Used heap size, in MB
                    double jvmUsed = getJvmValue(instance, "jvm.memory.used?tag=area:heap") / (1024 * 1024d);
                    if (jvmMax != 0 && jvmUsed / jvmMax > threshold
                            && instanceCount.computeIfAbsent(instanceName, key -> 0) > alarmCountThreshold) {
                        // Current live thread count
                        int threads = (int) getJvmValue(instance, "jvm.threads.live");
                        String content = String.format(ALARM_TPL, instanceName, (threshold * 100) + "%",
                                alarmCountThreshold, df.format(jvmMax), df.format(jvmUsed), threads);
                        alarmMessage.sendData(content);
                        // Start counting again
                        instanceCount.remove(instanceName);
                    }
                    // Update the accumulated over-threshold count
                    if (jvmMax != 0 && jvmUsed / jvmMax > threshold) {
                        instanceCount.computeIfPresent(instanceName, (key, value) -> value + 1);
                    }
                    return Mono.just(0d);
                })
                .subscribe();
    }

    private long getJvmValue(Instance instance, String tags) {
        try {
            String reqUrl = instance.getRegistration().getManagementUrl() + "/metrics/" + tags;
            log.debug("check jvm {},uri {}", instance.getRegistration().getName(), reqUrl);
            ResponseEntity<String> responseEntity = restTemplate.getForEntity(reqUrl, String.class);
            String body = responseEntity.getBody();
            JSONObject bodyObject = JSON.parseObject(body);
            JSONArray measurementsArray = bodyObject.getJSONArray("measurements");
            if (measurementsArray != null && !measurementsArray.isEmpty()) {
                return measurementsArray.getJSONObject(0).getLongValue("value");
            }
        } catch (Exception ex) {
            log.error(ex.getMessage());
        }
        return 0L;
    }

    public long getInterval() {
        return interval;
    }

    public void setInterval(long interval) {
        this.interval = interval;
    }

    public String getExcludeInstances() {
        return excludeInstances;
    }

    public void setExcludeInstances(String excludeInstances) {
        this.excludeInstances = excludeInstances;
    }

    public boolean isEnabled() {
        return enabled;
    }

    public void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    public double getThreshold() {
        return threshold;
    }

    public void setThreshold(double threshold) {
        this.threshold = threshold;
    }

    public int getAlarmCountThreshold() {
        return alarmCountThreshold;
    }

    public void setAlarmCountThreshold(int alarmCountThreshold) {
        this.alarmCountThreshold = alarmCountThreshold;
    }

    private void start() {
        this.scheduler = Schedulers.newSingle("jvm-check");
        this.subscription = Flux.interval(Duration.ofSeconds(this.interval))
                .subscribeOn(this.scheduler)
                .subscribe(this::checkFn);
    }

    private void stop() {
        if (this.subscription != null) {
            this.subscription.dispose();
            this.subscription = null;
        }
        if (this.scheduler != null) {
            this.scheduler.dispose();
            this.scheduler = null;
        }
    }
}
All of these JVM metrics can be fetched by calling /actuator/metrics. The response below lists every available metric (I have formatted the output):
{
"names": ["http.server.requests", "jvm.buffer.count", "jvm.buffer.memory.used", "jvm.buffer.total.capacity", "jvm.classes.loaded", "jvm.classes.unloaded", "jvm.gc.live.data.size", "jvm.gc.max.data.size", "jvm.gc.memory.allocated", "jvm.gc.memory.promoted", "jvm.gc.pause", "jvm.memory.committed", "jvm.memory.max", "jvm.memory.used", "jvm.threads.daemon", "jvm.threads.live", "jvm.threads.peak", "jvm.threads.states", "logback.events", "process.cpu.usage", "process.start.time", "process.uptime", "system.cpu.count", "system.cpu.usage", "tomcat.sessions.active.current", "tomcat.sessions.active.max", "tomcat.sessions.alive.max", "tomcat.sessions.created", "tomcat.sessions.expired", "tomcat.sessions.rejected"]
}
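Calling /actuator/metrics/{metric name} returns the details of that metric, for example the JVM maximum memory: /actuator/metrics/jvm.memory.max. To drill down further by tag, use jvm.memory.max?tag=area:heap. Other metrics work the same way, so I will not go through them here.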
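The URL pattern described here can be sketched as a small helper. The class and method names (MetricUrls, metricUrl) are hypothetical and not part of the article's code, which simply concatenates the strings. This version percent-encodes each tag value, so the result is meant to be wrapped in a java.net.URI (for example via URI.create) before it is handed to an HTTP client; a client that treats a plain string as a URI template may encode it a second time.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building Actuator metric URLs.
class MetricUrls {
    // Builds <managementUrl>/metrics/<metric>?tag=KEY:VALUE&tag=KEY:VALUE...
    static String metricUrl(String managementUrl, String metric, String... tags) {
        StringBuilder url = new StringBuilder(managementUrl);
        if (!managementUrl.endsWith("/")) {
            url.append('/');
        }
        url.append("metrics/").append(metric);
        for (int i = 0; i < tags.length; i++) {
            url.append(i == 0 ? "?tag=" : "&tag=");
            // URLEncoder produces form encoding, so spaces come out as "+";
            // replace them with "%20" to get plain percent-encoding.
            url.append(URLEncoder.encode(tags[i], StandardCharsets.UTF_8).replace("+", "%20"));
        }
        return url.toString();
    }
}
```

For instance, passing the tags "area:heap" and "id:PS Old Gen" narrows jvm.memory.used down to a single heap pool, using one of the id values listed under availableTags in the response.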
{
"name": "jvm.memory.max",
"description": "The maximum amount of memory in bytes that can be used for memory management",
"baseUnit": "bytes",
"measurements": [{
"statistic": "VALUE",
"value": 4.256169984E9
}],
"availableTags": [{
"tag": "id",
"values": ["PS Eden Space", "PS Old Gen", "PS Survivor Space"]
}]
}
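The measurements array holds the value we need. The JVM metrics I looked at each contain only one measurement, so reading the first element of the array is enough.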
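Once the first measurement has been read, what remains is plain arithmetic: convert bytes to MB, compare used / max against the threshold, and format the sizes for the message. The sketch below isolates those steps. The class HeapUsage and its method names are hypothetical; the "0.00M" pattern and the 0.95 threshold come from the article. It also differs slightly from the article's check: it requires max > 0 rather than max != 0, so a negative maximum cannot produce a misleading ratio.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical helper that isolates the heap arithmetic used by the check.
class HeapUsage {
    private static final double BYTES_PER_MB = 1024 * 1024d;
    // Locale.ROOT keeps "." as the decimal separator whatever the JVM's default locale is.
    private static final DecimalFormat MB_FORMAT =
            new DecimalFormat("0.00M", DecimalFormatSymbols.getInstance(Locale.ROOT));

    static double toMegabytes(double bytes) {
        return bytes / BYTES_PER_MB;
    }

    // True when used / max is strictly above the threshold.
    // A non-positive max is treated as "unknown" and never triggers.
    static boolean overThreshold(double usedBytes, double maxBytes, double threshold) {
        return maxBytes > 0 && usedBytes / maxBytes > threshold;
    }

    // DecimalFormat is not thread-safe, hence the synchronization.
    static synchronized String format(double megabytes) {
        return MB_FORMAT.format(megabytes);
    }
}
```

With the sample response above, toMegabytes(4.256169984E9) gives 4059.0, which format renders as 4059.00M.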
One piece of the code deserves an explanation:
// Update the accumulated over-threshold count
if (jvmMax != 0 && jvmUsed / jvmMax > threshold) {
    instanceCount.computeIfPresent(instanceName, (key, value) -> value + 1);
}
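Why count how many times the threshold has been exceeded and only alert once that count is reached? In my tests, when JVM usage is high it usually stays high for a fairly long time, so alerting on every check that exceeds the threshold would flood Feishu with messages that add little value and turn into meaningless noise.
On the other hand, heap usage sometimes climbs past the threshold and then drops again after a GC, and an alert at that moment has little reference value either. Requiring an accumulated count shows that usage has been high for quite a while and deserves attention, and only then is the notification sent.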
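To make the counting behaviour concrete, here is a small self-contained sketch that mirrors the logic in checkFn without any Spring or Reactor dependencies. The class name OverThresholdCounter and the method record are hypothetical. As in the article's code, the count is accumulated rather than consecutive: a check below the threshold leaves it unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone version of the accumulate-then-alert logic.
class OverThresholdCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int alarmCountThreshold;

    OverThresholdCounter(int alarmCountThreshold) {
        this.alarmCountThreshold = alarmCountThreshold;
    }

    // Records one check for an instance; returns true when an alert should be sent.
    boolean record(String instanceName, boolean overThreshold) {
        if (!overThreshold) {
            // Below the threshold: nothing changes and the count is kept.
            return false;
        }
        int seen = counts.getOrDefault(instanceName, 0);
        if (seen > alarmCountThreshold) {
            // Enough over-threshold checks accumulated: alert and start over.
            counts.remove(instanceName);
            return true;
        }
        counts.put(instanceName, seen + 1);
        return false;
    }
}
```

With alarmCountThreshold = 3, the first four over-threshold checks only increase the count and the fifth one triggers the alert, after which counting starts again. If alerts should only fire on consecutive over-threshold checks, one possible variation is to call counts.remove(instanceName) in the branch where overThreshold is false.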
The rest of the code needs no further explanation.