First, the Spring Boot Admin console already displays many monitoring metrics, as shown below:

[Screenshot: monitoring metrics in the Spring Boot Admin console]


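In day-to-day work, when a service instance misbehaves and we troubleshoot it, heap memory, GC, and thread count are all things we need to look at. So we would also like an alert when JVM heap usage reaches a threshold, and the alert should report the current heap size, the used size, the current total thread count, and other metrics worth checking.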

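However, SBA2 does not provide this kind of alarm event, so we implement a simple one ourselves. Here is what the final Feishu notification looks like:

[Screenshot: JVM alarm message in Feishu]

Without further ado, let's look at the code: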

AlarmMessage

public interface AlarmMessage {

    /**
     * Sends a plain-text alarm.
     * @param content the alarm text to send
     */
    void sendData(String content);
}

Notifications are sent by reusing the FeiShuNotifier class.
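The FeiShuNotifier class itself is not shown here. As a rough stand-in, the sketch below shows one way an AlarmMessage implementation could post plain text to a Feishu custom-bot webhook using only the JDK (Java 11+). The class name FeiShuTextAlarm, the helpers toPayload and escapeJson, and the constructor parameter are all hypothetical; the payload shape (msg_type "text" with a content.text field) is assumed from Feishu's custom bot format and should be checked against the bot you actually use.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Repeated from the article so that the sketch compiles on its own.
interface AlarmMessage {
    void sendData(String content);
}

// Hypothetical stand-in for FeiShuNotifier; not the author's implementation.
final class FeiShuTextAlarm implements AlarmMessage {

    private final HttpClient client = HttpClient.newHttpClient();
    private final String webhookUrl; // assumption: full webhook URL of the bot

    FeiShuTextAlarm(String webhookUrl) {
        this.webhookUrl = webhookUrl;
    }

    // Builds the JSON body for a plain-text bot message.
    static String toPayload(String content) {
        return "{\"msg_type\":\"text\",\"content\":{\"text\":\""
                + escapeJson(content) + "\"}}";
    }

    // Escapes the characters that may not appear raw inside a JSON string.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    @Override
    public void sendData(String content) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhookUrl))
                .header("Content-Type", "application/json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(
                        toPayload(content), StandardCharsets.UTF_8))
                .build();
        // Fire and forget: the response is discarded and failures are not retried.
        client.sendAsync(request, HttpResponse.BodyHandlers.discarding());
    }
}
```

Sending asynchronously keeps a slow webhook from blocking the periodic check; a production version would probably log non-2xx responses as well.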
NotifierAutoConfiguration.jvmAlarm

@Bean(initMethod = "start", destroyMethod = "stop")
@ConditionalOnProperty(prefix = "spring.boot.admin.notify.jvm", name = "enabled", havingValue = "true")
@ConfigurationProperties("spring.boot.admin.notify.jvm")
@Lazy(false)
public JvmAlarm jvmAlarm(InstanceRepository repository, AlarmMessage alarmMessage) {
    return new JvmAlarm(repository, alarmMessage);
}

This defines the JVM alarm configuration. The alarm is enabled when spring.boot.admin.notify.jvm.enabled=true.
Initialization in start
JvmAlarm.start

private void start() {
        this.scheduler = Schedulers.newSingle("jvm-check");
        this.subscription = Flux.interval(Duration.ofSeconds(this.interval)).subscribeOn(this.scheduler).subscribe(this::checkFn);
    }

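start creates the scheduled task that periodically checks the JVM metrics.
stop: when the bean is destroyed, the scheduled task is disposed of as well.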

private void stop() {
        if (this.subscription != null) {
            this.subscription.dispose();
            this.subscription = null;
        }
        if (this.scheduler != null) {
            this.scheduler.dispose();
            this.scheduler = null;
        }
    }
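For readers less familiar with Reactor, the same start/stop lifecycle can be sketched with a plain JDK ScheduledExecutorService. This is a swapped-in technique for illustration, not what JvmAlarm uses; the class name PeriodicCheck and its methods are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical JDK-only equivalent of the start/stop pair above.
final class PeriodicCheck {

    private final Runnable check;
    private final long intervalMillis;
    private ScheduledExecutorService executor;

    PeriodicCheck(Runnable check, long intervalMillis) {
        this.check = check;
        this.intervalMillis = intervalMillis;
    }

    // Creates a single named daemon thread and schedules the check on it.
    synchronized void start() {
        if (executor != null) {
            return; // already running
        }
        executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "jvm-check");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleWithFixedDelay(
                this::runSafely, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // An uncaught exception would cancel all further runs, so catch it here.
    private void runSafely() {
        try {
            check.run();
        } catch (RuntimeException ex) {
            System.err.println("jvm check failed: " + ex.getMessage());
        }
    }

    // Stops the schedule and releases the thread; safe to call more than once.
    synchronized void stop() {
        if (executor != null) {
            executor.shutdownNow();
            executor = null;
        }
    }

    synchronized boolean isRunning() {
        return executor != null;
    }
}
```

As in the Reactor version, a dedicated single thread means that checks never overlap, and stop leaves the object in a state from which start can be called again.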

JvmAlarm

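// Imports below assume Spring Boot Admin 2.x, fastjson 1.x, Reactor and SLF4J.
import java.text.DecimalFormat;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import de.codecentric.boot.admin.server.domain.entities.Instance;
import de.codecentric.boot.admin.server.domain.entities.InstanceRepository;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestTemplate;
import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class JvmAlarm {

    // Logger used by the log.debug / log.error calls below.
    private static final Logger log = LoggerFactory.getLogger(JvmAlarm.class);

    private final RestTemplate restTemplate = new RestTemplate();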

    private Scheduler scheduler;

    private Disposable subscription;

    private InstanceRepository repository;


    private AlarmMessage alarmMessage;

    /**
     * Heap usage threshold (used / max), e.g. 0.95 means 95%
     */
    private double threshold = 0.95;

    /**
     * Number of cumulative threshold breaches required before alerting
     */
    private int alarmCountThreshold = 3;

    /**
     * Check interval, in seconds
     */
    private long interval = 10;

    /**
     * Number format for memory sizes (megabytes, two decimals)
     */
    private final DecimalFormat df = new DecimalFormat("0.00M");

    /**
     * Names of instances to exclude from the check
     */
    private String excludeInstances = "";

    /**
     * On/off switch
     */
    private boolean enabled = true;

    /**
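     * Alarm message template
     */
    private final String ALARM_TPL = "Service instance [%s]: memory usage exceeded threshold [%s], cumulative [%s] times; current max memory [%s], used [%s], current thread count [%s]";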

    /**
     * Number of threshold breaches recorded per instance name
     */
    private final Map<String, Integer> instanceCount = new HashMap<>();

    public JvmAlarm(InstanceRepository repository, AlarmMessage alarmMessage) {
        this.repository = repository;
        this.alarmMessage = alarmMessage;
    }

    private void checkFn(Long aLong) {
        if (!enabled) {
            return;
        }
        log.debug("check jvm for all instances");
        repository.findAll().filter(instance -> !excludeInstances.contains(instance.getRegistration().getName())).map(instance -> {
            String instanceName = instance.getRegistration().getName();
            // Maximum heap size, in MB
            double jvmMax = getJvmValue(instance, "jvm.memory.max?tag=area:heap") / (1024*1024d);
            // Used heap size, in MB
            double jvmUsed = getJvmValue(instance, "jvm.memory.used?tag=area:heap") /  (1024*1024d);
            // Note: the counter starts at 0 and is compared with '>', so the alert fires on
            // the (alarmCountThreshold + 2)-th breach. The counter is not reset when usage
            // falls back below the threshold, so breaches accumulate across recoveries.
            if (jvmMax != 0 && jvmUsed / jvmMax > threshold && instanceCount.computeIfAbsent(instanceName, key -> 0) > alarmCountThreshold) {
                // Current number of live threads
                int threads = (int) getJvmValue(instance, "jvm.threads.live");
                String content = String.format(ALARM_TPL, instanceName, (threshold * 100) + "%", alarmCountThreshold, df.format(jvmMax), df.format(jvmUsed), threads);
                alarmMessage.sendData(content);
                // Start counting again
                instanceCount.remove(instanceName);
            }
            // Update the cumulative number of threshold breaches
            if (jvmMax != 0 && jvmUsed / jvmMax > threshold) {
                instanceCount.computeIfPresent(instanceName, (key, value) -> value + 1);
            }
            return Mono.just(0d);
        }).subscribe();
    }

    private long getJvmValue(Instance instance, String tags) {
        try {
            String reqUrl = instance.getRegistration().getManagementUrl() + "/metrics/" + tags;
            log.debug("check jvm {},uri {}", instance.getRegistration().getName(), reqUrl);
            ResponseEntity<String> responseEntity = restTemplate.getForEntity(reqUrl, String.class);
            String body = responseEntity.getBody();
            JSONObject bodyObject = JSON.parseObject(body);
            JSONArray measurementsArray = bodyObject.getJSONArray("measurements");
            if (measurementsArray != null && !measurementsArray.isEmpty()) {
                return measurementsArray.getJSONObject(0).getLongValue("value");
            }
        } catch (Exception ex) {
            log.error(ex.getMessage());
        }
        return 0L;
    }

    public long getInterval() {
        return interval;
    }

    public void setInterval(long interval) {
        this.interval = interval;
    }

    public String getExcludeInstances() {
        return excludeInstances;
    }

    public void setExcludeInstances(String excludeInstances) {
        this.excludeInstances = excludeInstances;
    }

    public boolean isEnabled() {
        return enabled;
    }

    public void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    public double getThreshold() {
        return threshold;
    }

    public void setThreshold(double threshold) {
        this.threshold = threshold;
    }

    public int getAlarmCountThreshold() {
        return alarmCountThreshold;
    }

    public void setAlarmCountThreshold(int alarmCountThreshold) {
        this.alarmCountThreshold = alarmCountThreshold;
    }

    private void start() {
        this.scheduler = Schedulers.newSingle("jvm-check");
        this.subscription = Flux.interval(Duration.ofSeconds(this.interval)).subscribeOn(this.scheduler).subscribe(this::checkFn);
    }

    private void stop() {
        if (this.subscription != null) {
            this.subscription.dispose();
            this.subscription = null;
        }
        if (this.scheduler != null) {
            this.scheduler.dispose();
            this.scheduler = null;
        }
    }
}

All of these JVM metrics can be fetched from /actuator/metrics. The response below lists every available metric name (I formatted the output):

{
	"names": ["http.server.requests", "jvm.buffer.count", "jvm.buffer.memory.used", "jvm.buffer.total.capacity", "jvm.classes.loaded", "jvm.classes.unloaded", "jvm.gc.live.data.size", "jvm.gc.max.data.size", "jvm.gc.memory.allocated", "jvm.gc.memory.promoted", "jvm.gc.pause", "jvm.memory.committed", "jvm.memory.max", "jvm.memory.used", "jvm.threads.daemon", "jvm.threads.live", "jvm.threads.peak", "jvm.threads.states", "logback.events", "process.cpu.usage", "process.start.time", "process.uptime", "system.cpu.count", "system.cpu.usage", "tomcat.sessions.active.current", "tomcat.sessions.active.max", "tomcat.sessions.alive.max", "tomcat.sessions.created", "tomcat.sessions.expired", "tomcat.sessions.rejected"]
}

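Append a metric name to /actuator/metrics/ to get that metric's details. For example, the JVM maximum memory is at /actuator/metrics/jvm.memory.max. To narrow the result by tag, use jvm.memory.max?tag=area:heap. Other metrics work the same way, so I won't go through them.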

{
	"name": "jvm.memory.max",
	"description": "The maximum amount of memory in bytes that can be used for memory management",
	"baseUnit": "bytes",
	"measurements": [{
		"statistic": "VALUE",
		"value": 4.256169984E9
	}],
	"availableTags": [{
		"tag": "id",
		"values": ["PS Eden Space", "PS Old Gen", "PS Survivor Space"]
	}]
}

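The measurements array holds the value we need. From what I observed, each of these JVM metrics has only one measurement, so taking the first element of the array is enough.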
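To make the request and the extraction concrete, here is a small JDK-only sketch that builds the metric URL, pulls the first "value" out of a response body, and converts bytes to megabytes. The class MetricSketch and its methods are hypothetical, and the regular expression is a deliberate simplification that assumes the first "value" key in the body belongs to measurements; in real code a JSON library such as the one used in getJvmValue is the better tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; illustrates the steps, not the article's implementation.
final class MetricSketch {

    // Matches the first "value": <number> pair, including exponent notation.
    private static final Pattern VALUE = Pattern.compile(
            "\"value\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?)");

    // Builds <managementUrl>/metrics/<metric>[?tag=<tag>].
    static String metricUrl(String managementUrl, String metric, String tag) {
        String base = managementUrl.endsWith("/")
                ? managementUrl.substring(0, managementUrl.length() - 1)
                : managementUrl;
        String url = base + "/metrics/" + metric;
        return tag == null ? url : url + "?tag=" + tag;
    }

    // Returns the first measurement value, or 0 when none is found.
    static double firstMeasurement(String body) {
        Matcher m = VALUE.matcher(body);
        return m.find() ? Double.parseDouble(m.group(1)) : 0d;
    }

    // Converts a byte count to megabytes (1 MB = 1024 * 1024 bytes here).
    static double toMegabytes(double bytes) {
        return bytes / (1024 * 1024d);
    }
}
```

With the sample response above, the value 4.256169984E9 bytes works out to 4059 MB, which is the figure the alarm would report as the maximum heap size.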
One piece of the JvmAlarm code needs explaining:

// Update the cumulative number of threshold breaches
if (jvmMax != 0 && jvmUsed / jvmMax > threshold) {
	instanceCount.computeIfPresent(instanceName, (key, value) -> value + 1);
}

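Why keep a cumulative breach count and alert only once it reaches the configured number? In testing, when JVM heap usage is high it usually stays high for quite a while. Alerting on every breach would flood Feishu with messages that add little value and only create noise.
On the other hand, heap usage sometimes climbs past the threshold and then drops again after a GC, and a notification at that moment is not useful either. Requiring a cumulative count shows that heap usage has been high for an extended period and deserves attention, and only then is the notification sent.
I won't go into the remaining details.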
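The counting idea can be isolated into a small class that is easy to test. The sketch below is a variant, not the code from JvmAlarm: it alerts on exactly the configured number of breaches and, unlike the listing above, clears the count when usage falls back below the threshold, so only consecutive breaches add up. The class name BreachCounter and its record method are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical variant of the breach counter; resets on recovery.
final class BreachCounter {

    private final double threshold; // e.g. 0.95 for 95% heap usage
    private final int alarmCount;   // consecutive breaches needed to alert
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    BreachCounter(double threshold, int alarmCount) {
        this.threshold = threshold;
        this.alarmCount = alarmCount;
    }

    // Records one sample and returns true when an alert should be sent.
    boolean record(String instance, double used, double max) {
        if (max <= 0 || used / max <= threshold) {
            // Usage is back under the threshold (or max is unknown): start over.
            counts.remove(instance);
            return false;
        }
        int breaches = counts.merge(instance, 1, Integer::sum);
        if (breaches >= alarmCount) {
            // Alert once, then count from zero again to avoid repeats.
            counts.remove(instance);
            return true;
        }
        return false;
    }
}
```

With a 10-second interval and alarmCount set to 3, this variant would send at most one message every 30 seconds while usage stays high, and none at all for a spike that a GC clears within two checks.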