使用Prometheus+Grafana打造Flink监控页面

1. 下载对应的安装

组件

版本

下载地址

Prometheus

2.36.1

官网

Pusgateway

1.4.3

官网

node_exporter

1.3.1

官网

Grafana

8.5.6

官网

2. 安装并运行
1. 安装pushgateway

将pushgateway解压到指定目录后,直接运行pushgateway即可

启动pushgateway

nohup ./pushgateway --web.listen-address=":9092" > ./pushgateway-start.log 2>&1 &

将push gateway安装为系统服务,让systemctl来管理,直接修改/usr/lib/systemd/system/pushgateway.service,写入以下内容:

[Unit]
Description=pushgateway
After=local-fs.target network-online.target network.target
Wants=local-fs.target network-online.target network.target

[Service]
ExecStart=/opt/modules/pushgateway/pushgateway --web.listen-address=:9092
Restart=on-failure

[Install]
WantedBy=multi-user.target

设置开机自启

systemctl enable pushgateway
systemctl start pushgateway
systemctl status pushgateway
2. 安装node_exporter

将node_exporter解压到指定目录后,直接运行node_exporter即可

启动node_exporter

nohup ./node_exporter --web.listen-address=":9100" > ./node_exporter-start.log 2>&1 &

将node_exporter安装为系统服务,让systemctl来管理,直接修改/usr/lib/systemd/system/node_exporter.service,写入以下内容:

[Unit]
Description=node_exporter
After=local-fs.target network-online.target network.target
Wants=local-fs.target network-online.target network.target

[Service]
ExecStart=/opt/modules/node_exporter/node_exporter --web.listen-address=:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target

设置开机自启

systemctl enable node_exporter
systemctl start node_exporter
systemctl status node_exporter
3. 安装Prometheus

解压Prometheus到安装目录,并修改目录下的prometheus.yml配置文件:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
    # 默认端口是9090
      - targets: ["cdh1:9093"]
        labels:
          instances: prometheus

  - job_name: "pushgateway"
    # 默认为1分钟
    scrape_interval: 5s
    static_configs:
    # 默认端口是9091
      - targets: ["cdh1:9092"]
        labels:
          instances: pushgateway

  - job_name: "node_exporter"
    scrape_interval: 5s
    static_configs:
      - targets: ["cdh1:9100"]
        labels:
          instances: node_exporter

启动Prometheus服务

nohup ./prometheus --config.file=./prometheus.yml --web.listen-address=":9093" > ./prometheus-start.log 2>&1 &

启动之后在web页面上可以看到pushgateway、node_exporter以及prometheus的运行情况,地址ip:port/targets

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-iARGX1s8-1655369166919)(http://yanko.test.upcdn.net/images/prometheus-targets.jpg)]

将prometheus安装为系统服务,让systemctl来管理,直接修改/usr/lib/systemd/system/prometheus.service,写入以下内容:

[Unit]
Description=prometheus
After=local-fs.target network-online.target network.target
Wants=local-fs.target network-online.target network.target

[Service]
ExecStart=/opt/modules/prometheus/prometheus --config.file=/opt/modules/prometheus/prometheus.yml --web.listen-address=:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target

设置开机自启

systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus
4. 安装Grafana
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-9.0.0-1.x86_64.rpm
sudo yum install grafana-enterprise-9.0.0-1.x86_64.rpm

启动Grafana服务

sudo systemctl start grafana-server

启动Grafana后,可以访问web页面,地址是ip:3000,默认用户名和密码是admin/admin

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-60H9LHUk-1655369166919)(http://yanko.test.upcdn.net/images/grafana-login.jpg)]

3. 配置Grafana数据源

进入Setting中选择Add Data source即可添加数据源,选择prometheus即可,添加完成后保存测试。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-rVYP0qsI-1655369166920)(http://yanko.test.upcdn.net/images/grafana-prometheus.jpg)]

4. 配置Flink上报Metrics

修改flink-conf.yaml配置文件,新增如下配置:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
# pushgateway的地址和端口号
metrics.reporter.promgateway.host: cdh1
metrics.reporter.promgateway.port: 9092
metrics.reporter.promgateway.jobName: FlinkJob
# 是否自动生成JobName
metrics.reporter.promgateway.randomJobNameSuffix: true
# 作业结束后删除其对应的Metrics
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.groupingKey: k1=v1;k2=v2
# 上报Metrics的频率
metrics.reporter.promgateway.interval: 10 SECONDS

配置好之后运行flink的测试案例:

bin/flink run -d -t yarn-per-job ./examples/streaming/TopSpeedWindowing.jar

再次访问pushgateway的页面可以看到如下的metrics上报信息

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vQQ7KHGc-1655369166920)(http://yanko.test.upcdn.net/images/pushgateway-metrics.jpg)]

5. 登陆Grafana配置Flink监控面板

访问Grafana的dashboards页面,下载Flink的dashboards模板,官网下载地址https://grafana.com/grafana/dashboards/

下载好之后打开Grafana Web UI通过import导入刚刚所下载的Flink Metrics监控模板

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-4Vq3RE7D-1655369166920)(http://yanko.test.upcdn.net/images/grafana-flink.jpg)]

之后打开该dashboards即可看到flink的metrics信息

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-sNmZxruN-1655369166920)(http://yanko.test.upcdn.net/images/grafana-flink-metrics.jpg)]