Abstract:

Prometheus Operator;ServiceMonitor;Metrics;Secret;kind;blackbox-exporter;mysql-exporter;redis-exporter

1. How Prometheus monitoring configuration works (schematic)

  • A ServiceMonitor is a Kubernetes custom resource provided by the Prometheus Operator. Through ServiceMonitors, Prometheus can be configured dynamically to monitor newly created services in Kubernetes, without manually editing Prometheus's configuration file.
  • When a ServiceMonitor is created, the Prometheus Operator automatically discovers the Services that match the label selector defined in the ServiceMonitor. A ServiceMonitor and the Service it monitors are associated through labels: by using the same labels on the Service and in the ServiceMonitor's label selector, the ServiceMonitor is tied to the target Service (see the minimal sketch after this list).
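
A minimal sketch of that label pairing (the name my-app, the label app: my-app, and the port name metrics are assumptions for illustration only):

apiVersion: v1
kind: Service
metadata:
  name: my-app                  # hypothetical application Service
  namespace: default
  labels:
    app: my-app                 # the label the ServiceMonitor selects on
spec:
  ports:
  - name: metrics               # the ServiceMonitor references this port by name
    port: 8080
    targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
spec:
  endpoints:
  - port: metrics               # must match the Service's ports.name
    interval: 30s
  namespaceSelector:
    matchNames:
    - default                   # namespace(s) where matching Services live
  selector:
    matchLabels:
      app: my-app               # must match the Service's labels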

[Figure: schematic of Prometheus Operator / ServiceMonitor configuration flow]

Troubleshooting steps when a ServiceMonitor's monitored hosts cannot be found (the commands after this list sketch each check):

  • Confirm the ServiceMonitor was created successfully
  • Confirm the ServiceMonitor's labels are configured correctly
  • Confirm Prometheus has generated the corresponding configuration
  • Confirm a Service matching the ServiceMonitor exists
  • Confirm the application's Metrics endpoint is reachable through the Service
  • Confirm the Service's port and scheme match the ServiceMonitor
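
A hedged command sketch for each check (my-app, the default namespace, and the app=my-app label are placeholders; the prometheus-k8s Secret technique is the same one used in section 3 below):

# kubectl -n monitoring get servicemonitor my-app -oyaml        # created? labels/selector correct?
# kubectl -n monitoring get secret prometheus-k8s -ojson | \
    jq -r '.data["prometheus.yaml.gz"]' | base64 -d | gunzip | grep my-app   # config generated?
# kubectl -n default get svc -l app=my-app                      # matching Service exists?
# kubectl -n default get endpoints my-app                       # addresses must not be empty
# curl http://<cluster-ip>:<port>/metrics                       # metrics reachable? port/scheme match?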

2. Troubleshooting KubeControllerManagerDown on the Prometheus [Alerts] page: check/create the ServiceMonitor, Service, and Endpoints

  • Note: the Service's and Endpoints' ports.name and labels must be identical to what the kube-controller-manager ServiceMonitor is configured with (a sketch of such a Service/Endpoints pair follows at the end of this section);
  • Confirm the Service's port and scheme match the ServiceMonitor
vim  /usr/lib/systemd/system/kube-controller-manager.service
      --bind-address=192.168.31.213 \     # the master node's IP, or --bind-address=0.0.0.0
      --authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig
# systemctl daemon-reload && systemctl restart kube-controller-manager
# kubectl get svc -n kube-system kube-controller-manager-prom
NAME                           TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)     AGE
kube-controller-manager-prom   ClusterIP   10.16.82.26   <none>        10257/TCP   142m
[root@k8s-master02 ~]# curl https://10.16.82.26:10257 -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
  • Both --authentication-kubeconfig and --authorization-kubeconfig must be set (they specify the location of the kubeconfig file that kube-controller-manager uses at runtime; the kubeconfig file contains the information needed to authenticate and authorize against the Kubernetes API server); otherwise the Prometheus Targets page reports "server returned HTTP status 403 Forbidden".
  • All three master nodes must be configured this way, with kube-controller-manager restarted on each.
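
A minimal sketch of the Service/Endpoints pair for a control plane deployed as system services (the name and port 10257 follow the kubectl output above; the label app.kubernetes.io/name: kube-controller-manager and the port name https-metrics are assumptions — check the actual ServiceMonitor with kubectl -n monitoring get servicemonitor kube-controller-manager -oyaml and copy its selector labels and port name):

apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager-prom
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kube-controller-manager   # assumption: must equal the ServiceMonitor's selector labels
spec:
  type: ClusterIP
  ports:
  - name: https-metrics          # assumption: must equal the ServiceMonitor's endpoints.port
    port: 10257
    protocol: TCP
    targetPort: 10257
---
kind: Endpoints
apiVersion: v1
metadata:
  name: kube-controller-manager-prom   # same name and namespace as the Service
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kube-controller-manager
subsets:
- addresses:                     # one entry per master node
  - ip: 192.168.31.213
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP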

3. Goal: monitor the Istio control plane istiod with Prometheus (the cluster has two Prometheus stacks installed)

  • The k8s cluster has a Prometheus stack (prometheus-operator, prometheus-adapter, grafana) installed in the monitoring namespace. A ServiceMonitor was created whose selector matches the istiod Service in the istio-system namespace, yet the Prometheus web UI shows no Istio metrics. Since prometheus-operator is present here, it is the operator in the monitoring namespace that reconciles this ServiceMonitor CRD object; under Configuration in the monitoring namespace's Prometheus UI you can find the generated entry "job_name: serviceMonitor/monitoring/istio-component-monitor/0", but the problem is that this entry never appears under Targets (cause unclear at this point).
# k -n monitoring get deploy
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
blackbox-exporter     1/1     1            1           506d
grafana               1/1     1            1           506d
kube-state-metrics    1/1     1            1           506d
mysql-exporter        1/1     1            1           502d
prometheus-adapter    2/2     2            2           506d
prometheus-operator   1/1     1            1           506d
# kubectl -n monitoring get secret prometheus-k8s -ojson | jq -r '.data["prometheus.yaml.gz"]' | base64 -d | gunzip |grep "istio-component-monitor"
- job_name: serviceMonitor/monitoring/istio-component-monitor/0
  • ServiceMonitor objects and the namespaces they live in are selected by the Prometheus object's serviceMonitorSelector and serviceMonitorNamespaceSelector. A ServiceMonitor's name is encoded into the Prometheus configuration, so it is simple to check whether it is present. The configuration generated by the Prometheus Operator is stored in a Kubernetes Secret named after the Prometheus object with the prefix prometheus-, in the same namespace as the Prometheus object (here, prometheus-k8s). For example, for the Prometheus object named k8s, you can find out whether the ServiceMonitor named istio-component-monitor has been picked up.
# kubectl -n monitoring get Prometheus
NAME   VERSION   REPLICAS   AGE
k8s    2.32.1    2          507d
  • Discovery: the cluster also has a Prometheus stack (prometheus, kiali, jaeger, grafana) installed in the istio-system namespace. After changing the istio-system Service prometheus to type NodePort, that Prometheus's web UI does show the Istio metrics. This stack has no prometheus-operator, so it does not create ServiceMonitor CRD objects from YAML.
# k -n istio-system get deploy
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
grafana                1/1     1            1           367d
jaeger                 1/1     1            1           367d
kiali                  1/1     1            1           367d
prometheus             1/1     1            1           367d
istiod                 1/1     1            1           367d
#  k -n istio-system get svc -l operator.istio.io/component=Pilot
NAME     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                 AGE
istiod   ClusterIP   10.16.21.230   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP   366d
[root@k8s-master01 ~]# curl http://10.16.21.230:15014 -k
404 page not found
[root@k8s-master01 ~]# curl https://10.16.21.230:15014 -k
curl: (35) SSL received a record that exceeded the maximum permissible length.
  • To enable the Prometheus Operator to monitor everything in all namespaces, see github.com (helm-charts/charts/kube-prometheus-stack/values.yaml), excerpted below.
  • At the end of 2016, CoreOS introduced the Operator pattern and released the Prometheus Operator as a working example of it. The Prometheus Operator automatically creates and manages Prometheus monitoring instances, making it as easy as possible to run Prometheus on Kubernetes while preserving configurability and keeping the configuration Kubernetes-native. With the Prometheus Operator you can very easily deploy Prometheus, Alertmanager, Prometheus alerting rules, and service monitors.

[Figure: Prometheus Operator architecture]

  • As a controller, the Prometheus Operator creates five CRD resource types: Prometheus, PodMonitor, ServiceMonitor, Alertmanager, and PrometheusRule, and then continuously watches and maintains the state of these five resource objects.
  • The ServiceMonitor and PodMonitor resources are abstractions over exporters that expose a metrics endpoint; Prometheus pulls data from the metrics endpoints that PodMonitors and ServiceMonitors describe. A ServiceMonitor declaratively specifies how a group of services should be monitored, and the Operator automatically generates the Prometheus scrape configuration from that definition (e.g. scrape_configs: - job_name: serviceMonitor/monitoring/kube-controller-manager/0). The resource selects the corresponding Service Endpoints by label, letting the Prometheus server fetch metrics through the selected Service.
# kubectl -n monitoring get Prometheus k8s -oyaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
spec:
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
------------------- github.com: helm-charts/charts/kube-prometheus-stack/values.yaml -------------------
    ## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
    ## prometheus resource to be created with selectors based on values in the helm deployment,
    ## which will also match the servicemonitors created
    serviceMonitorSelectorNilUsesHelmValues: true
    ## ServiceMonitors to be selected for target discovery.
    ## If {}, select all ServiceMonitors
    serviceMonitorSelector: {}
    ## Example which selects ServiceMonitors with label "prometheus" set to "somelabel"
    # serviceMonitorSelector:
    #   matchLabels:
    #     prometheus: somelabel
    ## Namespaces to be selected for ServiceMonitor discovery.
    ## See https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#namespaceselector for usage
    serviceMonitorNamespaceSelector: {}
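
A hedged sketch of overriding this at install time so the operator discovers ServiceMonitors from all namespaces regardless of helm release labels (the release name prometheus and the prometheus-community repo are assumptions):

# helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
    --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false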

4. Listing the kinds of all resources in the k8s cluster.

# kubectl api-resources --verbs=list --namespaced -o name
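
To narrow this down to just the Prometheus Operator CRDs (a quick variant):

# kubectl api-resources --api-group=monitoring.coreos.com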


5. blackbox-exporter troubleshooting: the Prometheus web UI raises the alert "PrometheusOperatorSyncFailed (1 active)"; the cause was that the key configured in the prometheus-kind CRD object k8s was missing the letter i (prometheus-addtional.yaml ----> prometheus-additional.yaml)

# k -n monitoring edit  prometheus  k8s 
spec:
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true

This configuration points at the name of a key inside the Secret additional-configs, but the key actually stored there was misspelled (prometheus-addtional.yaml: I2JsYWNrYm945).

# k -n monitoring get Secret additional-configs -oyaml
apiVersion: v1
kind: Secret
type: Opaque
data:
  prometheus-addtional.yaml: I2JsYWNrYm9455qE6......
To inspect the key's value:
# echo "I2JsYWNrYm9455qE6......" | base64 -d
  - job_name: 'k8s/additionalScrapeConfigs/blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
......

Because the key name did not match, the error reads: "key prometheus-addtional.yaml could not be found in Secret additional-configs"

# k -n monitoring logs prometheus-operator-74b8d5646f-nms2j
level=error ts=2023-11-06T01:22:19.261569193Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="Sync \"monitoring/k8s\" failed: creating config failed: loading additional scrape configs from Secret failed: key prometheus-addtional.yaml could not be found in Secret additional-configs"
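
One way to fix it is to recreate the Secret from a file with the correct name (a sketch; assumes the scrape configuration was saved locally as prometheus-additional.yaml):

# kubectl -n monitoring create secret generic additional-configs \
    --from-file=prometheus-additional.yaml --dry-run=client -oyaml | kubectl apply -f -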

After the fix, the log reads msg="sync prometheus":

level=info ts=2023-11-06T01:46:27.745466848Z caller=operator.go:1218 component=prometheusoperator key=monitoring/k8s msg="sync prometheus"

6. mysql-exporter: monitoring MySQL (Server version: 5.7.23). Create a MySQL user and grant privileges:

# k get pod,svc -owide
NAME                           STATUS    RESTARTS   AGE    IP             
pod/mysql-744d5546f7-8tpfm      Running   0          6d5h   172.25.92.126   
NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   SELECTOR
service/mysql    ClusterIP   10.16.166.195   <none>        3306/TCP    app=mysql
# telnet 10.16.166.195 3306
Trying 10.16.166.195...
Connected to 10.16.166.195.
Escape character is '^]'.
J
5.7.23φ-9 0F-|m(Y],Z[3mysql_native_passwordConnection closed by foreign host.
# mysql -uroot -pmysql -h 10.16.166.195 -P 3306
CREATE USER 'exporter'@'%' IDENTIFIED BY 'mipw' WITH MAX_USER_CONNECTIONS 3;
GRANT  PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
-- If you see "Your password does not satisfy the current policy requirements", check the password complexity settings:
SHOW VARIABLES LIKE 'validate_password%';
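
A quick way to verify that the new account works (a sketch using the values above):

# mysql -uexporter -pmipw -h 10.16.166.195 -P 3306 -e "SHOW GRANTS FOR CURRENT_USER();"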

Create the mysql-exporter Deployment and Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  template:
    metadata:
      labels:
        k8s-app: mysql-exporter
    spec:
      containers:
      - name: mysql-exporter
        env: 
        - name: DATA_SOURCE_NAME
          # value: "exporter:mipw@tcp(mysql.default:3306)/"
          # value: "root:mysql@tcp(mysql.default:3306)/"
          value: "root:123456@(172.26.54.225:3306)/"
# wordpress-mysql-668d75584d-zmj6b is at 172.26.54.225; because the Service's CLUSTER-IP is None (headless), the pod IP has to be used. With duplicate keys only the last value would take effect, so the unused DSNs are commented out here.
        image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9104
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: mysql-exporter
  ports:
  - name: api
    port: 9104
    protocol: TCP
    targetPort: 9104

Problem handling: the Deployment's value "root:123456@(172.26.54.225:3306)/" contained a full-width Chinese quotation mark (“) instead of the ASCII double quote ("), which caused the error Access denied for user '“exporter'@'172.26.54.236'; after replacing it with the ASCII quote ", the error disappeared.

# k -n monitoring logs mysql-exporter-54976bd667-xcjll
level=error ts=2023-11-07T02:00:15.025Z caller=exporter.go:149 msg="Error pinging mysqld" err="Error 1045: Access denied for user '“exporter'@'172.26.54.236' (using password: YES)"

Use curl against mysql-exporter to check whether it is collecting data (mysql_up is 1 when the exporter can reach MySQL; the 0 below shows the connection was still failing at that point):

# k -n monitoring get svc,pod  -owide
service/mysql-exporter          ClusterIP   10.16.110.38      9104/TCP    k8s-app=mysql-exporter
pod/mysql-exporter-5c884ffc6f-rwxrt        Running   172.25.214.252
# curl 172.25.214.252:9104/metrics
mysql_up 0
# HELP mysqld_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which mysqld_exporter was built.
# TYPE mysqld_exporter_build_info gauge
mysqld_exporter_build_info{branch="HEAD",goversion="go1.16.4",revision="ad2847c7fa67b9debafccd5a08bacb12fc9031f1",version="0.13.0"} 1

Then create the ServiceMonitor (port: api is the name of mysql-exporter's port 9104; matchLabels selects service/mysql-exporter by label):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    app: mysql-exporter
    namespace: monitoring
spec:
  jobLabel: k8s-app
  endpoints:
  - interval: 30s
    port: api
    scheme: http
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      k8s-app: mysql-exporter
Verify the Service selected by those labels:
# k -n monitoring get svc -l k8s-app=mysql-exporter
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
mysql-exporter   ClusterIP   10.16.110.38   <none>        9104/TCP   112m

The following then appears under Targets in the Prometheus web UI:

serviceMonitor/monitoring/mysql-exporter/0 (1/1 up)

Endpoint:         http://172.25.214.252:9104/metrics
State:            UP
Labels:           container="mysql-exporter" endpoint="api" instance="172.25.214.252:9104" job="mysql-exporter" namespace="monitoring" pod="mysql-exporter-5c884ffc6f-rwxrt" service="mysql-exporter"
Last Scrape:      18.728s ago
Scrape Duration:  8.520ms
Error:            (none)

Import dashboards into Grafana (Import dashboard from file or Grafana.com); Grafana then shows the MySQL metrics normally.

https://grafana.com/grafana/dashboards/6239
https://grafana.com/grafana/dashboards/7362
https://grafana.com/grafana/dashboards/11329

7. Install mysql-exporter on the physical host where the MySQL service runs, to provide the Exporter for MySQL server metrics.

# more my.cnf 
[client]
user=exporter
password=qaz_WSX1
# ./mysqld_exporter --config.my-cnf=/home/mysqld_exporter/my.cnf
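
To confirm the exporter is reachable over the network before wiring it into the cluster (a quick check; 192.168.31.158 is the host IP used in the Endpoints below):

# curl -s http://192.168.31.158:9104/metrics | grep mysql_up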

Create an Endpoints object pointing at that physical host, plus the matching Service and ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: airnet-exporter
  namespace: monitoring
  labels:
    app: airnet-exporter
    namespace: monitoring
spec:
  jobLabel: airnet-app
  endpoints:
  - interval: 30s
    port: api
    scheme: http    
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      k8s-app: airnet-exporter     
---
apiVersion: v1
kind: Service
metadata:
  name: airnet-exporter
  namespace: monitoring
  labels:
    k8s-app: airnet-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: airnet-exporter
  ports:
  - name: api
    port: 9104
    protocol: TCP
    targetPort: 9104         
---
kind: Endpoints
apiVersion: v1
metadata:
  name: airnet-exporter
  namespace: monitoring  
  labels:
    k8s-app: airnet-exporter 
subsets:
  - addresses:
      - ip: 192.168.31.158
    ports:
      - port: 9104
        name: api
        protocol: TCP
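
Verify the manual binding afterwards (the Endpoints' address should show the host IP and must not be empty; see the pitfall in section 8):

# kubectl -n monitoring get endpoints airnet-exporter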

8. Monitoring Redis. One ServiceMonitor selects multiple Services by label (to monitor the Services/Endpoints of multiple physical servers)

/usr/lib/systemd/system/redis_exporter.service
[Unit]
Description=redis_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/redis_exporter -redis.addr  127.0.0.1:6379  -redis.password cdatc  -web.listen-address :59121
Restart=always
[Install]
WantedBy=multi-user.target
# systemctl enable redis_exporter.service
# systemctl start  redis_exporter.service
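
Once the unit is running, a quick local check (redis_up is redis_exporter's reachability gauge):

# curl -s http://127.0.0.1:59121/metrics | grep redis_up

The ServiceMonitor, Services, and Endpoints manifests: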
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-exporter
  namespace: monitoring
  labels:
    k8s-app: redis-exporter
    namespace: monitoring
spec:
  jobLabel: redis-app
  endpoints:
  - interval: 30s
    port: api
    scheme: http    
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      k8s-app: redis-exporter     
---
apiVersion: v1
kind: Service
metadata:
  name: redis-exporter
  namespace: monitoring
  labels:
    k8s-app: redis-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: redis-exporter
  ports:
  - name: api
    port: 59121
    protocol: TCP
    targetPort: 59121         
---
kind: Endpoints
apiVersion: v1
metadata:
  name: redis-exporter
  namespace: monitoring  
  labels:
    k8s-app: redis-exporter
subsets:
  - addresses:
      - ip: 192.168.31.184
    ports:
      - port: 59121
        name: api
        protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: fdpredis-exporter
  namespace: monitoring
  labels:
    k8s-app: redis-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: fdpredis-exporter
  ports:
  - name: api
    port: 59121
    protocol: TCP
    targetPort: 59121 
---
kind: Endpoints
apiVersion: v1
metadata:
  name: fdpredis-exporter
  namespace: monitoring  
  labels:
    k8s-app: fdpredis-exporter
subsets:
  - addresses:
      - ip: 192.168.31.158
    ports:
      - port: 59121
        name: api
        protocol: TCP

Problem: the manually specified IP addresses of the external Endpoints became empty after a while, leaving Grafana with no data:

Cause: a Service without a selector gets no Endpoints information automatically, so the Endpoints binding has to be created by hand (the endpoints may point at in-cluster pods or at external services). The examples above, however, configured the Services *with* selectors, so the Service controller kept looking for pods matching each selector and created a same-named Endpoints object (finding no matching pods). Because of the identical name, that automatically created object overwrote the hand-written external-IP binding, which is why the Endpoints' IP addresses went empty.
——原因是对于没有selector的service不会出现Endpoint的信息,需要手工创建Endpoint绑定,Endpoint可以是内部的pod,也可以是外部的服务。以上例子中是配置了selector的service,Service Controller会自动查找匹配这个selector的pod,并且创建出一个同名的endpoint对象(没有查找到匹配这个selector的pod),所以出现endpoint的IP地址为空的情况(因为同名,覆写了手工创建Endpoint外部IP绑定)。