Container Cloud Platform No.7: Monitoring Kubernetes with Prometheus

scofield | Rookie Ops Notes

### Introduction


Prometheus is an excellent monitoring tool, or rather a complete monitoring solution: it covers data collection, storage, processing, visualization, and alerting in a single package. As the monitoring system recommended by the Kubernetes community, Prometheus is widely used to watch both the state of a Kubernetes cluster and the applications running on it.

Prometheus architecture diagram

So what does Prometheus Operator do? The Operator pattern, introduced by CoreOS, extends the Kubernetes API with application-specific controllers that create, configure, and manage complex stateful applications such as databases, caches, and monitoring systems. Prometheus Operator, then, is the tool that deploys and manages Prometheus on Kubernetes; its goal is to simplify and automate the maintenance of the Prometheus components.
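
As a concrete illustration of that workflow: the Operator watches custom resources and reconciles them into running workloads. Below is a minimal, hypothetical Prometheus resource (not one of the kube-prometheus manifests; the service account it references is assumed to exist):

```
# Hypothetical minimal Prometheus custom resource. Once the Operator and its
# CRDs are installed, applying this makes the Operator create and manage a
# two-replica Prometheus StatefulSet; no hand-written StatefulSet is needed.
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: demo
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus-k8s   # assumption: created by the kube-prometheus RBAC manifests
  serviceMonitorSelector: {}           # scrape targets from every ServiceMonitor in scope
EOF
```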

Prometheus Operator architecture

### Pre-deployment Preparation


1. Clone the kube-prometheus project


```
[root@k8s-master001 opt]# git clone https://github.com/prometheus-operator/kube-prometheus.git
```
2. Enter the kube-prometheus/manifests directory. It holds a large number of YAML files, so they are grouped here by component:


```
[root@k8s-master001 manifests]# ls -al
total 20
drwxr-xr-x. 10 root root  140 Sep 14 21:25 .
drwxr-xr-x. 12 root root 4096 Sep 14 21:11 ..
drwxr-xr-x.  2 root root 4096 Sep 14 21:23 adapter
drwxr-xr-x.  2 root root  189 Sep 14 21:22 alertmanager
drwxr-xr-x.  2 root root  241 Sep 14 21:22 exporter
drwxr-xr-x.  2 root root  254 Sep 14 21:23 grafana
drwxr-xr-x.  2 root root  272 Sep 14 21:22 metrics
drwxr-xr-x.  2 root root 4096 Sep 14 21:25 prometheus
drwxr-xr-x.  2 root root 4096 Sep 14 21:23 serviceMonitor
drwxr-xr-x.  2 root root 4096 Sep 14 21:11 setup
```
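
The upstream repository ships these manifests flat (apart from setup/); the per-component directories above were arranged by hand. A sketch of one way to produce that layout, assuming the upstream file-name prefixes still match their components:

```
cd kube-prometheus/manifests
mkdir -p adapter alertmanager exporter grafana metrics prometheus serviceMonitor
# Move files into directories by file-name prefix. The prefix-to-directory
# mapping is an assumption based on upstream naming; verify with ls first.
mv prometheus-adapter-*.yaml       adapter/
mv alertmanager-*.yaml             alertmanager/
mv node-exporter-*.yaml            exporter/
mv grafana-*.yaml                  grafana/
mv kube-state-metrics-*.yaml       metrics/
mv prometheus-serviceMonitor*.yaml serviceMonitor/
mv prometheus-*.yaml               prometheus/   # whatever remains is the Prometheus server itself
```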

3. Update the nodeSelector in the YAML files.
First, check the labels currently set on the nodes:


```
[root@k8s-master001 manifests]# kubectl get node --show-labels=true
NAME            STATUS   ROLES    AGE     VERSION   LABELS
k8s-master001   Ready    master   4d16h   v1.19.0   app.storage=rook-ceph,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master001,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-master002   Ready    master   4d16h   v1.19.0   app.storage=rook-ceph,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master002,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-master003   Ready    master   4d16h   v1.19.0   app.storage=rook-ceph,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master003,kubernetes.io/os=linux,node-role.kubernetes.io/master=,role=ingress-controller
```

Then, in the YAML files under manifests, set the nodeSelector to kubernetes.io/os: linux.

For example, in setup/prometheus-operator-deployment.yaml:


```
      nodeSelector:
        kubernetes.io/os: linux
```

Update the remaining files the same way. The following command filters the manifests so you can see which ones still need changing:


```
[root@k8s-master001 manifests]# grep -A1 nodeSelector prometheus/*
prometheus/prometheus-prometheus.yaml:  nodeSelector:
prometheus/prometheus-prometheus.yaml:  nodeSelector:
prometheus/prometheus-prometheus.yaml-    kubernetes.io/os: linux
```
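
When many files need the same change, a bulk edit saves time. A hedged sketch that rewrites the older beta label to the stable one across all manifests (it assumes the files still use beta.kubernetes.io/os; check the grep output first and review the result before committing):

```
cd kube-prometheus/manifests
# See what is currently there...
grep -rA1 'nodeSelector' .
# ...then rewrite the beta label to the stable kubernetes.io/os label in place.
find . -name '*.yaml' -exec \
  sed -i 's#beta.kubernetes.io/os: linux#kubernetes.io/os: linux#g' {} +
```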

### Deploying kube-prometheus

-----



1. Install the operator


```
[root@k8s-master001 manifests]# kubectl apply -f setup/
namespace/monitoring created
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com created
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com created
clusterrole.rbac.authorization.k8s.io/prometheus-operator created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator created
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created

[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-74d54b5cfc-xgqg7   2/2     Running   0          2m40s
```
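
The setup/ step registers the CRDs that every later component depends on, so it is worth waiting until they are established before applying anything else; a small sketch:

```
# Block until the Operator's CRDs are accepted by the API server.
kubectl wait --for condition=Established --timeout=120s \
  crd/prometheuses.monitoring.coreos.com \
  crd/servicemonitors.monitoring.coreos.com \
  crd/alertmanagers.monitoring.coreos.com \
  crd/prometheusrules.monitoring.coreos.com
```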

2. Install the adapter


```
[root@k8s-master001 manifests]# kubectl apply -f adapter/
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
clusterrole.rbac.authorization.k8s.io/prometheus-adapter created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-adapter created
clusterrolebinding.rbac.authorization.k8s.io/resource-metrics:system:auth-delegator created
clusterrole.rbac.authorization.k8s.io/resource-metrics-server-resources created
configmap/adapter-config created
deployment.apps/prometheus-adapter created
rolebinding.rbac.authorization.k8s.io/resource-metrics-auth-reader created
service/prometheus-adapter created
serviceaccount/prometheus-adapter created
servicemonitor.monitoring.coreos.com/prometheus-adapter created

[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-adapter-557648f58c-9x446    1/1     Running   0          41s
prometheus-operator-74d54b5cfc-xgqg7   2/2     Running   0          4m33s
```
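
prometheus-adapter backs the aggregated metrics.k8s.io API, so the quickest sanity check is to query that API directly. Note that the adapter sources its numbers from Prometheus, so expect empty or error results until Prometheus itself is running (step 6):

```
# If the adapter is healthy, the resource-metrics API answers with node metrics.
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | head -c 300; echo
# kubectl top talks to the same API under the hood.
kubectl top nodes
```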

3. Install Alertmanager


```
[root@k8s-master001 manifests]# kubectl apply -f alertmanager/
alertmanager.monitoring.coreos.com/main created
secret/alertmanager-main created
service/alertmanager-main created
serviceaccount/alertmanager-main created
servicemonitor.monitoring.coreos.com/alertmanager created

[root@k8s-master001 ~]# kubectl get po -n monitoring
NAME                  READY   STATUS    RESTARTS   AGE
alertmanager-main-0   2/2     Running   0          53m
alertmanager-main-1   2/2     Running   0          3m3s
alertmanager-main-2   2/2     Running   0          53m
```
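
A quick liveness check for Alertmanager, using a temporary port-forward against its v2 API:

```
# Forward the Alertmanager Service to localhost and ask for its status.
kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093 &
sleep 2
curl -s http://127.0.0.1:9093/api/v2/status | head -c 300; echo
kill %1   # stop the port-forward
```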

4. Install node-exporter


```
[root@k8s-master001 manifests]# kubectl apply -f exporter/
clusterrole.rbac.authorization.k8s.io/node-exporter created
clusterrolebinding.rbac.authorization.k8s.io/node-exporter created
daemonset.apps/node-exporter created
service/node-exporter created
serviceaccount/node-exporter created
servicemonitor.monitoring.coreos.com/node-exporter created

[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME                  READY   STATUS    RESTARTS   AGE
node-exporter-2rvtt   2/2     Running   0          108s
node-exporter-9kwb6   2/2     Running   0          108s
node-exporter-9zlbb   2/2     Running   0          108s
```
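
node-exporter runs as a DaemonSet, so there should be one pod per node (three in this cluster); a one-glance check:

```
# DESIRED/READY below should equal the number of schedulable nodes.
kubectl -n monitoring get daemonset node-exporter
kubectl get nodes --no-headers | wc -l
```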

5. Install kube-state-metrics

```
[root@k8s-master001 manifests]# kubectl apply -f metrics
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
service/kube-state-metrics created
serviceaccount/kube-state-metrics created
servicemonitor.monitoring.coreos.com/kube-state-metrics created

[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME                                  READY   STATUS    RESTARTS   AGE
kube-state-metrics-85cb9cfd7c-v9c4f   3/3     Running   0          2m8s
```

6. Install Prometheus


```
[root@k8s-master001 manifests]# kubectl apply -f prometheus/
clusterrole.rbac.authorization.k8s.io/prometheus-k8s created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-operator created
prometheus.monitoring.coreos.com/k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s-config created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
rolebinding.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s-config created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
role.rbac.authorization.k8s.io/prometheus-k8s created
prometheusrule.monitoring.coreos.com/prometheus-k8s-rules created
service/prometheus-k8s created
serviceaccount/prometheus-k8s created

[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME               READY   STATUS    RESTARTS   AGE
prometheus-k8s-0   3/3     Running   1          94s
prometheus-k8s-1   3/3     Running   1          94s
```
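
Once the prometheus-k8s pods are running, the Prometheus HTTP API can confirm that scraping works; a sketch using a temporary port-forward:

```
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
sleep 2
# 'up' is 1 for every target Prometheus scrapes successfully.
curl -s 'http://127.0.0.1:9090/api/v1/query?query=up' | head -c 300; echo
kill %1
```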

7. Install Grafana


```
[root@k8s-master001 manifests]# kubectl apply -f grafana/
secret/grafana-datasources created
configmap/grafana-dashboard-apiserver created
configmap/grafana-dashboard-cluster-total created
configmap/grafana-dashboard-controller-manager created
configmap/grafana-dashboard-k8s-resources-cluster created
configmap/grafana-dashboard-k8s-resources-namespace created
configmap/grafana-dashboard-k8s-resources-node created
configmap/grafana-dashboard-k8s-resources-pod created
configmap/grafana-dashboard-k8s-resources-workload created
configmap/grafana-dashboard-k8s-resources-workloads-namespace created
configmap/grafana-dashboard-kubelet created
configmap/grafana-dashboard-namespace-by-pod created
configmap/grafana-dashboard-namespace-by-workload created
configmap/grafana-dashboard-node-cluster-rsrc-use created
configmap/grafana-dashboard-node-rsrc-use created
configmap/grafana-dashboard-nodes created
configmap/grafana-dashboard-persistentvolumesusage created
configmap/grafana-dashboard-pod-total created
configmap/grafana-dashboard-prometheus-remote-write created
configmap/grafana-dashboard-prometheus created
configmap/grafana-dashboard-proxy created
configmap/grafana-dashboard-scheduler created
configmap/grafana-dashboard-statefulset created
configmap/grafana-dashboard-workload-total created
configmap/grafana-dashboards created
deployment.apps/grafana created
service/grafana created
serviceaccount/grafana created
servicemonitor.monitoring.coreos.com/grafana created

[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME                      READY   STATUS    RESTARTS   AGE
grafana-b558fb99f-87spq   1/1     Running   0          3m14s
```
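
Grafana ships an unauthenticated health endpoint, which makes for a quick check before opening the UI:

```
kubectl -n monitoring port-forward svc/grafana 3000:3000 &
sleep 2
curl -s http://127.0.0.1:3000/api/health; echo   # expect "database": "ok"
kill %1
```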

8. Install the ServiceMonitors


```
[root@k8s-master001 manifests]# kubectl apply -f serviceMonitor/
servicemonitor.monitoring.coreos.com/prometheus created
servicemonitor.monitoring.coreos.com/kube-apiserver created
servicemonitor.monitoring.coreos.com/coredns created
servicemonitor.monitoring.coreos.com/kube-controller-manager created
servicemonitor.monitoring.coreos.com/kube-scheduler created
servicemonitor.monitoring.coreos.com/kubelet created
```
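
Each of these ServiceMonitors tells Prometheus which Services to scrape, selected by label and namespace. For illustration only (not part of kube-prometheus), a minimal ServiceMonitor for a hypothetical app whose Service carries the label app: my-app and exposes a port named web might look like this:

```
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical example
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
    - default                  # look for Services in this namespace
  selector:
    matchLabels:
      app: my-app              # ...carrying this label
  endpoints:
  - port: web                  # must match the Service port *name*, not its number
    interval: 30s
EOF
```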

9. Review everything that is running


```
[root@k8s-master001 manifests]# kubectl get po -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          90m
alertmanager-main-1                    2/2     Running   0          40m
alertmanager-main-2                    2/2     Running   0          90m
grafana-b558fb99f-87spq                1/1     Running   0          4m56s
kube-state-metrics-85cb9cfd7c-v9c4f    3/3     Running   0          10m
node-exporter-2rvtt                    2/2     Running   0          35m
node-exporter-9kwb6                    2/2     Running   0          35m
node-exporter-9zlbb                    2/2     Running   0          35m
prometheus-adapter-557648f58c-9x446    1/1     Running   0          91m
prometheus-k8s-0                       3/3     Running   1          7m49s
prometheus-k8s-1                       3/3     Running   1          7m49s
prometheus-operator-74d54b5cfc-xgqg7   2/2     Running   0          95m

NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main       ClusterIP   10.98.96.94     <none>        9093/TCP                     91m
service/alertmanager-operated   ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   91m
service/grafana                 ClusterIP   10.108.204.33   <none>        3000/TCP                     6m30s
service/kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP            12m
service/node-exporter           ClusterIP   None            <none>        9100/TCP                     36m
service/prometheus-adapter      ClusterIP   10.98.16.117    <none>        443/TCP                      93m
service/prometheus-k8s          ClusterIP   10.109.119.37   <none>        9090/TCP                     9m22s
service/prometheus-operated     ClusterIP   None            <none>        9090/TCP                     9m24s
service/prometheus-operator     ClusterIP   None            <none>        8443/TCP                     97m
```

10. Expose the Grafana and Prometheus services with NodePort and open their UIs


```
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-svc
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    app: grafana
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-svc
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
  selector:
    prometheus: k8s
```

Check the result:


```
[root@k8s-master001 manifests]# kubectl get svc -n monitoring
NAME             TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
grafana-svc      NodePort   10.99.31.100   <none>        3000:30438/TCP   9s
prometheus-svc   NodePort   10.102.245.8   <none>        9090:32227/TCP   3s
```

You can now open NodeIP:30438 and NodeIP:32227 in a browser, where NodeIP is the IP of any Kubernetes node. Alternatively, the services can be exposed through the Ingress introduced in an earlier post; a sketch follows after the screenshots below.
For example:
prometheus: http://10.26.25.20:32227

![](https://s4.51cto.com/images/blog/202103/12/e419b394e527d28549ddfea257730ebf.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)

grafana: http://10.26.25.20:30438. The default credentials are admin/admin, and you are required to change the admin password after the first login.
![](https://s4.51cto.com/images/blog/202103/12/40f3cef69a1f2fbed35313cc9dbc31be.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
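
If you prefer the Ingress approach from the earlier post over NodePort, here is a hedged sketch. It assumes an NGINX ingress controller is already running and that the hostnames resolve to an ingress node; both hostnames are hypothetical:

```
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ui
  namespace: monitoring
spec:
  rules:
  - host: grafana.example.local        # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
  - host: prometheus.example.local     # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
EOF
```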

That completes the kube-prometheus deployment; monitoring data is now visible in Prometheus.


### A Few Pitfalls

-----



### Pitfall 1

1. The Prometheus targets page shows that kube-controller-manager and kube-scheduler are not being monitored.

![](https://s4.51cto.com/images/blog/202103/12/43cde5885d66d0c8b1446b178dd1534f.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)

Solution:
A ServiceMonitor selects Services by label, and the ServiceMonitors for these two components only search the kube-system namespace. kubeadm does not create matching Services there by default, which is why the targets come up empty:


```
[root@k8s-master001 manifests]# grep -A2 -B2 selector serviceMonitor/prometheus-serviceMonitorKube*
serviceMonitor/prometheus-serviceMonitorKubeControllerManager.yaml-    matchNames:
serviceMonitor/prometheus-serviceMonitorKubeControllerManager.yaml-    - kube-system
serviceMonitor/prometheus-serviceMonitorKubeControllerManager.yaml:  selector:
serviceMonitor/prometheus-serviceMonitorKubeControllerManager.yaml-    matchLabels:
serviceMonitor/prometheus-serviceMonitorKubeControllerManager.yaml-      k8s-app: kube-controller-manager
--
serviceMonitor/prometheus-serviceMonitorKubelet.yaml-    matchNames:
serviceMonitor/prometheus-serviceMonitorKubelet.yaml-    - kube-system
serviceMonitor/prometheus-serviceMonitorKubelet.yaml:  selector:
serviceMonitor/prometheus-serviceMonitorKubelet.yaml-    matchLabels:
serviceMonitor/prometheus-serviceMonitorKubelet.yaml-      k8s-app: kubelet
--
serviceMonitor/prometheus-serviceMonitorKubeScheduler.yaml-    matchNames:
serviceMonitor/prometheus-serviceMonitorKubeScheduler.yaml-    - kube-system
serviceMonitor/prometheus-serviceMonitorKubeScheduler.yaml:  selector:
serviceMonitor/prometheus-serviceMonitorKubeScheduler.yaml-    matchLabels:
serviceMonitor/prometheus-serviceMonitorKubeScheduler.yaml-      k8s-app: kube-scheduler
```

2. Create Services for kube-controller-manager and kube-scheduler.
Kubernetes v1.19 uses HTTPS by default; kube-controller-manager listens on port 10257 and kube-scheduler on port 10259.
kube-controller-manager-scheduler.yml:



```
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257
    targetPort: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP
```

Apply it:


```
[root@k8s-master001 manifests]# kubectl apply -f kube-controller-manager-scheduler.yml
[root@k8s-master001 manifests]# kubectl get svc -n kube-system
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
kube-controller-manager   ClusterIP   None         <none>        10257/TCP   37m
kube-scheduler            ClusterIP   None         <none>        10259/TCP   37m
```

3. Create Endpoints for kube-controller-manager and kube-scheduler.
Note: change the addresses to the actual IPs of your cluster's control-plane nodes.
kube-ep.yml:



```
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.26.25.20
  - ip: 10.26.25.21
  - ip: 10.26.25.22
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.26.25.20
  - ip: 10.26.25.21
  - ip: 10.26.25.22
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
```



```
[root@k8s-master001 manifests]# kubectl apply -f kube-ep.yml
endpoints/kube-controller-manager created
endpoints/kube-scheduler created

[root@k8s-master001 manifests]# kubectl get ep -n kube-system
NAME                      ENDPOINTS                                               AGE
kube-controller-manager   10.26.25.20:10257,10.26.25.21:10257,10.26.25.22:10257   16m
kube-scheduler            10.26.25.20:10259,10.26.25.21:10259,10.26.25.22:10259   16m
```

Looking at the Prometheus targets page again, kube-controller-manager and kube-scheduler are now being monitored.

![](https://s4.51cto.com/images/blog/202103/12/df1c4a1ea8639a3172e0acdd32828324.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
### Pitfall 2

1. By default, kube-controller-manager and kube-scheduler bind to 127.0.0.1. To monitor these two services, both must be reconfigured to bind to 0.0.0.0.
2. Their configuration files live in /etc/kubernetes/manifests (a scripted version of steps 2 and 3 follows the curl check below).
In kube-controller-manager.yaml, set --bind-address=0.0.0.0.
In kube-scheduler.yaml, set --bind-address=0.0.0.0.
3. Restart the kubelet: systemctl restart kubelet
4. Check that the change took effect; an HTTP 200 response means success:



```
[root@k8s-master002 manifests]# curl -I -k https://10.26.25.20:10257/healthz
HTTP/1.1 200 OK
Cache-Control: no-cache, private
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 15 Sep 2020 06:19:32 GMT
Content-Length: 2

[root@k8s-master002 manifests]# curl -I -k https://10.26.25.20:10259/healthz
HTTP/1.1 200 OK
Cache-Control: no-cache, private
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 15 Sep 2020 06:19:36 GMT
Content-Length: 2
```
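
On a multi-master cluster these edits have to be repeated on every control-plane node; a hedged sketch that scripts steps 2 and 3, assuming kubeadm's default static-pod manifest path:

```
# Run on each master. Flips both components from loopback to all interfaces.
for f in /etc/kubernetes/manifests/kube-controller-manager.yaml \
         /etc/kubernetes/manifests/kube-scheduler.yaml; do
  sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' "$f"
done
# The kubelet also re-reads changed static-pod manifests on its own;
# restarting it simply forces the reload.
systemctl restart kubelet
```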

### Wrapping Up

-----



kube-prometheus has a great many configuration options; this post covers only the most basic setup. For anything beyond that, consult the official documentation: https://github.com/prometheus-operator/kube-prometheus

![](https://s4.51cto.com/images/blog/202103/12/9a05efce10a3296746a466dee68179c3.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
Note: the images in this post come from the internet. If any of them infringe your rights, please contact me and I will remove them promptly.