Prometheus Container Monitoring System (Part 2)
Lab environment
Host | IP | Memory | OS | Notes |
--- | --- | --- | --- | --- |
Prometheus-server | 192.168.200.132 | 2G | centos7.5 | |
Prometheus-client1 | 192.168.200.147 | 1G | centos7.5 | |
Prometheus-client2 | 192.168.200.131 | 1G | centos7.5 | |
k8s-master | 192.168.200.132 | 4G | centos7.5 | full clone; 40G disk; 2+ CPU cores |
k8s-node | 192.168.200.137 | 4G | centos7.5 | full clone; 40G disk; 2+ CPU cores |
Monitoring Kubernetes
Monitoring targets
Kubernetes itself
- node resource utilization
- node count
- pod count
- resource object state
Pod monitoring
- pod count (per project)
- container resource utilization
- application metrics
Metric | Implementation | Example |
--- | --- | --- |
node resource utilization | node-exporter | node CPU and memory utilization |
pod resource utilization | cadvisor | container CPU and memory utilization |
k8s resource object state | kube-state-metrics | pod/deployment/service |
Service discovery type | Description |
--- | --- |
node | Discovers cluster nodes; the default address is the kubelet's HTTP port |
service | Discovers all Services and their ports as targets |
pod | Discovers all Pods as targets |
endpoints | Discovers the Pods behind each Service's Endpoints as targets |
ingress | Discovers Ingress paths as targets |
Quick deployment of a k8s cluster
Set the hostnames for the cluster
###master
[root@k8s-master ~]# hostnamectl set-hostname k8s-master
###node
[root@k8s-node1 ~]# hostnamectl set-hostname k8s-node1
Sync the time and temporarily disable swap
## run the following on both servers
ntpdate ntp1.aliyun.com
swapoff -a
Preparation, identical on both machines
### the same on both machines
[root@k8s-master ~]# vim /etc/hosts
192.168.200.142 k8s-master
192.168.200.141 k8s-node1
[root@k8s-master ~]# vim /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward=1
sysctl --system
Install Docker/kubeadm/kubelet [on all nodes]
wget https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo -O /etc/yum.repos.d/docker-ce.repo
yum -y install docker-ce
systemctl enable docker && systemctl start docker
vim /etc/docker/daemon.json
{
"registry-mirrors": ["https://b9pmyelo.mirror.aliyuncs.com"]
}
systemctl restart docker
vim /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=0
repo_gpgcheck=0
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
yum install -y kubelet-1.18.0 kubeadm-1.18.0 kubectl-1.18.0
systemctl enable kubelet
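As an optional sanity check before initializing the cluster, the three version numbers should line up with the 1.18.0 packages just installed:
docker --version
kubeadm version -o short   # should print v1.18.0
kubelet --version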
Deploy the k8s master node
kubeadm init \
--apiserver-advertise-address=192.168.200.142 \
--image-repository registry.aliyuncs.com/google_containers \
--kubernetes-version v1.18.0 \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16 \
--ignore-preflight-errors=all
### when the output below appears you are OK; on a slow network this can take a long time
kubeadm join 192.168.200.142:6443 --token tvk2tg.8ml2hqzhvdfphh1t \
--discovery-token-ca-cert-hash sha256:1f5447a7a9b050dda3e3c5692a523e1ac0c5146a5dd14aa47081e93ef05b055c
[root@k8s-master yum.repos.d]# echo "$?"
0
[root@k8s-master ~]# mkdir -p $HOME/.kube
[root@k8s-master ~]# cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@k8s-master ~]# chown $(id -u):$(id -g) $HOME/.kube/config
[root@k8s-master ~]# ll .kube/config
-rw------- 1 root root 5451 Jun 5 23:07 .kube/config
[root@k8s-master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master NotReady master 3m23s v1.18.0
####
--apiserver-advertise-address: the cluster advertise address
--image-repository: the default registry k8s.gcr.io is unreachable from mainland China, so point at the Aliyun mirror instead
--kubernetes-version: the K8s version, matching the packages installed above
--service-cidr: the cluster-internal virtual network, the unified entry point for reaching Pods
--pod-network-cidr: the Pod network; must match the CNI network component YAML deployed below
Deploy the worker node
kubeadm join 192.168.200.142:6443 --token tvk2tg.8ml2hqzhvdfphh1t \
--discovery-token-ca-cert-hash sha256:1f5447a7a9b050dda3e3c5692a523e1ac0c5146a5dd14aa47081e93ef05b055c ## this token/hash pair is the one printed during kubeadm init on the master; don't copy it wrong
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
### check on the master
[root@k8s-master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master NotReady master 6m40s v1.18.0
k8s-node1 NotReady <none> 76s v1.18.0
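If a node joins later and the original bootstrap token has expired (tokens are valid for 24 hours by default), a fresh join command can be printed on the master:
kubeadm token create --print-join-command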
Install the network plugin (needed for both master and node)
wget https://docs.projectcalico.org/v3.8/manifests/calico.yaml
[root@k8s-master ~]# sed -i -e "s?192.168.0.0/16?10.244.0.0/16?g" calico.yaml
[root@k8s-master ~]# kubectl apply -f calico.yaml
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
[root@k8s-master ~]# kubectl get nodes ### sometimes you need to wait a while
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 13m v1.18.0
Create a pod that can be monitored
[root@k8s-master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 3d9h v1.18.0
[root@k8s-master ~]# kubectl create deployment web --image=nginx
deployment.apps/web created
[root@k8s-master ~]# kubectl get pods ### the master node is tainted, so the pod will not be scheduled there; let's remove the taint
NAME READY STATUS RESTARTS AGE
web-5dcb957ccc-kpvdr 0/1 Pending 0 28s
[root@k8s-master ~]# kubectl describe node | grep Taint ### filter for the taints
Taints: node-role.kubernetes.io/master:NoSchedule
[root@k8s-master ~]# kubectl taint node k8s-master node-role.kubernetes.io/master- ### remove the taint from the node
node/k8s-master untainted
[root@k8s-master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
web-5dcb957ccc-kpvdr 1/1 Running 0 4m59s
[root@k8s-master ~]# kubectl get pods -n kube-system ### the components running in kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-75d555c48-vs97h 1/1 Running 0 3d8h
calico-node-52lr6 1/1 Running 0 3d8h
coredns-7ff77c879f-7xfmt 1/1 Running 0 3d9h
coredns-7ff77c879f-mcl9f 1/1 Running 0 3d9h
etcd-k8s-master 1/1 Running 0 3d9h
kube-apiserver-k8s-master 1/1 Running 0 3d9h
kube-controller-manager-k8s-master 1/1 Running 0 3d9h
kube-proxy-xhgk5 1/1 Running 0 3d9h
kube-scheduler-k8s-master 1/1 Running 0 3d9h
Monitoring pod resources with Prometheus
[root@k8s-master ~]# kubectl get ep
NAME ENDPOINTS AGE
kubernetes 192.168.200.142:6443 3d9h ### Prometheus needs to reach this API endpoint, but 6443 is not opened up lightly, so authentication is required
### set up RBAC authorization
[root@k8s-master ~]# vim rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
[root@k8s-master ~]# kubectl apply -f rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created ### these three lines mean the creation succeeded
### next, fetch the token that was issued for the ServiceAccount
[root@k8s-master ~]# kubectl get sa -n kube-system | grep promethe
prometheus 1 3m50s ### this is the ServiceAccount we just created
[root@k8s-master ~]# kubectl get sa prometheus -n kube-system -o yaml ### dump this resource as YAML
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"name":"prometheus","namespace":"kube-system"}}
  creationTimestamp: "2021-06-09T13:02:59Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:secrets:
        .: {}
        k:{"name":"prometheus-token-f8s8g"}:
          .: {}
          f:name: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-06-09T13:02:59Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
    manager: kubectl
    operation: Update
    time: "2021-06-09T13:02:59Z"
  name: prometheus
  namespace: kube-system
  resourceVersion: "18462"
  selfLink: /api/v1/namespaces/kube-system/serviceaccounts/prometheus
  uid: 26fda85a-70d3-46d2-9a74-6869fdb1c9fc
secrets:
- name: prometheus-token-f8s8g #### the value after "name" is the name of the token secret we need
[root@k8s-master ~]# kubectl describe secret prometheus-token-f8s8g -n kube-system ### grab the secret; the value after "token:" below is the bearer token
Name: prometheus-token-f8s8g
Namespace: kube-system
Labels: <none>
Annotations: kubernetes.io/service-account.name: prometheus
kubernetes.io/service-account.uid: 26fda85a-70d3-46d2-9a74-6869fdb1c9fc
Type: kubernetes.io/service-account-token
Data
====
ca.crt: 1025 bytes
namespace: 11 bytes
token: eyJhbGciOiJSUzI1NiIsImtpZCI6Inl2Y3JqOXBOUkNLajRFZ0FHR2dhYTFra1d4X2VjUzRYbGp5Q3VXUnY2V28ifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLXRva2VuLWY4czhnIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InByb21ldGhldXMiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiIyNmZkYTg1YS03MGQzLTQ2ZDItOWE3NC02ODY5ZmRiMWM5ZmMiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06cHJvbWV0aGV1cyJ9.uh9IiJWbvwkOT9Jos9Jtzo3n4k2KGpEfm8vpO5PefxmHztlsGe2UIXKaJoF1E-HWjedv4fHVH2_ATMegqoZgRXdC2HUnPO-HmOooC5A561KYj-9Mru-MMYfQPA-2zshjBsbLU2ngM0GM1d9B-ec_Aq3tHTQT1LF_sdpciLNmcoTjQwE1zK76q-L1X97lPMwopWTZefQs93Y86Z-c3kufAFx3MOE2iwMI--EDorjjMENnjvu3I-zZmU-i_c4PeBVlwA_AZfUvuPRMYfkvY7e9cMbqLkkRvjaiPxDe_6Amr4l4kzIvJkC4946_L8Q9QD1KdC7W6YPShpW40ZpMbU10cg
[root@k8s-master ~]# kubectl describe secret prometheus-token-f8s8g -n kube-system > token.k8s ### dump it to a file and copy it to the Prometheus server; we will use this token for access in a moment
[root@k8s-master ~]# scp token.k8s 192.168.200.132:/root/
The authenticity of host '192.168.200.132 (192.168.200.132)' can't be established.
ECDSA key fingerprint is SHA256:DaZ1qd1UDlXKAYiTYk5ZBjwvWxkEwQJmHew7PIK78wA.
ECDSA key fingerprint is MD5:79:cb:6d:09:0e:03:e8:58:3b:e4:81:88:da:07:7d:9f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.200.132' (ECDSA) to the list of known hosts.
root@192.168.200.132's password:
token.k8s
### on the Prometheus server
[root@bogon ~]# mv token.k8s /usr/local/prometheus/
[root@bogon ~]# cd /usr/local/prometheus/
[root@bogon prometheus]# ls
console_libraries consoles data LICENSE NOTICE prometheus prometheus.yml promtool sd_config token.k8s
[root@bogon prometheus]# vim token.k8s ## edit this file so that only the token value remains
eyJhbGciOiJSUzI1NiIsImtpZCI6Inl2Y3JqOXBOUkNLajRFZ0FHR2dhYTFra1d4X2VjUzRYbGp5Q3VXUnY2V28ifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLXRva2VuLWY4czhnIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6InByb21ldGhldXMiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiIyNmZkYTg1YS03MGQzLTQ2ZDItOWE3NC02ODY5ZmRiMWM5ZmMiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06cHJvbWV0aGV1cyJ9.uh9IiJWbvwkOT9Jos9Jtzo3n4k2KGpEfm8vpO5PefxmHztlsGe2UIXKaJoF1E-HWjedv4fHVH2_ATMegqoZgRXdC2HUnPO-HmOooC5A561KYj-9Mru-MMYfQPA-2zshjBsbLU2ngM0GM1d9B-ec_Aq3tHTQT1LF_sdpciLNmcoTjQwE1zK76q-L1X97lPMwopWTZefQs93Y86Z-c3kufAFx3MOE2iwMI--EDorjjMENnjvu3I-zZmU-i_c4PeBVlwA_AZfUvuPRMYfkvY7e9cMbqLkkRvjaiPxDe_6Amr4l4kzIvJkC4946_L8Q9QD1KdC7W6YPShpW40ZpMbU10cg
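Before wiring the token into Prometheus, it is worth a quick check that it actually authenticates against the APIServer. A sanity check, assuming the token file and APIServer address from above:
curl -k -H "Authorization: Bearer $(cat /usr/local/prometheus/token.k8s)" https://192.168.200.142:6443/api/v1/nodes
# a JSON NodeList means the RBAC grant and token work; a 401/403 means they do not
As an alternative to hand-editing the file, the raw token can also be extracted in one step on the master:
kubectl -n kube-system get secret prometheus-token-f8s8g -o jsonpath='{.data.token}' | base64 -d > token.k8s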
[root@bogon prometheus]#cd ~
[root@bogon ~]# rz -E
rz waiting to receive.
[root@bogon ~]# ls
anaconda-ks.cfg prometheus-2.25.2.linux-amd64.tar.gz
grafana-7.3.1.linux-amd64.tar.gz prometheus.yml
[root@bogon ~]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
-------------------------------------------------------------------- Part 1: copy the block below (remove the # comments when copying)
- job_name: kubernetes-nodes-cadvisor
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: node
    api_server: https://192.168.200.142:6443
    bearer_token_file: /usr/local/prometheus/token.k8s
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /usr/local/prometheus/token.k8s
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  # use each node label (.*) as a new label name, keeping the original value
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.*)
  # rewrite NodeIP:10250 to APIServerIP:6443
  - action: replace
    regex: (.*)
    source_labels: ["__address__"]
    target_label: __address__
    replacement: 192.168.200.142:6443
  # the real metrics endpoint https://NodeIP:10250/metrics/cadvisor is only reachable by the APIServer,
  # so rewrite the path to scrape it through the APIServer proxy
  - action: replace
    source_labels: [__meta_kubernetes_node_name]
    target_label: __metrics_path__
    regex: (.*)
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
------------------------------------------ End of part 1; part 2 follows (again, remove the # comments when copying)
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://192.168.200.142:6443
    bearer_token_file: /usr/local/prometheus/token.k8s
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /usr/local/prometheus/token.k8s
  tls_config:
    insecure_skip_verify: true
  # skip Services that do not carry the prometheus.io/scrape annotation
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  # rewrite the scrape scheme
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  # rewrite the metrics URL path
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  # rewrite the scrape address
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  # use each K8s label (.+) as a new label name, keeping the original value
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  # generate a namespace label
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  # generate a Service name label
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_service_name
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
    api_server: https://192.168.200.142:6443
    bearer_token_file: /usr/local/prometheus/token.k8s
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /usr/local/prometheus/token.k8s
  tls_config:
    insecure_skip_verify: true
  # skip Pods that do not carry the prometheus.io/scrape annotation
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  # rewrite the metrics URL path
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  # rewrite the scrape address
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  # use each K8s label (.+) as a new label name, keeping the original value
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  # generate a namespace label
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  # generate a Pod name label
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
----------------------- End of part 2
################ Finally, append the following to the Prometheus config file
[root@bogon prometheus]# pwd
/usr/local/prometheus
[root@bogon prometheus]# vim prometheus.yml
- job_name: kubernetes-nodes-cadvisor
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: node
    api_server: https://192.168.200.142:6443
    bearer_token_file: /usr/local/prometheus/token.k8s
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /usr/local/prometheus/token.k8s
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.*)
  - action: replace
    regex: (.*)
    source_labels: ["__address__"]
    target_label: __address__
    replacement: 192.168.200.142:6443
  - action: replace
    source_labels: [__meta_kubernetes_node_name]
    target_label: __metrics_path__
    regex: (.*)
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
[root@bogon prometheus]# ./promtool check config ./prometheus.yml ### check the config file for syntax errors
Checking ./prometheus.yml
SUCCESS: 0 rule files found
[root@bogon prometheus]# pgrep prome
622
[root@bogon prometheus]# kill -HUP 622 ## hot-reload the config
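After the reload, the kubernetes-nodes-cadvisor job should show up under Status -> Targets, and cAdvisor series become queryable. Two illustrative PromQL queries using standard cAdvisor metric names (on some cAdvisor versions the pod label is called pod_name instead of pod):
sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod)   # per-pod CPU usage over the last 5m
sum(container_memory_working_set_bytes{image!=""}) by (pod)            # per-pod working-set memory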
Monitoring Kubernetes object resources
This requires deploying a service on the k8s cluster
[root@k8s-master ~]# rz -E
rz waiting to receive.
[root@k8s-master ~]# ls
anaconda-ks.cfg calico.yaml kube-state-metrics.yaml rbac.yaml token.k8s
[root@k8s-master ~]# kubectl apply -f kube-state-metrics.yaml
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
role.rbac.authorization.k8s.io/kube-state-metrics-resizer created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
rolebinding.rbac.authorization.k8s.io/kube-state-metrics created
deployment.apps/kube-state-metrics created
configmap/kube-state-metrics-config created
service/kube-state-metrics created
[root@k8s-master ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-75d555c48-vs97h 1/1 Running 0 3d10h
calico-node-52lr6 1/1 Running 0 3d10h
coredns-7ff77c879f-7xfmt 1/1 Running 0 3d10h
coredns-7ff77c879f-mcl9f 1/1 Running 0 3d10h
etcd-k8s-master 1/1 Running 0 3d10h
kube-apiserver-k8s-master 1/1 Running 0 3d10h
kube-controller-manager-k8s-master 1/1 Running 0 3d10h
kube-proxy-xhgk5 1/1 Running 0 3d10h
kube-scheduler-k8s-master 1/1 Running 0 3d10h
kube-state-metrics-866f97f7fb-nzl7f 2/2 Running 0 2m7s
### the last entry is the service we just started; if it shows ContainerCreating, don't worry, it is still pulling the image
Update the Prometheus config file
### on the Prometheus server
[root@bogon prometheus]# vim prometheus.yml
Paste in part 2:
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
    api_server: https://192.168.200.142:6443
    bearer_token_file: /usr/local/prometheus/token.k8s
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /usr/local/prometheus/token.k8s
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scrape
  - action: replace
    regex: (https?)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_scheme
    target_label: __scheme__
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_service_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_service_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_service_name
    target_label: kubernetes_service_name
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
    api_server: https://192.168.200.142:6443
    bearer_token_file: /usr/local/prometheus/token.k8s
    tls_config:
      insecure_skip_verify: true
  bearer_token_file: /usr/local/prometheus/token.k8s
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - action: keep
    regex: true
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_scrape
  - action: replace
    regex: (.+)
    source_labels:
    - __meta_kubernetes_pod_annotation_prometheus_io_path
    target_label: __metrics_path__
  - action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels:
    - __meta_kubernetes_namespace
    target_label: kubernetes_namespace
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_name
    target_label: kubernetes_pod_name
[root@bogon prometheus]# ./promtool check config ./prometheus.yml
Checking ./prometheus.yml
SUCCESS: 0 rule files found
[root@bogon prometheus]# kill -HUP 622
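Once the kube-state-metrics endpoint is reachable (see the workaround below if it is not), object-state series can be queried. A few examples using standard kube-state-metrics metric names:
kube_pod_status_phase{phase="Running"}                        # pods per phase
kube_deployment_status_replicas_available                     # available replicas per deployment
kube_node_status_condition{condition="Ready",status="true"}   # node readiness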
Workaround 1
### enable IPv4 forwarding on both servers; without it the traffic cannot get through
[root@bogon prometheus]# ip route add 10.244.0.0/16 via 192.168.200.142 dev ens32
[root@bogon prometheus]# ip route
default via 192.168.200.2 dev ens32
10.244.0.0/16 via 192.168.200.142 dev ens32
169.254.0.0/16 dev ens32 scope link metric 1002
192.168.200.0/24 dev ens32 proto kernel scope link src 192.168.200.132 ### properly this should be fixed by the network admin on the router; if traffic now passes, fine; if not, there is nothing more to do here
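For reference, a minimal sketch of turning IPv4 forwarding on, both immediately and persistently (the sysctl key is standard; /etc/sysctl.conf is one common place to persist it):
sysctl -w net.ipv4.ip_forward=1                      # enable right now
echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf   # persist across reboots
sysctl -p                                            # reload the persisted settings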
Using PromQL
PromQL is Prometheus's own SQL-like query language for querying metrics across arbitrary dimensions.
Running "up" in the Prometheus UI shows the status of every instance at once.
### query the latest sample of a metric (an instant vector):
node_cpu_seconds_total
node_cpu_seconds_total{job="<the job you want>"}
### durations support the units s, m, h, d, w, y (adding one turns it into a range query)
node_cpu_seconds_total{job="<the job you want>"}[5m]
node_cpu_seconds_total{job="<the job you want>"}[1h]
Common operators
Comparison operators
• == equal
• != not equal
• > greater than
• < less than
• >= greater than or equal
• <= less than or equal
node_cpu_seconds_total{job="Linuxweb",mode="iowait"}
node_cpu_seconds_total{job="Linuxweb",mode=~"user|system"}
node_cpu_seconds_total{job="Linuxweb",mode=~"user|system",cpu!="0"}
Arithmetic operators
• + addition
• - subtraction
• * multiplication
• / division
CPU usage:
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
Memory usage:
100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) /
node_memory_MemTotal_bytes * 100
Regex match operators
• =~ regex match
• !~ negated regex match
Disk usage:
100 - (node_filesystem_free_bytes{mountpoint="/",fstype=~"ext4|xfs"} /
node_filesystem_size_bytes{mountpoint="/",fstype=~"ext4|xfs"} * 100)
Aggregation operators
• sum (sum over dimensions)
• max (maximum over dimensions)
• min (minimum over dimensions)
• avg (average over dimensions)
• count (count the number of samples)
• irate (rate of change over a time window)
Total CPU system time across all instances:
sum(node_cpu_seconds_total{job="Linuxweb",mode="system"})
Average rate of change of CPU system time across all instances:
avg(irate(node_cpu_seconds_total{job="Linuxweb",mode="system"}[5m]))
Count the number of CPUs:
count(node_cpu_seconds_total{job="Linuxweb",mode="system"})
Logical operators • and • or
Greater than 10 and less than 50:
prometheus_http_requests_total > 10 and prometheus_http_requests_total < 50
Greater than 10 or less than 50:
prometheus_http_requests_total > 10 or prometheus_http_requests_total < 50
Managing metric labels
Labels exist so that series can be grouped and the data viewed along different dimensions.
Every Target instance in Prometheus carries some default metadata labels, which can be inspected on the
Targets page of the Prometheus UI:
• __address__: the address at which the target instance is scraped
• __scheme__: the HTTP scheme (HTTP or HTTPS) of the scrape URL
• __metrics_path__: the URL path of the metrics endpoint
## back up the config file first, then slim it down by removing the k8s jobs we no longer need
[root@bogon prometheus]# cp prometheus.yml{,.bak}
[root@bogon prometheus]# ls
console_libraries data NOTICE prometheus.yml promtool token.k8s
consoles LICENSE prometheus prometheus.yml.bak sd_config
#### add labels to the existing scrape jobs:
[root@bogon prometheus]# vim prometheus.yml
- job_name: 'docker'
  static_configs:
  - targets: ['192.168.200.131:8080']
    labels:                    ## this block attaches extra labels
      idc: jiuxianqiao         ### first label: idc, with the value jiuxianqiao
      project: www             #### second label: project, with the value www
- job_name: 'mysqldb'
  static_configs:
  - targets: ['192.168.200.131:9104']
    labels:
      idc: zhaowei
      project: wordpress
[root@bogon prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
[root@bogon prometheus]# kill -HUP 622
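After the reload, the new static labels can be used like any other label in queries, e.g. (values taken from the jobs above):
up{idc="jiuxianqiao"}    # every target in the jiuxianqiao IDC
up{project="wordpress"}  # every target belonging to the wordpress project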
Relabeling
The purpose of relabeling is to identify monitoring metrics more usefully.
Relabeling can happen at two stages:
• relabel_configs: before scraping
• metric_relabel_configs: before storing
When a scrape is about to happen, relabel_configs can add labels, scrape only specific targets, or filter targets out.
Once metrics have been scraped, metric_relabel_configs performs the final relabeling and filtering before storage.
Typical uses of relabeling:
• dynamically generate new labels
• filter the targets to scrape
• drop unneeded or sensitive labels
• add new labels
action: the relabeling action
• replace: the default; matches the source_labels value against regex and writes replacement into the
target label, referencing capture groups as $1, $2, ...
• keep: drop targets whose concatenated source_labels do not match regex
• drop: drop targets whose concatenated source_labels match regex
• labeldrop: drop labels matching regex
• labelkeep: drop labels not matching regex
• labelmap: match regex against all label names and use the first capture
group as the new label name
Capture-group references
### edit the config file to write a new relabel rule
[root@bogon prometheus]# vim prometheus.yml
- job_name: 'docker'
  static_configs:
  - targets: ['192.168.200.131:8080']
    labels:
      idc: jiuxianqiao
      project: www
  relabel_configs:                  ### enable relabeling for this job
  - action: replace                 ### the action to perform
    source_labels: ["__address__"]  ### which source label to read
    regex: (.*):([0-9]+)            #### regex splitting it into capture groups
    replacement: $1                 #### which capture group to use as the value
    target_label: "ip"              ##### the name of the new label
[root@bogon prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
[root@bogon prometheus]# kill -HUP 622
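After this reload every target of the docker job should carry the extra ip label, which can be checked with a query like the following (the value follows the target configured above):
up{job="docker",ip="192.168.200.131"}   # the host part has been split off the address into its own label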
Filtering scrape targets
### edit the config file
[root@bogon prometheus]# vim prometheus.yml
- job_name: 'linuxweb'
  basic_auth:
    username: yunjisuan
    password: 123456
  static_configs:
  - targets: ['192.168.200.147:9100','192.168.200.131:9100']
  relabel_configs:
  - action: drop
    regex: "192.168.200.131.*"
    source_labels: ["__address__"]
[root@bogon prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
Prometheus alerting
Deploy Alertmanager
### on the Prometheus server
[root@bogon ~]# ls ## pull in the source tarball and extract it
alertmanager-0.21.0.linux-amd64.tar.gz grafana-7.3.1.linux-amd64.tar.gz prometheus.yml
anaconda-ks.cfg prometheus-2.25.2.linux-amd64.tar.gz
[root@bogon ~]# tar xf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
[root@bogon ~]# mv /usr/local/alertmanager-0.21.0.linux-amd64/ /usr/local/alertmanager
[root@bogon ~]# cd /usr/local/alertmanager/
[root@bogon alertmanager]# ls
alertmanager alertmanager.yml amtool LICENSE NOTICE
[root@bogon alertmanager]# cp /usr/lib/systemd/system/grafana.service /usr/lib/systemd/system/alertmanager.service
[root@bogon alertmanager]# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
[Service]
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
[root@bogon alertmanager]# systemctl daemon-reload
[root@bogon alertmanager]# systemctl start alertmanager
[root@bogon alertmanager]# systemctl enable alertmanager
Created symlink from /etc/systemd/system/multi-user.target.wants/alertmanager.service to /usr/lib/systemd/system/alertmanager.service.
[root@bogon alertmanager]# ps -ef | grep alertmanager
root 5428 1 0 21:20 ? 00:00:00 /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
root 5453 5357 0 21:20 pts/0 00:00:00 grep --color=auto alertmanager
[root@bogon alertmanager]# ss -antup | grep 9093
tcp LISTEN 0 128 :::9093 :::* users:(("alertmanager",pid=5428,fd=8))
### tell Prometheus where Alertmanager lives
[root@bogon alertmanager]# vim /usr/local/prometheus/prometheus.yml ### change only the section below; everything else stays the same
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093
[root@bogon alertmanager]# cd /usr/local/prometheus/
[root@bogon prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
[root@bogon prometheus]# kill -HUP 622
Configure the Alertmanager config file
[root@bogon prometheus]# cd /usr/local/alertmanager/
[root@bogon alertmanager]# ls
alertmanager alertmanager.yml amtool LICENSE NOTICE
[root@bogon alertmanager]# vim alertmanager.yml
### must be a 163 mailbox here, otherwise it errors out
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '18910670931@163.com'
  smtp_auth_username: '18910670931@163.com'
  smtp_auth_password: 'QVIKAAPBLVIHHWUS'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver
receivers:
- name: 'default-receiver'
  email_configs:
  - to: '18910670931@163.com'
------------------------ Annotated version
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '18910670931@163.com'
  smtp_auth_username: '18910670931@163.com'
  smtp_auth_password: 'QVIKAAPBLVIHHWUS' ### third-party SMTP authorization code, obtained from the mail provider's settings
  smtp_require_tls: false
# the routing tree
route:
  group_by: ['alertname']       # group alerts by alert rule name
  group_wait: 30s               # how long to wait for the first alert in a group; alerts arriving within this window are merged into one notification
  group_interval: 5m            # interval before notifying about new alerts added to a group
  repeat_interval: 10m          # interval between repeated notifications for the same alert
  receiver: default-receiver    ### which receiver to send to
# receivers
receivers:
- name: 'default-receiver'
  email_configs:
  - to: '18910670931@163.com'
    html: ''
    headers: { Subject: "[WARN] alert test mail" }
[root@bogon alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 0 inhibit rules
- 1 receivers
- 0 templates
[root@bogon alertmanager]# systemctl restart alertmanager
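To confirm Alertmanager is reachable, and later to inspect the alerts it has received, amtool can query its API (assuming the default listen address used above):
./amtool alert query --alertmanager.url=http://127.0.0.1:9093   # lists currently firing alerts; empty for now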
Create alert rules in Prometheus
[root@bogon alertmanager]# cd ../prometheus/
[root@bogon prometheus]# vim prometheus.yml
rule_files:
  - "rules/*.yml"
[root@bogon prometheus]# mkdir rules
[root@bogon prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 0 rule files found
[root@bogon prometheus]# kill -HUP 622
[root@bogon prometheus]# vim rules/node.yml
groups:
- name: alert-rules.yml
  rules:
  - alert: InstanceStatus
    expr: up == 0
    for: 10s
    labels:
      severity: "critical"
    annotations:
      description: "{{ $labels.instance }} has stopped working"
      summary: "{{ $labels.instance }}: job {{ $labels.job }} has been down for more than 20s."
[root@bogon prometheus]# ./promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 1 rule files found
Checking rules/node.yml
SUCCESS: 1 rules found
------------------- Annotated version
groups:
- name: alert-rules.yml
  rules:
  - alert: InstanceStatus       # alert name
    expr: up == 0               # trigger condition
    for: 10s                    # the condition must hold for 10s before the alert fires
    labels:                     # labels attached to the alert
      severity: "critical"
    annotations:                # extra annotations, not used to identify the alert
      description: "{{ $labels.instance }} has stopped working"
      summary: "{{ $labels.instance }}: job {{ $labels.job }} has been down for more than 20s."
Alert states
• Inactive: nothing is happening.
• Pending: the threshold has been crossed, but not yet for the required duration.
• Firing: the threshold has been crossed for the required duration; the alert is sent to the receivers.
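A simple way to watch these transitions (assuming node_exporter from part 1 runs as a systemd service on a client; adjust to however it was started) is to stop it and watch the alert on the Prometheus /alerts page move through the states:
systemctl stop node_exporter    # on a client; after the 10s "for" window the InstanceStatus alert goes Pending, then Firing
systemctl start node_exporter   # recover; the alert returns to Inactive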
Alert inhibition (too complex and rarely used, so here is just an example without a demo)
# inhibition rules
inhibit_rules:
- source_match:        ### source alerts: those carrying the label below
    severity: 'high'
  target_match:
    severity: 'warning'   ### alerts with this label get suppressed
  equal: ['alertname', 'dev', 'instance']   ##### suppression applies only when these labels match exactly