Deploying kube-prometheus on k8s
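Preface
Environment: CentOS 7.9, k8s v1.22.6, kube-prometheus-release-0.10.zip
Overview
We install the Prometheus monitoring stack on a k8s cluster by way of prometheus-operator; the project is hosted on GitHub.
Two related projects exist on GitHub, kube-prometheus and prometheus-operator (originally published under coreos, now hosted under the prometheus-operator organization), and both can create and manage Prometheus.
Note that kube-prometheus is itself built on prometheus-operator and ships a large set of default configuration, so this article uses the manifests from the kube-prometheus project.
Before starting, check the k8s version requirements and pick the matching kube-prometheus release.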
Download the kube-prometheus package
Download the kube-prometheus-release-0.10.zip archive from https://github.com/prometheus-operator/kube-prometheus/tree/release-0.10 and upload it to the server.
mkdir /root/kube-prometheus
cd /root/kube-prometheus
unzip kube-prometheus-release-0.10.zip
cd /root/kube-prometheus/kube-prometheus-release-0.10
# List the images these yaml files require
find ./manifests -type f | xargs grep 'image: '|sort|uniq|awk '{print $3}'|grep '^[a-zA-Z]'
quay.io/prometheus/alertmanager:v0.23.0
jimmidyson/configmap-reload:v0.5.0
quay.io/brancz/kube-rbac-proxy:v0.11.0
quay.io/prometheus/blackbox-exporter:v0.19.0
grafana/grafana:8.3.3
k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
quay.io/brancz/kube-rbac-proxy:v0.11.0
quay.io/brancz/kube-rbac-proxy:v0.11.0
quay.io/prometheus/node-exporter:v1.3.1
k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
quay.io/brancz/kube-rbac-proxy:v0.11.0
quay.io/prometheus-operator/prometheus-operator:v0.53.1
quay.io/prometheus/prometheus:v2.32.1
# Pull the images ahead of time. Two of them could not be pulled, so the workaround below was used:
# Images that failed to pull
k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
# Substitute images
docker pull lbbi/prometheus-adapter:v0.9.1
docker pull bitnami/kube-state-metrics:latest
# Retag them with the names the manifests expect (note: the latest tag is not pinned, so it may not correspond to v2.3.0)
docker tag lbbi/prometheus-adapter:v0.9.1 k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
docker tag bitnami/kube-state-metrics:latest k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
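The pull-and-retag steps above can be wrapped in a small helper. This is only a sketch: the function name retag_image is hypothetical, and it assumes Docker is the container runtime on the node, as in the commands above.

```shell
#!/usr/bin/env bash
# retag_image SRC DST: pull SRC and tag it as DST so the manifests find it locally.
# retag_image is a hypothetical helper name, not part of docker or kube-prometheus.
retag_image() {
  local src="$1" dst="$2"
  docker pull "$src" && docker tag "$src" "$dst"
}

# Usage, with the image pairs listed above:
# retag_image lbbi/prometheus-adapter:v0.9.1 k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
# retag_image bitnami/kube-state-metrics:latest k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
```

Images pulled this way are local to the node, so the helper would have to run on every node that may schedule these pods.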
# Install
kubectl apply --server-side -f manifests/setup
kubectl wait \
--for condition=Established \
--all CustomResourceDefinition \
--namespace=monitoring
kubectl apply -f manifests/
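# Explanation
# manifests/setup/ holds the yaml manifests that create the monitoring namespace and the custom resource definitions (CRDs)
# manifests/ holds a large number of yaml files that create the Prometheus components: sts, deployment, servicemonitors, svc, alertmanagers, prometheusrules and so on
# --server-side makes the API server perform the apply instead of kubectl; here it avoids the "metadata.annotations: Too long" error that client-side apply can hit on the large CRDs
# kubectl wait blocks until every CustomResourceDefinition reaches the Established condition (CRDs are cluster-scoped, so --namespace=monitoring does not narrow the selection)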
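After the second apply it can help to wait for the pods before moving on. A minimal sketch, assuming everything lands in the monitoring namespace as above; the function name wait_monitoring_ready and the 300s default are illustrative choices.

```shell
#!/usr/bin/env bash
# wait_monitoring_ready [TIMEOUT]: block until every pod in the monitoring
# namespace reports the Ready condition, or fail after TIMEOUT (default 300s).
# The function name is hypothetical; kubectl wait itself is a standard subcommand.
wait_monitoring_ready() {
  local timeout="${1:-300s}"
  kubectl -n monitoring wait --for=condition=Ready pods --all --timeout="$timeout"
}

# Usage:
# wait_monitoring_ready 600s
```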
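Expose the Prometheus svc ports
After installation the Prometheus services are of type ClusterIP. To reach them from outside the cluster, we simply change them to NodePort here (in production, expose the ports in whatever way suits your environment) and then open them in a browser:
# Change the svc type to NodePort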
kubectl edit svc prometheus-k8s -n monitoring
kubectl edit svc grafana -n monitoring
kubectl edit svc alertmanager-main -n monitoring
# List the NodePort ports exposed by the services:
kubectl get svc -n monitoring | grep -i nodeport
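The interactive kubectl edit steps above can also be done non-interactively with kubectl patch. A minimal sketch; the function name expose_nodeport is hypothetical, and the node ports are still assigned by the cluster.

```shell
#!/usr/bin/env bash
# expose_nodeport NAME...: switch each named Service in the monitoring
# namespace to type NodePort using a strategic merge patch.
# expose_nodeport is a hypothetical helper name.
expose_nodeport() {
  local svc
  for svc in "$@"; do
    kubectl -n monitoring patch svc "$svc" -p '{"spec":{"type":"NodePort"}}'
  done
}

# Usage:
# expose_nodeport prometheus-k8s grafana alertmanager-main
```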
Access Prometheus
http://192.168.118.133:32539 # Prometheus
http://192.168.118.133:31757 # Grafana; the k8s cluster is already monitored out of the box, default credentials admin/admin
http://192.168.118.133:31206 # Alertmanager
Access Grafana
Open Grafana in a browser: http://192.168.118.133:31757 (admin/admin)
Under the grid icon on the left, Browse > Default folder, many dashboards are already built in.
Configure data persistence for Prometheus and Grafana
See this other article: https://blog.csdn.net/MssGuo/article/details/127891331
Prometheus data persistence
[root@matser manifests]# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
........................
prometheus-k8s-0 2/2 Running 0 68m # this pod actually runs prometheus-server
prometheus-k8s-1 2/2 Running 0 68m # this pod actually runs prometheus-server
# So the main Prometheus service runs as pods, namely prometheus-k8s-0 and prometheus-k8s-1
[root@matser manifests]# kubectl get sts -n monitoring
NAME READY AGE
alertmanager-main 3/3 69m
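prometheus-k8s 2/2 69m # this one
# However, none of the kube-prometheus yaml files we applied creates a StatefulSet, which looks odd at first.
# A search revealed that the project defines a custom resource called prometheus, and the StatefulSet is created from it: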
[root@matser manifests]# kubectl get prometheus -n monitoring
NAME VERSION REPLICAS AGE
k8s 2.32.1 2 71m # this prometheus resource is what creates the sts; the file that defines the k8s resource is prometheus-prometheus.yaml
[root@matser manifests]#
# prometheus-server has no data persistence configured by default, so we need to add persistent storage.
# Edit the prometheus resource named k8s in place (alternatively, modify prometheus-prometheus.yaml and apply it again)
[root@matser manifests]# kubectl edit prometheus k8s -n monitoring
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
.....................
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  retention: 15d # add this field: how long Prometheus keeps data; if unset the default is only 1 day
  ruleNamespaceSelector: {}
............
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  storage: # add the following block, 9 lines in total
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50M
        storageClassName: nfs-storageclass # name of an existing StorageClass
  version: 2.32.1
[root@matser manifests]#
# Check that the pvc and pv have been created
kubectl get pvc,pv -n monitoring
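The same change can be applied without an interactive edit by writing the added fields to a patch file. This is a sketch under the values used above (15d retention, 50M, nfs-storageclass); the function name and the /tmp path are hypothetical. Custom resources such as Prometheus need --type merge, since strategic merge patches only apply to built-in types.

```shell
#!/usr/bin/env bash
# write_storage_patch FILE: write a merge patch that adds retention and a
# volumeClaimTemplate to the Prometheus custom resource.
# The function name and values mirror the edit above and are illustrative.
write_storage_patch() {
  cat > "$1" <<'EOF'
spec:
  retention: 15d
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50M
        storageClassName: nfs-storageclass
EOF
}

# Usage:
# write_storage_patch /tmp/prometheus-storage-patch.yaml
# kubectl -n monitoring patch prometheus k8s --type merge \
#   --patch "$(cat /tmp/prometheus-storage-patch.yaml)"
```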
Grafana data persistence
[root@matser ~]# kubectl get deploy -n monitoring
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/grafana 1/1 1 1 110m
# As shown above, grafana is deployed as a deployment with a single pod.
# Create the grafana pvc by hand
[root@matser manifests]# cat > grafana-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100M
  storageClassName: nfs-storageclass
EOF
kubectl apply -f grafana-pvc.yaml
kubectl get pvc -n monitoring | grep grafana
kubectl get pv | grep grafana
# Edit the grafana yaml file (you can also edit the deployment in place)
[root@matser manifests]# vim grafana-deployment.yaml
      volumes:
      # - emptyDir: {} # comment out or delete
      #   name: grafana-storage # comment out or delete
      - name: grafana-storage # the name must match the one commented out above
        persistentVolumeClaim:
          claimName: grafana-pvc # the pvc we just created
      - name: grafana-datasources # already present, leave as is
        secret:
          secretName: grafana-datasources
# To pin the grafana login password, add environment variables to the container
        readinessProbe:
          httpGet:
            path: /api/health
            port: http
        env: # add these environment variables
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
kubectl replace -f grafana-deployment.yaml
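Troubleshooting
Opening the Prometheus UI shows that kube-controller-manager and kube-scheduler are missing.
Opening Grafana at http://192.168.118.133:31757 shows the same thing: the kube-controller-manager and kube-scheduler cluster components are not monitored.
Fixing Prometheus not monitoring kube-controller-manager and kube-scheduler
Cause:
The problem lies in the ServiceMonitor resources defined for Prometheus.
Looking at the manifest for the kube-scheduler component in the installation directory, we find: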
# Change into the directory we installed prometheus from
[root@master ~]# cd /root/kube-prometheus/kube-prometheus-release-0.10/manifests
# Look at the kubernetesControlPlane-serviceMonitorKubeScheduler.yaml manifest
# A ServiceMonitor is a Prometheus custom resource whose label selector targets a svc
[root@master manifests]# cat kubernetesControlPlane-serviceMonitorKubeScheduler.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
    app.kubernetes.io/part-of: kube-prometheus
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector: # namespace selector
    matchNames:
    - kube-system
  selector: # label selector
    matchLabels:
      app.kubernetes.io/name: kube-scheduler
# In the namespace named above, no svc matches this label selector at all
[root@master manifests]# kubectl get svc -l app.kubernetes.io/name=kube-scheduler -n kube-system
No resources found in kube-system namespace. # no svc matches
[root@master manifests]#
# The controller-manager manifest has the same problem: no svc matches
[root@master manifests]# tail kubernetesControlPlane-serviceMonitorKubeControllerManager.yaml
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-controller-manager
# There is no corresponding service at all
[root@master manifests]# kubectl get svc -l app.kubernetes.io/name=kube-controller-manager -n kube-system
No resources found in kube-system namespace.
[root@master manifests]#
Solution: create the missing svc by hand, so that each ServiceMonitor has a matching service, and the service in turn selects the pod through its label selector.
# So we create a svc ourselves to let Prometheus monitor kube-scheduler
[root@master ~]# cd /root/kube-prometheus/kube-prometheus-release-0.10/manifests/repair-prometheus
[root@master repair-prometheus]# cat kubeSchedulerService.yaml
apiVersion: v1
kind: Service
metadata:
  labels: # label on this service; kubernetesControlPlane-serviceMonitorKubeScheduler.yaml selects on it
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: kube-system # the namespace is kube-system
spec:
  selector: # this selector ties the service to the kube-scheduler pod
    component: kube-scheduler # kubectl get pods kube-scheduler-master -n kube-system --show-labels
  ports:
  - name: https-metrics # service port name; must match the port name in the ServiceMonitor
    port: 10259
    targetPort: 10259 # port of the kube-scheduler-master pod
[root@master repair-prometheus]#
# Likewise, create a svc so that Prometheus can monitor controller-manager
[root@master repair-prometheus]# cat kubeControllermanagerService.yaml
apiVersion: v1
kind: Service
metadata:
  labels: # label on this service; kubernetesControlPlane-serviceMonitorKubeControllerManager.yaml selects on it
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system # the namespace is kube-system
spec:
  selector: # this selector ties the service to the kube-controller-manager-master pod
    component: kube-controller-manager # kubectl get pods kube-controller-manager-master -n kube-system --show-labels
  ports:
  - name: https-metrics # service port name; must match the port name in the ServiceMonitor
    port: 10257
    targetPort: 10257 # port of the kube-controller-manager-master pod
[root@master repair-prometheus]#
# Create the two services above
[root@master repair-prometheus]# kubectl apply -f kubeSchedulerService.yaml -f kubeControllermanagerService.yaml
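One quick way to confirm that the new Services really select the control-plane pods is to look at their Endpoints. A minimal sketch; the function name is hypothetical, and the Service names are the two created above.

```shell
#!/usr/bin/env bash
# show_control_plane_endpoints: list the Endpoints objects behind the two
# Services created above. An empty ENDPOINTS column would suggest that the
# component=... selector does not match any pod.
# The function name is a hypothetical helper.
show_control_plane_endpoints() {
  kubectl -n kube-system get endpoints kube-scheduler kube-controller-manager
}

# Usage:
# show_control_plane_endpoints
```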
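After creating the services, Grafana still does not show the Scheduler and Controller Manager. One more step is needed:
# The kube-scheduler and kube-controller-manager pods bind to 127.0.0.1 by default, so when Prometheus reaches them through the node IP
# the connection is refused. The bind address has to be changed. These two components run as static pods, so go to the static pod directory on the master node
# If you do not know where the static pod directory is, look it up through the kubelet configuration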
[root@master manifests]# systemctl status kubelet.service | grep '\-\-config'
└─429488 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.5
[root@master manifests]# grep static /var/lib/kubelet/config.yaml
staticPodPath: /etc/kubernetes/manifests # this is the static pod directory
[root@master manifests]#
[root@master ~]# cd /etc/kubernetes/manifests
[root@master manifests]# grep 192 kube-scheduler.yaml
- --bind-address=192.168.118.131 # change 127.0.0.1 to the host IP; 0.0.0.0 also works
host: 192.168.118.131 # keeping the default 127.0.0.1 here also works
host: 192.168.118.131 # keeping the default 127.0.0.1 here also works
[root@master manifests]#
[root@master manifests]# vim kube-controller-manager.yaml
- --bind-address=192.168.118.131 # change 127.0.0.1 to the host IP; 0.0.0.0 also works
host: 192.168.118.131 # keeping the default 127.0.0.1 here also works
host: 192.168.118.131 # keeping the default 127.0.0.1 here also works
[root@master manifests]#
# After the change, the scheduler and controller-manager pods disappeared and were not recreated
# Restarting kubelet brought the pods back to normal
[root@master manifests]# systemctl restart kubelet.service
Grafana now shows the Scheduler and Controller Manager as monitored.
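The bind-address change can also be scripted with sed instead of editing by hand. This is a sketch that assumes GNU sed (as on CentOS) and a numeric IPv4 value in the flag; the function name set_bind_address is hypothetical. It deliberately writes no backup file next to the manifest, because kubelet may try to load extra files found in the static pod directory.

```shell
#!/usr/bin/env bash
# set_bind_address FILE ADDR: rewrite the --bind-address flag in a static pod
# manifest in place. Assumes GNU sed and an IPv4 value already in the flag.
# set_bind_address is a hypothetical helper name.
set_bind_address() {
  local file="$1" addr="$2"
  # No backup suffix is used: a stray copy inside /etc/kubernetes/manifests
  # could be picked up by kubelet as another manifest.
  sed -i -E "s/--bind-address=[0-9.]+/--bind-address=${addr}/" "$file"
}

# Usage (on the master node):
# set_bind_address /etc/kubernetes/manifests/kube-scheduler.yaml 0.0.0.0
# set_bind_address /etc/kubernetes/manifests/kube-controller-manager.yaml 0.0.0.0
```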
Uninstall kube-prometheus
If kube-prometheus is no longer needed, simply delete the resources created from the manifests:
[root@master ~]# cd /root/kube-prometheus/kube-prometheus-release-0.10/
[root@master ~]# kubectl delete -f manifests/
[root@master ~]# kubectl delete -f manifests/setup/
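At this point all system components are monitored by kube-prometheus, and the kube-prometheus setup is complete.
Monitoring nginx-ingress with kube-prometheus
nginx-ingress is the entry point for traffic, so it is important to monitor. kube-prometheus does not monitor nginx-ingress by default, so we need to create our own ServiceMonitor for it.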
[root@master ~]# kubectl get ds ingress-nginx-controller -n ingress-nginx
[root@master ~]# kubectl get pods -n ingress-nginx
NAME READY STATUS RESTARTS AGE
ingress-nginx-controller-g7vmn 1/1 Running 5 (26m ago) 16h
ingress-nginx-controller-sg2kq 1/1 Running 0 16h
[root@master ~]# kubectl get pods ingress-nginx-controller-g7vmn -n ingress-nginx -oyaml # the yaml shows that the pod exposes its metrics on port 10254
[root@master ~]# curl 192.168.118.132:10254/metrics # query the metrics exposed on pod port 10254; this prints a lot of output
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.5143e-05
go_gc_duration_seconds{quantile="0.25"} 0.000105286
go_gc_duration_seconds{quantile="0.5"} 0.000160581
..............
go_goroutines 91
# Create a servicemonitor to monitor nginx-ingress
[root@master ~]# kubectl get servicemonitors coredns -n monitoring -oyaml # take any existing servicemonitor as a starting point
# Create the servicemonitor for nginx-ingress
[root@master servicemonitor-nginx-ingress]# vim servicemonitor-nginx-ingress.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: ingress-nginx
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - path: /metrics
    interval: 15s
    port: metrics
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
[root@master servicemonitor-nginx-ingress]#
# Create the service
# A service with the matching label already exists, but it has no port named metrics
[root@master ~]# kubectl get svc -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller-admission ClusterIP 10.102.72.255 <none> 443/TCP 245d
[root@master ~]# kubectl edit svc ingress-nginx-controller-admission -n ingress-nginx
  ports:
  - appProtocol: https
    name: https-webhook
    port: 443
    protocol: TCP
    targetPort: webhook
  - name: metrics # append this entry: it opens port 10254 on the service, mapped to port 10254 on the pod
    port: 10254
    protocol: TCP
    targetPort: 10254
[root@master ~]# curl 10.102.72.255:10254/metrics # the metrics can now be read through port 10254 of the service
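As a non-interactive alternative to the kubectl edit step, the metrics port could be appended with a JSON patch. A minimal sketch, assuming the Service name shown above; the function name add_metrics_port is hypothetical.

```shell
#!/usr/bin/env bash
# add_metrics_port: append a port named "metrics" (10254 -> 10254) to the
# ingress-nginx-controller-admission Service using a JSON patch.
# The "/spec/ports/-" path appends to the end of the existing ports list.
# add_metrics_port is a hypothetical helper name.
add_metrics_port() {
  kubectl -n ingress-nginx patch svc ingress-nginx-controller-admission --type json \
    -p '[{"op":"add","path":"/spec/ports/-","value":{"name":"metrics","port":10254,"protocol":"TCP","targetPort":10254}}]'
}

# Usage:
# add_metrics_port
```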
The Prometheus targets page (http://192.168.118.131:32539/targets) still does not list nginx-ingress. The logs of the prometheus-k8s-0 pod show an error:
# The prometheus-k8s-0 pod logs a forbidden error: insufficient permissions
[root@master servicemonitor-nginx-ingress]# kubectl logs -n monitoring prometheus-k8s-0 -c prometheus
ts=2022-10-21T04:57:14.829Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"services\" in API group \"\" in the namespace \"ingress-nginx\""
# Inspect the cluster role created for prometheus
[root@master ~]# kubectl get clusterroles prometheus-k8s -oyaml
apiVersion: rbac.authorization.k8s.io/v1
........
rules: # the default permissions are too narrow
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
# Extend the permissions of the cluster role
[root@master ~]# kubectl edit clusterroles prometheus-k8s
rules: # add permissions
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
The Prometheus targets page (http://192.168.118.131:32539/targets) now lists nginx-ingress, which means it is being monitored.
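Widening the ClusterRole grants these permissions in every namespace. A narrower alternative, not the approach taken above, is a Role and RoleBinding scoped to ingress-nginx only. The sketch below assumes the prometheus-k8s ServiceAccount in monitoring named in the error log; the function name and file path are hypothetical.

```shell
#!/usr/bin/env bash
# write_ingress_rbac FILE: write a Role and RoleBinding that let the
# prometheus-k8s ServiceAccount discover services, endpoints and pods in the
# ingress-nginx namespace only. The function name is a hypothetical helper.
write_ingress_rbac() {
  cat > "$1" <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
EOF
}

# Usage:
# write_ingress_rbac /tmp/prometheus-ingress-rbac.yaml
# kubectl apply -f /tmp/prometheus-ingress-rbac.yaml
```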
Monitoring a Redis master-replica setup with kube-prometheus
# The ruoyi namespace contains a Redis master-replica replication cluster that we want to monitor with kube-prometheus
[root@dev-master ~]# kubectl -n ruoyi get po -l app.kubernetes.io/instance=redis
NAME READY STATUS RESTARTS AGE
redis-master-0 1/1 Running 1 78d
redis-replicas-0 1/1 Running 2 78d
redis-replicas-1 1/1 Running 1 7d
redis-replicas-2 1/1 Running 1 78d
[root@dev-master ~]# kubectl -n ruoyi get svc -l app.kubernetes.io/name=redis
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-headless ClusterIP None <none> 6379/TCP 106d
redis-master ClusterIP 10.109.12.23 <none> 6379/TCP 106d
redis-replicas ClusterIP 10.109.78.217 <none> 6379/TCP 106d
# Install redis-exporter with helm
helm pull kubegemsapp/prometheus-redis-exporter --version=4.7.4
tar xf prometheus-redis-exporter-4.7.4.tgz
cd prometheus-redis-exporter/
# Edit the values.yaml file
vim values.yaml
[root@dev-master prometheus-redis-exporter]# grep -Ev '#|$^' values.yaml
rbac:
  create: false
  pspEnabled: false
serviceAccount:
  create: false
  name:
replicaCount: 1
image:
  repository: registry.cn-beijing.aliyuncs.com/kubegemsapp/redis_exporter
  tag: v1.27.0
  pullPolicy: IfNotPresent
nameOverride: redis-exporter
fullnameOverride: redis-exporter
extraArgs: {}
customLabels: {}
securityContext: {}
env: {}
service:
  type: ClusterIP
  port: 9121
  annotations: {}
  labels: {}
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 20m
    memory: 64Mi
nodeSelector: {}
tolerations: []
affinity: {}
redisAddress: redis://redis-master.ruoyi.svc:6379 # the address includes the redis namespace; this points at the redis-master service
annotations: {}
labels: {}
redisAddressConfig:
  enabled: false
  configmap:
    name: ""
    key: ""
serviceMonitor:
  enabled: true
  interval: 60s
  telemetryPath: /metrics
  timeout: 10s
  additionalLabels: {}
alerts:
  enabled: true
auth:
  enabled: true
  secret:
    name: "redis" # the redis secret to use
    key: "redis-password"
  redisPassword: "password"
[root@dev-master prometheus-redis-exporter]#
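# The auth section above references the redis secret, which lives in the ruoyi namespace, so it has to be copied into monitoring
kubectl -n ruoyi get secrets redis -oyaml > redis-secrets.yaml
vim redis-secrets.yaml # change the namespace to monitoring (and remove server-set fields such as resourceVersion and uid)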
kubectl -n monitoring create -f redis-secrets.yaml
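The manual edit of the exported Secret can be expressed as a small filter. This is a sketch that assumes the default two-space indentation of kubectl -o yaml output; the function name retarget_secret is hypothetical.

```shell
#!/usr/bin/env bash
# retarget_secret NEW_NS: read a Secret manifest on stdin, set
# metadata.namespace to NEW_NS, and drop fields populated by the API server
# (uid, resourceVersion, creationTimestamp, selfLink) so that it can be created.
# Only keys indented by exactly two spaces are touched, which keeps nested
# keys such as annotations intact. retarget_secret is a hypothetical helper.
retarget_secret() {
  local ns="$1"
  sed -E \
    -e '/^  (uid|resourceVersion|creationTimestamp|selfLink):/d' \
    -e "s/^  namespace: .*/  namespace: ${ns}/"
}

# Usage:
# kubectl -n ruoyi get secret redis -o yaml | retarget_secret monitoring | kubectl create -f -
```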
# Deploy redis-exporter
helm -n monitoring template redis ./
helm -n monitoring install redis ./
kubectl -n monitoring get po redis-exporter-75ccfb99cf-v4644 -owide
kubectl -n monitoring logs redis-exporter-75ccfb99cf-v4644
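# Query the metrics directly on the pod IP with curl
curl 10.244.0.83:9121/metrics
# redis-exporter now shows up under targets in the Prometheus server; after that, import a dashboard template in grafana
Understanding the ServiceMonitor resource
# kube-prometheus implements monitoring through a resource object called ServiceMonitor: the ServiceMonitor selects a service, and the service in turn selects the pods
# The redis setup above serves as the example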
[root@dev-master ~]# kubectl -n monitoring get servicemonitors.monitoring.coreos.com redis-exporter -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor # a custom resource (CRD) provided by prometheus-operator
metadata:
  annotations:
    meta.helm.sh/release-name: redis
    meta.helm.sh/release-namespace: monitoring
  labels:
    app: redis-exporter
    app.kubernetes.io/managed-by: Helm
  name: redis-exporter
  namespace: monitoring
spec:
  endpoints: # scrape endpoints
  - interval: 60s # scrape interval: once every 60s
    path: /metrics # scrape path
    scrapeTimeout: 10s # scrape timeout
    targetPort: 9121 # target port on the pods behind the service
  jobLabel: redis-exporter
  namespaceSelector: # namespaces in which to look for the service
    matchNames:
    - monitoring
  selector: # which service to match
    matchLabels:
      app: redis-exporter
# In short, this ServiceMonitor matches services labeled app=redis-exporter in the monitoring namespace and scrapes /metrics on port 9121
[root@dev-master ~]# kubectl -n monitoring get svc -l app=redis-exporter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-exporter ClusterIP 10.102.22.146 <none> 9121/TCP 99m
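To check from the command line that the targets selected by a ServiceMonitor are actually being scraped, the Prometheus HTTP API can be queried. A rough sketch; the function name is hypothetical, the default URL is the NodePort address used earlier in this article, and the grep-based counting is a shortcut rather than real JSON parsing.

```shell
#!/usr/bin/env bash
# target_health_summary [BASE_URL]: count scrape targets by health state using
# the /api/v1/targets endpoint of the Prometheus HTTP API.
# The default URL is the NodePort address from this article (an assumption for
# any other cluster); the function name is a hypothetical helper.
target_health_summary() {
  local base="${1:-http://192.168.118.131:32539}"
  curl -s "${base}/api/v1/targets" \
    | grep -oE '"health": *"[a-z]+"' | sort | uniq -c
}

# Usage:
# target_health_summary http://192.168.118.131:32539
```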