Monitoring an External Kubernetes Cluster with Prometheus
Background
In practice, many organizations deploy Prometheus outside the cluster, sometimes even using a single deployment to monitor several Kubernetes clusters. This is not recommended, because the volume of data Prometheus scrapes is large and consumes a lot of resources. The preferred approaches are:
- Monitor each cluster with its own Prometheus instance, then aggregate them in one place (for example Grafana, with each Prometheus as a data source);
- Build a well-resourced central Prometheus, run one instance in each cluster, and have each instance push its data to the central one.
Even so, monitoring an external Kubernetes cluster with Prometheus is a very real need.
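The push-to-central approach above is usually implemented with Prometheus remote write. A minimal sketch of the per-cluster side (the central address is hypothetical, and the central Prometheus must have its remote-write receiver enabled, e.g. with --web.enable-remote-write-receiver):

```yaml
# Added to each per-cluster Prometheus instance's configuration
remote_write:
  - url: "http://central-prometheus:9090/api/v1/write"   # hypothetical address
```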
Setup steps
Create the RBAC authorization
The manifest below must be applied on the Kubernetes cluster that Prometheus will monitor:
$ cat prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: tools
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: tools
A pitfall here: after creating the ServiceAccount, inspecting it with the following command may show no token:
$ kubectl describe secret prometheus -n tools
This is because Kubernetes auto-creates a token Secret for a ServiceAccount only in versions 1.23 and earlier. From 1.24 on, you must create the Secret yourself and bind it to the ServiceAccount:
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: prometheus
  namespace: tools
  annotations:
    kubernetes.io/service-account.name: "prometheus"
Apply the manifests:
[root@iZ2ze1ut8g7ndn5d2soajcZ ~]# kubectl apply -f prometheus-rbac.yaml
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
To delete these resources later:
kubectl delete clusterrolebinding prometheus
kubectl delete clusterrole prometheus
kubectl delete serviceaccount prometheus -n tools
Retrieve the token
$ kubectl describe secrets prometheus -n tools
Name: prometheus-token-whlv8
Namespace: tools
Labels: <none>
Annotations: kubernetes.io/service-account.name: prometheus
kubernetes.io/service-account.uid: dcf1bf81-4636-4511-9332-293a320c3d60
Type: kubernetes.io/service-account-token
Data
====
token: eyJhbGciOiJSUzI1NiIsImtpZCI6IjdSMGhJMnhFaThNSjg3SFFsRGJ1bUljR0lMbWZCR0lGWUw3SjN3WVhPT1UifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJ0b29scyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLXRva2VuLXdobHY4Iiwia3ViZXJuZXRlcy5pby9zZXJ291bnQubmFtZSI6InByb21ldGhldXMiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJkY2YxYmY4MS00Mi0yOTNhMzIwYzNkNjAiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6dG9vbHM6cHJvbWV0aGV1cyJ9.zFLLIddFuk5CfjEyFWCcNguzzmutllhtYtfuybuQdx47lQ1R_iUdMUifhySICMVJ_XcPBx1wSNVRzbikQ3DRVp4RfwxJH1vWpvX0msHa_aDzQrniEwOcg9zMNTzczJq3L8d8VengSb1_Lpri4Qnk23XlfFj2f3zgmG91nzgW276nCF4cWZfIRlHYoHgkWipqJak_GdII7dIpBpEIdy9F98uKeDwQ-meMZnBF-_KqAiQkKnsswITJV-Wn3Aofbxygqh6q1dCKJ1SrU7DMqpSKmgPFiuPSb4qxg
ca.crt: 1180 bytes
namespace: 5 bytes
Create k8s.token
On the machine running Prometheus, save the token into a k8s.token file:
[root@monitoring prometheus]# pwd
/opt/prometheus
[root@monitoring prometheus]# vim k8s.token
[root@monitoring prometheus]# cat k8s.token
eyJhbGciOiJSUzI1NiIsImtpZCI6IjdSMGhJMnhFaThNSjg3SFFsRGJ1bUljR0lMbWZCR0lGWUw3SjN3WVhPT1UifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJ0b29scyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLXRva2VuLXdobHY4Iiwia3ViZXJuZXRlcy5pby9zZXJ291bnQubmFtZSI6InByb21ldGhldXMiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJkY2YxYmY4MS00Mi0yOTNhMzIwYzNkNjAiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6dG9vbHM6cHJvbWV0aGV1cyJ9.zFLLIddFuk5CfjEyFWCcNguzzmutllhtYtfuybuQdx47lQ1R_iUdMUifhySICMVJ_XcPBx1wSNVRzbikQ3DRVp4RfwxJH1vWpvX0msHa_aDzQrniEwOcg9zMNTzczJq3L8d8VengSb1_Lpri4Qnk23XlfFj2f3zgmG91nzgW276nCF4cWZfIRlHYoHgkWipqJak_GdII7dIpBpEIdy9F98uKeDwQ-meMZnBF-_KqAiQkKnsswITJV-Wn3Aofbxygqh6q1dCKJ1SrU7DMqpSKmgPFiuPSb4qxg
[root@monitoring prometheus]#
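Before wiring the token into Prometheus, it can be sanity-checked: a service-account token is a JWT whose payload records the ServiceAccount's namespace and name. A small sketch that decodes the payload (it builds a hypothetical sample token rather than using a real one):

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the middle (payload) segment of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# A hypothetical token carrying the claims a legacy service-account token has
claims = {
    "iss": "kubernetes/serviceaccount",
    "kubernetes.io/serviceaccount/namespace": "tools",
    "kubernetes.io/serviceaccount/service-account.name": "prometheus",
}

def encode(obj) -> str:
    """Base64url-encode a JSON object with padding stripped, as JWTs do."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

sample = f'{encode({"alg": "RS256"})}.{encode(claims)}.signature'

print(jwt_payload(sample)["kubernetes.io/serviceaccount/namespace"])  # tools
```

Running the same decode on the real token from k8s.token should show namespace "tools" and service account "prometheus"; if not, the wrong Secret was copied.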
Write prometheus-server.yml
global:
  evaluation_interval: 1m
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: "metrics-data"
    scrape_interval: 15s
    scrape_timeout: 15s
    metrics_path: '/metrics'
    file_sd_configs:
      - files:
          - prometheus-metrics.yml
  # API server discovery
  - job_name: "alik3-apiservers-monitor"
    kubernetes_sd_configs:
      - role: endpoints
        api_server: https://xx.xx.7.xx:6443
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /opt/prometheus/k8s.token
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /opt/prometheus/k8s.token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
  # Node discovery
  - job_name: "alik3-nodes-monitor"
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /opt/prometheus/k8s.token
    kubernetes_sd_configs:
      - role: node
        api_server: https://xx.xxx.xxx:xx
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /opt/prometheus/k8s.token
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
        regex: "(.*)"
        replacement: "${1}"
        action: replace
        target_label: LOC
      - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
        regex: "(.*)"
        replacement: "NODE"
        action: replace
        target_label: Type
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
  # Pods in specific namespaces
  - job_name: "alik3-发现指定namespace的所有pod"
    kubernetes_sd_configs:
      - role: pod
        api_server: https://xx.xx.7.xx:xxx
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /opt/prometheus/k8s.token
        namespaces:
          names:
            - kube-system
            - business
    relabel_configs:
      - action: keep
        regex: true
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: drop
        regex: true
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
      - action: replace
        regex: (https?)
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
        replacement: __param_$1
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: drop
        regex: Pending|Succeeded|Failed|Completed
        source_labels:
          - __meta_kubernetes_pod_phase
  # Pods matching specific discovery conditions
  - job_name: "alik3-指定发现条件的pod"
    kubernetes_sd_configs:
      - role: pod
        api_server: https://xx.xx.7.xx:xxx
        tls_config:
          insecure_skip_verify: true
        bearer_token_file: /opt/prometheus/k8s.token
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
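The metrics-data job above reads its targets from prometheus-metrics.yml via file_sd_configs. That file is a list of target groups; a minimal sketch with hypothetical addresses:

```yaml
# prometheus-metrics.yml -- hypothetical example targets
- targets:
    - "10.0.0.21:9100"
    - "10.0.0.22:9100"
  labels:
    env: "prod"
```

Prometheus watches files listed under file_sd_configs and picks up changes automatically, so targets can be edited without restarting the server.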
For configuration reference and more detail, see: Prometheus theory + practice.
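The address-joining relabel rule used in both pod jobs (regex ([^:]+)(?::\d+)?;(\d+), replacement $1:$2) rewrites __address__ so the scrape uses the port from the pod's prometheus.io/port annotation. Its effect can be sketched in Python (Python spells backreferences \1 \2, while Prometheus's RE2 config syntax uses $1 $2):

```python
import re

# Prometheus joins the source_labels values with ";" before matching, so the
# input looks like "<__address__>;<annotation port>".
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

# Address with an existing port: the annotation port replaces it.
print(pattern.sub(r"\1:\2", "10.244.1.7:8080;9100"))  # 10.244.1.7:9100
# Address without a port: the annotation port is appended.
print(pattern.sub(r"\1:\2", "10.244.1.7;9100"))       # 10.244.1.7:9100
```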
Restart the Prometheus service
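How to restart depends on how Prometheus was deployed; a couple of common options (the systemd unit name is an assumption):

```shell
# If managed by systemd (hypothetical unit name):
systemctl restart prometheus

# Or reload the configuration without a full restart by sending SIGHUP:
kill -HUP "$(pidof prometheus)"

# Or via the HTTP reload endpoint (requires --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
```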
Pitfalls
Problem 1: context deadline exceeded
Get "https://192.xx.xx.xx:5444/metrics": context deadline exceeded
Possible fix: the port may simply not be open; point the scrape at a port that is exposed.
Another possibility is that the two networks do not connect. Discovery goes through the apiserver address, so the alik3-apiservers-monitor, alik3-nodes-monitor, alik3-发现指定namespace的所有pod, and alik3-指定发现条件的pod jobs all return targets, but the addresses they return are cluster-internal. If the Prometheus host cannot reach the monitored cluster's internal network, scrapes will indeed fail with exactly this error.
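To confirm whether the Prometheus host can actually reach the discovered cluster-internal addresses, a quick TCP reachability sketch (the addresses and ports below are hypothetical; substitute your apiserver, kubelet, and pod endpoints):

```python
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical endpoints: an apiserver (6443) and a pod metrics port (9100)
for host, port in [("192.168.0.10", 6443), ("10.244.1.7", 9100)]:
    state = "reachable" if reachable(host, port, timeout=1.0) else "unreachable"
    print(f"{host}:{port} {state}")
```

If the apiserver is reachable but pod or node addresses are not, discovery will list targets while every scrape times out, matching the symptom above.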