The basic approaches to monitoring with Prometheus are as follows.

For a containerized service that exposes a metrics endpoint, you can create a ServiceMonitor bound to the Service in front of the pods, so that Prometheus can scrape the relevant monitoring data.
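
As a minimal sketch (the Service name example-app, its app: example-app label, and the port name web are illustrative assumptions), such a ServiceMonitor could look like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app             # hypothetical name, adjust to your service
  namespace: monitoring
  labels:
    app: example-app
spec:
  namespaceSelector:
    matchNames:
    - default                   # namespace where the target Service lives
  selector:
    matchLabels:
      app: example-app          # must match the labels on the Service
  endpoints:
  - port: web                   # named port on the Service
    path: /metrics
    interval: 30s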

For an ordinary application without a metrics endpoint, you need an exporter: the exporter collects the monitoring data, and Prometheus scrapes the exporter's metrics endpoint, which is how such an application gets monitored.

Common open-source software such as MySQL and Redis have dedicated exporters that provide a monitoring endpoint, but for most self-developed web services you have to fall back on black-box monitoring to obtain one.
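
As an illustration of the exporter pattern, here is a hedged sketch using the public oliver006/redis_exporter image as a sidecar next to Redis (the container layout and image tags are assumptions) inside a Deployment's pod template:

      containers:
      - name: redis
        image: redis:6
        ports:
        - containerPort: 6379
      # sidecar that reads Redis stats and exposes them as Prometheus metrics
      - name: redis-exporter
        image: oliver006/redis_exporter:latest
        env:
        - name: REDIS_ADDR                  # exporter reaches Redis over localhost
          value: redis://localhost:6379
        ports:
        - containerPort: 9121               # redis_exporter's default metrics port
          name: metrics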

Deploying black-box monitoring

Deploying the blackbox exporter is simple; just apply the following YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: monitoring
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: blackbox-exporter
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
      - args:
        - --config.file=/mnt/blackbox.yml
        env:
        - name: TZ
          value: Asia/Shanghai
        - name: LANG
          value: C.UTF-8
        image: prom/blackbox-exporter:master
        imagePullPolicy: IfNotPresent
        lifecycle: {}
        name: blackbox-exporter
        ports:
        - containerPort: 9115
          name: web
          protocol: TCP
        resources:
          limits:
            cpu: 20m
            memory: 40Mi
          requests:
            cpu: 10m
            memory: 10Mi
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/zoneinfo/Asia/Shanghai
          name: tz-config
        - mountPath: /etc/localtime
          name: tz-config
        - mountPath: /etc/timezone
          name: timezone
        - mountPath: /mnt
          name: config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
          type: ""
        name: tz-config
      - hostPath:
          path: /etc/timezone
          type: ""
        name: timezone
      - configMap:
          defaultMode: 420
          name: blackbox-conf
        name: config
---
apiVersion: v1
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
      http_post_2xx:
        prober: http
        http:
          method: POST
      tcp_connect:
        prober: tcp
      pop3s_banner:
        prober: tcp
        tcp:
          query_response:
          - expect: "^+OK"
          tls: true
          tls_config:
            insecure_skip_verify: false
      ssh_banner:
        prober: tcp
        tcp:
          query_response:
          - expect: "^SSH-2.0-"
      irc_banner:
        prober: tcp
        tcp:
          query_response:
          - send: "NICK prober"
          - send: "USER prober prober prober :prober"
          - expect: "PING :([^ ]+)"
            send: "PONG ${1}"
          - expect: "^:[^ ]+ 001"
      icmp:
        prober: icmp
kind: ConfigMap
metadata:
  name: blackbox-conf
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: monitoring
spec:
  ports:
  - name: container-1-web-1
    port: 9115
    protocol: TCP
    targetPort: 9115
  selector:
    app: blackbox-exporter
  sessionAffinity: None
  type: ClusterIP
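
Assuming the manifest above is saved as blackbox-exporter.yaml, apply it and confirm the pod starts:

kubectl apply -f blackbox-exporter.yaml
kubectl -n monitoring get pods -l app=blackbox-exporter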

After deployment, test whether the blackbox exporter can probe the k8sdemo services. Note that the probe URL must be quoted in the shell, otherwise the & would background the command and silently drop the module parameter.

[root@VM-12-8-centos ~]# kubectl get svc
NAME          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
example-app   ClusterIP   10.1.97.130    <none>        8090/TCP         18d
kubernetes    ClusterIP   10.1.0.1       <none>        443/TCP          38d
mall          NodePort    10.1.111.235   <none>        8000:31687/TCP   5d4h
mysql         ClusterIP   10.1.125.65    <none>        3306/TCP         6d4h
mysql-read    ClusterIP   10.1.87.151    <none>        3306/TCP         5d10h
order         ClusterIP   10.1.215.125   <none>        7000/TCP         5d4h
passport      ClusterIP   10.1.116.84    <none>        5000/TCP         5d4h
product       ClusterIP   10.1.38.97     <none>        3000/TCP         5d4h
redis         ClusterIP   10.1.14.65     <none>        6379/TCP         5d6h
review        ClusterIP   10.1.241.227   <none>        9000/TCP         5d4h
shopcart      ClusterIP   10.1.226.250   <none>        6000/TCP         5d4h
[root@VM-12-8-centos ~]# curl 'http://10.1.20.239:9115/probe?target=10.1.226.250:6000&module=http_2xx'
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 1.2483e-05
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.000911382
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 102
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.000155884
probe_http_duration_seconds{phase="processing"} 0.000324507
probe_http_duration_seconds{phase="resolve"} 1.2483e-05
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.00010323
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 102
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.975630568e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

The test shows that the monitoring data can be scraped normally.
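
The other modules defined in blackbox.yml can be exercised the same way; for instance, a TCP probe against the redis Service from the listing above:

curl 'http://10.1.20.239:9115/probe?target=10.1.14.65:6379&module=tcp_connect'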

Configuring a new Prometheus job and alerting rules

Use the prometheus-additional (additionalScrapeConfigs) mechanism to add a new job to Prometheus, configured with the k8sdemo service endpoints:

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - http://10.1.226.250:6000
      - http://10.1.38.97:3000/healthz/ready
      - http://10.1.116.84:5000
      - http://10.1.215.125:7000/healthz/ready
      - http://10.1.111.235:8000/healthz/ready
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
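
The Prometheus Operator reads additional scrape configs from a Secret. Assuming the job definition above is saved as prometheus-additional.yaml, the Secret referenced in the next step can be created like this:

kubectl -n monitoring create secret generic additional-scrape-configs \
  --from-file=prometheus-additional.yaml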

With the Secret in place, edit the prometheus object to reference this configuration. The relabel_configs above copy each target URL into the probe's target parameter, expose it as the instance label, and point __address__ at the blackbox exporter itself:

[root@VM-12-8-centos ~]#  kubectl edit prometheus k8s -n monitoring
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2022-11-13T14:21:08Z"
  generation: 3
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
  resourceVersion: "4752391"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheuses/k8s
  uid: 98a176a4-f6fa-450c-a9c3-aa9ebe58e30e
spec:
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-scrape-configs
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  replicas: 2
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  secrets:
  - etcd-ssl
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0

The key part is this section of the spec:

  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
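
The name/key pair must match the Secret created earlier; a quick sanity check of its contents (the dot in the key has to be escaped in the jsonpath expression):

kubectl -n monitoring get secret additional-scrape-configs \
  -o jsonpath='{.data.prometheus-additional\.yaml}' | base64 -d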

Once everything is configured, wait a little while, then open the monitoring UI; the new job now shows up.
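
If the Prometheus UI is not already exposed, a port-forward is a quick way to reach it (assuming the kube-prometheus default Service name prometheus-k8s):

kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
# then browse http://localhost:9090/targets and look for the blackbox job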

Next, add a rules resource, i.e. an alerting rule, so that an alert is sent when a service becomes unhealthy.

You can run kubectl edit PrometheusRule -n monitoring and append a new rule at the end,

or create a new PrometheusRule from a YAML file:

[root@VM-12-8-centos kube-prom]# vim blackbox-rules.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: blackbox-rules
  namespace: monitoring
spec:
  groups:
  - name: blackboxjk-k8sdemo
    rules:
    - alert: curlHttpStatus
      expr: probe_http_status_code{job="blackbox"} >= 400 or probe_success{job="blackbox"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: 'Business alert: website unreachable'
        description: '{{$labels.instance}} is unreachable, please check promptly; the current value is {{$value}}'
[root@VM-12-8-centos kube-prom]# kubectl apply -f blackbox-rules.yml
prometheusrule.monitoring.coreos.com/blackbox-rules created
[root@VM-12-8-centos kube-prom]# kubectl get PrometheusRule -n monitoring
NAME                   AGE
blackbox-rules         85s
prometheus-k8s-rules   19d

Once applied, the alert rule can be seen on the Alerts page. Note that the prometheus: k8s and role: alert-rules labels on the PrometheusRule match the ruleSelector in the Prometheus spec shown earlier, which is how the operator picks the rule up.

Finally, test that alerting actually works: will an alert email be sent?

Test this by deleting the mall Deployment so that its pod goes away:

[root@VM-12-8-centos devis]# kubectl get deploy
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
mall                     1/1     1            1           2d3h
mysql                    1/1     1            1           5d8h
nfs-client-provisioner   1/1     1            1           32d
order                    1/1     1            1           47h
passport                 1/1     1            1           47h
product-v1               1/1     1            1           46h
redis                    1/1     1            1           5d5h
review                   1/1     1            1           2d
shopcart                 1/1     1            1           46h
[root@VM-12-8-centos devis]# kubectl delete deploy mall
deployment.extensions "mall" deleted
[root@VM-12-8-centos devis]# kubectl get po
NAME                                     READY   STATUS        RESTARTS   AGE
mall-987568788-jjxld                     0/1     Terminating   0          11m
mysql-85695f9484-v2jhz                   1/1     Running       0          5d8h
mysql-restore-zts24                      0/1     Completed     0          5d6h
nfs-client-provisioner-9494c5c4c-nzvcm   1/1     Running       5          17d
order-6697cfb6c7-tm5kf                   1/1     Running       0          3m32s
passport-748d9c48f6-wgszk                1/1     Running       0          47h
product-v1-5d95f79d65-fr78r              1/1     Running       0          46h
redis-756b947968-5c5rl                   1/1     Running       0          5d5h
review-5bbc4f96b-5zn7k                   1/1     Running       0          2d
shopcart-8c47b75df-wh59z                 1/1     Running       0          46h

After the deletion, an alert shows up in the UI,

and an alert email arrives in the QQ mailbox.

The mall Deployment's endpoint is on port 8000, so the alert is accurate.
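
To double-check in the Prometheus UI, you could query the failing target directly; the instance label value comes from the scrape config above, and a value of 0 confirms the probe is failing:

probe_success{instance="http://10.1.111.235:8000/healthz/ready"}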

Finally, redeploy the pod through the deploy job previously set up in Jenkins.

Shortly afterwards an alert-resolved email arrives; the test is a success.
