1 使用炫酷的grafana出图

prometheus的dashboard虽然号称拥有多种多样的图表,但是实在太简陋了,一般都用专业的grafana工具来出图
grafana官方dockerhub地址
grafana官方github地址
grafana官网

1.1 部署grafana

1.1.1 准备镜像

docker pull grafana/grafana:5.4.2
docker tag  6f18ddf9e552 harbor.zq.com/infra/grafana:v5.4.2
docker push harbor.zq.com/infra/grafana:v5.4.2

准备目录

mkdir /data/k8s-yaml/grafana
cd    /data/k8s-yaml/grafana

1.1.2 准备rbac资源清单

cat >rbac.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
rules:
- apiGroups:
  - "*"
  resources:
  - namespaces
  - deployments
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana
subjects:
- kind: User
  name: k8s-node
EOF

1.1.3 准备dp资源清单

cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: grafana
    name: grafana
  name: grafana
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      name: grafana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        name: grafana
    spec:
      containers:
      - name: grafana
        image: harbor.zq.com/infra/grafana:v5.4.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: data
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      volumes:
      - nfs:
          server: hdss7-200
          path: /data/nfs-volume/grafana
        name: data
EOF

创建frafana数据目录

mkdir /data/nfs-volume/grafana

1.1.4 准备svc资源清单

cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: infra
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
EOF

1.1.5 准备ingress资源清单

cat >ingress.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: infra
spec:
  rules:
  - host: grafana.zq.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
EOF

1.1.6 域名解析

vi /var/named/zq.com.zone
grafana            A    10.4.7.10

systemctl restart named

1.1.7 应用资源配置清单

kubectl apply -f http://k8s-yaml.zq.com/grafana/rbac.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/svc.yaml
kubectl apply -f http://k8s-yaml.zq.com/grafana/ingress.yaml

1.2 使用grafana出图

1.2.1 浏览器访问验证

访问http://grafana.zq.com,默认用户名密码admin/admin
能成功访问表示安装成功
进入后立即修改管理员密码为admin123

1.2.2 进入容器安装插件

grafana确认启动好以后,需要进入grafana容器内部,安装以下插件

kubectl -n infra exec  -it grafana-d6588db94-xr4s6 /bin/bash
# 以下命令在容器内执行
grafana-cli plugins install grafana-kubernetes-app
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install briangann-gauge-panel
grafana-cli plugins install natel-discrete-panel

1.2.3 配置数据源

添加数据源,依次点击:左侧锯齿图标-->add data source-->Prometheus
mark

添加完成后重启grafana

kubectl -n infra delete pod grafana-7dd95b4c8d-nj5cx

1.2.4 添加K8S集群信息

启用K8S插件,依次点击:左侧锯齿图标-->Plugins-->kubernetes-->Enable
新建cluster,依次点击:左侧K8S图标-->New Cluster
mark

1.2.5 查看k8s集群数据和图表

添加完需要稍等几分钟,在没有取到数据之前,会报http forbidden,没关系,等一会就好。大概2-5分钟。
mark

点击Cluster Dashboard
mark

dashboard可以直接导入优化好的json文件 

2 配置alert告警插件

2.1 部署alert插件

2.1.1 准备docker镜像

docker pull docker.io/prom/alertmanager:v0.14.0
docker tag  23744b2d645c harbor.zq.com/infra/alertmanager:v0.14.0
docker push harbor.zq.com/infra/alertmanager:v0.14.0

准备目录

mkdir /data/k8s-yaml/alertmanager
cd /data/k8s-yaml/alertmanager

2.1.2 准备cm资源清单

cat >cm.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yml: |-
    global:
      # 在没有报警的情况下声明为已解决的时间
      resolve_timeout: 5m
      # 配置邮件发送信息
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxx@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxxx'
      smtp_require_tls: false
    templates:   
      - '/etc/alertmanager/*.tmpl'
    # 所有报警信息进入后的根路由,用来设置报警的分发策略
    route:
      # 这里的标签列表是接收到报警信息后的重新分组标签,例如,接收到的报警信息里面有许多具有 cluster=A 和 alertname=LatncyHigh 这样的标签的报警信息将会批量被聚合到一个分组里面
      group_by: ['alertname', 'cluster']
      # 当一个新的报警分组被创建后,需要等待至少group_wait时间来初始化通知,这种方式可以确保您能有足够的时间为同一分组来获取多个警报,然后一起触发这个报警信息。
      group_wait: 30s

      # 当第一个报警发送后,等待'group_interval'时间来发送新的一组报警信息。
      group_interval: 5m

      # 如果一个报警信息已经发送成功了,等待'repeat_interval'时间来重新发送他们
      repeat_interval: 5m

      # 默认的receiver:如果一个报警没有被一个route匹配,则发送给默认的接收器
      receiver: default

    receivers:
    - name: 'default'
      email_configs:
      - to: 'xxxx@qq.com'
        send_resolved: true
        html: '{{ template "email.to.html" . }}' 
        headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonAnnotations.summary }}" }   
  email.tmpl: |
    {{ define "email.to.html" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{ range .Alerts }}
    告警程序: prometheus_alert <br>
    告警级别: {{ .Labels.severity }} <br>
    告警类型: {{ .Labels.alertname }} <br>
    故障主机: {{ .Labels.instance }} <br>
    告警主题: {{ .Annotations.summary }}  <br>
    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{ range .Alerts }}
    告警程序: prometheus_alert <br>
    告警级别: {{ .Labels.severity }} <br>
    告警类型: {{ .Labels.alertname }} <br>
    故障主机: {{ .Labels.instance }} <br>
    告警主题: {{ .Annotations.summary }} <br>
    触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
    恢复时间: {{ .EndsAt.Format "2006-01-02 15:04:05" }} <br>
    {{ end }}{{ end -}}
    
    {{- end }}
EOF

微信钉钉报警配置文件

config.yaml

global:
  # 在没有报警的情况下声明为已解决的时间
  resolve_timeout: 5m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
templates:   
  - '/etc/alertmanager/*.tmpl'
# 所有报警信息进入后的根路由,用来设置报警的分发策略
route:
  # 这里的标签列表是接收到报警信息后的重新分组标签,例如,接收到的报警信息里面有许多具有 cluster=A 和 alertname=LatncyHigh 这样的标签的报警信息将会批量被聚合到一个分组里面
  group_by: ['alertname', 'cluster']
  # 当一个新的报警分组被创建后,需要等待至少group_wait时间来初始化通知,这种方式可以确保您能有足够的时间为同一分组来获取多个警报,然后一起触发这个报警信息。
  group_wait: 20s
  # 当第一个报警发送后,等待'group_interval'时间来发送新的一组报警信息。
  group_interval: 20s
  # 如果一个报警信息已经发送成功了,等待'repeat_interval'时间来重新发送他们
  repeat_interval: 1h
  # 默认的receiver:如果一个报警没有被一个route匹配,则发送给默认的接收器
  receiver: ops_wechat_ding
receivers:
- name: 'ops_wechat_ding'
  wechat_configs:
    - send_resolved: true
      api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      corp_id: '********'
      agent_id: '********'
      api_secret: '********'
      to_user: '' 
      to_party: '2'
      message: '{{ template "wechat.default.message" . }}'
  webhook_configs:
    - send_resolved: true
      url: 'http://prometheus-webhook-dingtalk.monitoring.svc.cluster.local/dingtalk/ops_dingding/send'

 wechat.tmpl

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}

==========异常告警==========
告警程序: prometheus_alert
告警级别: {{ .Labels.severity }}
告警类型: {{ .Labels.alertname }}
故障实例: {{ .Labels.instance }}
告警主题: {{ .Annotations.summary }}
触发时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.kubernetes_pod_name) 0 }}
POD_NAME: {{ .Labels.kubernetes_pod_name }}
{{- end }}
============END============

{{- end }}
{{- end }}
 
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}

==========异常恢复==========
告警程序: prometheus_alert
告警级别: {{ .Labels.severity }}
告警类型: {{ .Labels.alertname }}
故障实例: {{ .Labels.instance }}
告警主题: {{ .Annotations.summary }}
触发时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.kubernetes_pod_name) 0 }}
POD_NAME: {{ .Labels.kubernetes_pod_name }}
{{- end }}
============END============

{{- end }}
{{- end }}
{{- end }}

钉钉需要安装一个官方提供的插件

 1、下载安装包制作镜像

prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

Dockerfile

ARG ARCH="amd64"
ARG OS="linux"

FROM quay.io/prometheus/busybox-${OS}-${ARCH}:latest
LABEL maintainer="zhao yun fei  <yangg_zhao@126.com>"
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

ARG ARCH="amd64"
ARG OS="linux"
RUN mkdir -p /etc/prometheus-webhook-dingtalk/
COPY prometheus-webhook-dingtalk   /bin/prometheus-webhook-dingtalk
COPY config.yml                                 /etc/prometheus-webhook-dingtalk/config.yml
COPY contrib                                            /etc/prometheus-webhook-dingtalk/

RUN mkdir -p /prometheus-webhook-dingtalk

USER       root
EXPOSE     8060
WORKDIR    /prometheus-webhook-dingtalk
#ENTRYPOINT [ "/bin/echo" ]
#CMD        [ "222" ]
ENTRYPOINT [ "/bin/prometheus-webhook-dingtalk" ]
CMD        [ "--config.file=/etc/prometheus-webhook-dingtalk/config.yml" ]

 config.yml 文件和 contrib目录使用解压出来的默认就好,后续通过configmap的形式挂载进去

configmap的config.yml

templates:
  - /etc/prometheus-webhook-dingtalk/template/*.tmpl
targets:
  ops_dingding:
    url: https://oapi.dingtalk.com/robot/send?access_token=9b1c05a411b8be10a6f6f6cf687421bb722238e6934f7052c4c7c215b3c0cb57 
    message:
      # Use legacy template
      title: '{{ template "ops.title" . }}'
      text: '{{ template "ops.content" . }}'

 钉钉插件的告警模板dingding.tmpl

{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}

{{ define "__ops_alert_list" }}{{ range . }}
---
**告警程序**: prometheus_alert

**告警级别**: {{ .Labels.severity }}

**告警类型**: {{ .Labels.alertname }}

**故障实例**: {{ .Labels.instance }}

**告警主题**: {{ .Annotations.summary }}

**告警时间**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**POD_NAME**: {{ .Labels.kubernetes_pod_name }}
{{ end }}{{ end }}

{{ define "__ops_resolved_list" }}{{ range . }}
---
**告警程序**: prometheus_alert

**告警级别**: {{ .Labels.severity }}

**告警类型**: {{ .Labels.alertname }}

**故障实例**: {{ .Labels.instance }}

**告警主题**: {{ .Annotations.summary }}

**告警时间**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**恢复时间**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

**POD_NAME**: {{ .Labels.kubernetes_pod_name }}
{{ end }}{{ end }}

{{ define "ops.title" }}
{{ template "__subject" . }}
{{ end }}

{{ define "ops.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====侦测到{{ .Alerts.Firing | len  }}个故障====**
{{ template "__ops_alert_list" .Alerts.Firing }}
---
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
**====恢复{{ .Alerts.Resolved | len  }}个故障====**
{{ template "__ops_resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

  

2.1.3 准备dp资源清单

cat >dp.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: harbor.zq.com/infra/alertmanager:v0.14.0
        args:
          - "--config.file=/etc/alertmanager/config.yml"
          - "--storage.path=/alertmanager"
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-cm
          mountPath: /etc/alertmanager
      volumes:
      - name: alertmanager-cm
        configMap:
          name: alertmanager-config
      imagePullSecrets:
      - name: harbor
EOF

2.1.4 准备svc资源清单

cat >svc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: infra
spec:
  selector: 
    app: alertmanager
  ports:
    - port: 80
      targetPort: 9093
EOF

2.1.5 应用资源配置清单

kubectl apply -f http://k8s-yaml.zq.com/alertmanager/cm.yaml
kubectl apply -f http://k8s-yaml.zq.com/alertmanager/dp.yaml
kubectl apply -f http://k8s-yaml.zq.com/alertmanager/svc.yaml

2.2 K8S使用alert报警

2.2.1 k8s创建基础报警规则文件

cat >/data/nfs-volume/prometheus/etc/rules.yml <<'EOF'
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
  - alert: OutOfInodes
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of inodes (instance {{ $labels.instance }})"
      description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
  - alert: OutOfDiskSpace
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadLatency
    expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
  - alert: UnusualDiskWriteLatency
    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completedl[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
- name: http_status
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Probe failed (instance {{ $labels.instance }})"
      description: "Probe failed (current value: {{ $value }})"
  - alert: StatusCode
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Status Code (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399 (current value: {{ $value }})"
  - alert: SslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days (current value: {{ $value }})"
  - alert: SslCertificateHasExpired
    expr: probe_ssl_earliest_cert_expiry - time()  <= 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "SSL certificate has expired (instance {{ $labels.instance }})"
      description: "SSL certificate has expired already (current value: {{ $value }})"
  - alert: BlackboxSlowPing
    expr: probe_icmp_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow ping (instance {{ $labels.instance }})"
      description: "Blackbox ping took more than 2s (current value: {{ $value }})"
  - alert: BlackboxSlowRequests
    expr: probe_http_duration_seconds > 2 
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow requests (instance {{ $labels.instance }})"
      description: "Blackbox request took more than 2s (current value: {{ $value }})"
  - alert: PodCpuUsagePercent
    expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
EOF

2.2.2 K8S 更新配置

在prometheus配置文件中追加配置:

cat >>/data/nfs-volume/prometheus/etc/prometheus.yml <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager"]
rule_files:
 - "/data/etc/rules.yml"
EOF

重载配置:

curl -X POST http://prometheus.zq.com/-/reload

mark

以上这些就是我们的告警规则

2.2.3 测试告警

把test命名空间里的dubbo-demo-service给停掉

blackbox里信息已报错,alert里面项目变黄了
mark
等到alert中项目变为红色的时候就开会发邮件告警
mark

如果需要自己定制告警规则和告警内容,需要研究一下promql,自己修改配置文件。

Logo

K8S/Kubernetes社区为您提供最前沿的新闻资讯和知识内容

更多推荐