Prometheus, Grafana, and cAdvisor (notes)

0914_h · published 2021-01-08 10:12:02

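1. Monitoring:

A means of ensuring the reliability of business systems;

reports the current state of the system;

raises alerts early so that a fault does not spread; historical monitoring data helps trace a problem back to its cause;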

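2. Mainstream monitoring systems: Xiaomi's monitoring system, Zabbix, and Prometheus

Zabbix: friendly web UI, and all configuration can be done in it; plenty of documentation; mature; learning cost of about one week; alerting is also very mature (tiered alerts); mature architecture (proxies placed in different data centers collect data well and relieve the database-side bottleneck); suited to monitoring traditional machines;

Prometheus: less user-friendly, since all configuration has to be written by hand into configuration files; has mature solutions for monitoring k8s and Docker; supports automatic service discovery on k8s and monitors the various k8s resource objects (pods, nodes, etc.); suited to monitoring containers;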

3. A trend in operations:

Container operations

4. Dashboards

Use Grafana.

PromQL can also be written in Grafana to query the data in Prometheus.

5. Concepts:

Instance: a target that Prometheus monitors (one scraped endpoint);

Job: a collection of instances; it represents a group;
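The relationship between jobs and instances can be pictured as a simple grouping. The sketch below is only an illustration; the job names and addresses are made-up examples, not taken from any real setup.

```python
from collections import defaultdict

# Hypothetical (job, instance) pairs, for illustration only.
TARGETS = [
    ("prometheus", "localhost:9090"),
    ("node", "10.0.0.1:9100"),
    ("node", "10.0.0.2:9100"),
]


def group_by_job(pairs):
    """Group instance addresses under the job they belong to."""
    jobs = defaultdict(list)
    for job, instance in pairs:
        jobs[job].append(instance)
    return dict(jobs)


jobs = group_by_job(TARGETS)
for job, instances in jobs.items():
    print(job, "->", instances)
```

Each key is a job and each value is the list of instances in that job, which mirrors how every scraped sample ends up with both a job label and an instance label.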

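6. Viewing the Prometheus configuration:

Commands:

(6.1) List the ConfigMaps

kubectl get configmap -n <namespace>

(6.2) View the contents of one ConfigMap:

kubectl describe configmap <configmap-name> -n <namespace>

(6.3) Alerting is handled by the Alertmanager component; the configuration has to state where that component is located

(6.4) rule_files lists the alerting rules (written in PromQL); when a rule fires, the alert is sent to Alertmanager, which then notifies the responsible people

(6.5) scrape_configs configures the monitored targets;

https://www.cnblogs.com/jarno/p/11793710.html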
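To make the three configuration parts above concrete, here is a minimal sketch that renders a small prometheus.yml as text. The addresses, the Alertmanager port 9093, the rule file path, and the 15s scrape interval are assumptions chosen for illustration; a real configuration will differ.

```python
def render_prometheus_config(job_name, targets, alertmanager, rule_file):
    """Return a minimal prometheus.yml as text (illustrative values only)."""
    target_list = ", ".join(f"'{t}'" for t in targets)
    return (
        "global:\n"
        "  scrape_interval: 15s\n"           # assumed interval
        "alerting:\n"                        # (6.3) where Alertmanager lives
        "  alertmanagers:\n"
        "    - static_configs:\n"
        f"        - targets: ['{alertmanager}']\n"
        "rule_files:\n"                      # (6.4) alerting rules
        f"  - '{rule_file}'\n"
        "scrape_configs:\n"                  # (6.5) monitored targets
        f"  - job_name: '{job_name}'\n"
        "    static_configs:\n"
        f"      - targets: [{target_list}]\n"
    )


# Hypothetical addresses and paths.
config_text = render_prometheus_config(
    job_name="cadvisor",
    targets=["192.168.1.10:8090"],
    alertmanager="192.168.1.10:9093",
    rule_file="rules/*.yml",
)
print(config_text)
```

The printed text can be compared against the ConfigMap contents shown by kubectl describe to locate the same three sections.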

7. Prometheus deployment options:

It can be deployed as a container, through YAML manifests, or as a binary;

8. Deploying Prometheus locally:

Reference: https://prometheus.io/docs/prometheus/latest/installation/

step1: prepare prometheus.yml

step2:

docker run --name prometheus -p 9090:9090 -v /Users/xxx/cache-icode/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

step3: open http://localhost:9090/graph?g0.expr=&g0.tab=1&g0.stacked=0&g0.range_input=1h in a browser

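9. Which metrics can be monitored for a container?

Memory, CPU, disk, network, state, ports, and so on

10. How are container metrics obtained?

docker stats <container-name>  # interactive, keeps refreshing

docker stats --no-stream <container-name>  # non-interactive, prints one snapshot

docker stats --no-stream <container-name> | awk 'NR!=1{print $3}'  # skip the header line and print only the third column (CPU %)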
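The awk filter above can be mirrored in a few lines of Python when the values need further processing. This is a sketch only: the sample text below is a made-up snapshot in the usual docker stats column layout, not real output.

```python
def cpu_percent_column(stats_output):
    """Mimic awk 'NR!=1{print $3}': skip the header, take the third field."""
    values = []
    for line in stats_output.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 3:
            # The third whitespace-separated field is the CPU % column.
            values.append(float(fields[2].rstrip("%")))
    return values


# Made-up snapshot for illustration.
SAMPLE = (
    "CONTAINER ID   NAME       CPU %     MEM USAGE / LIMIT     MEM %\n"
    "0123456789ab   cadvisor   1.25%     45.3MiB / 7.68GiB     0.58%\n"
)

print(cpu_percent_column(SAMPLE))
```

In practice the input would come from running docker stats --no-stream and capturing its standard output.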

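11. Metrics scraped by Prometheus follow a data model

prometheus_api_remote_read_queries{instance="localhost:9090", job="prometheus"}

Only data in this format can be scraped and stored by Prometheus; the monitored side also has to expose an HTTP endpoint so that Prometheus can scrape the data in real time;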
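The data model is a metric name, an optional set of key="value" labels, and a numeric value. The following sketch parses one such line to show the three parts; the trailing value 0 is an assumed sample value, and the parser deliberately ignores optional timestamps and comment lines.

```python
import re

# metric_name{label="value", ...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# The value 0 at the end is assumed for illustration.
LINE = 'prometheus_api_remote_read_queries{instance="localhost:9090", job="prometheus"} 0'
name, labels, value = parse_sample(LINE)
print(name, labels, value)
```

An endpoint that returns lines of this shape over HTTP is what Prometheus expects from a monitored target.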

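12. Prometheus collection options:

Option 1: as in section 11

Option 2: cAdvisor, Google's open-source system for collecting container resource usage and performance data (it is built into k8s, where it also collects the resource usage of pod containers); https://github.com/google/cadvisor; it runs as a container

cAdvisor only collects data and does not store it (no historical data can be viewed, only real-time data). Its metrics endpoint (http://localhost:8090/metrics) exposes metrics in the data model Prometheus accepts, one metric per line; Prometheus scrapes them and then stores them in its built-in TSDB

In the Prometheus configuration, a scrape target defaults to the http scheme and the /metrics path; both can be omitted, so only the IP and port need to be written;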
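How a bare IP:port target expands into a full scrape URL can be sketched as follows. The function name is hypothetical; the two defaults match the ones described above.

```python
def scrape_url(target, scheme="http", metrics_path="/metrics"):
    """Expand an 'ip:port' target into the URL that gets scraped."""
    return f"{scheme}://{target}{metrics_path}"


print(scrape_url("localhost:8090"))
print(scrape_url("10.0.0.1:9100", scheme="https", metrics_path="/custom"))
```

Overriding the two keyword arguments corresponds to setting scheme and metrics_path explicitly for a job instead of relying on the defaults.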

13. Starting cAdvisor:

(1) How is cAdvisor started when host port 8080 cannot be used and another port is needed?

docker run -d --volume=/:/rootfs:ro --volume=/var/run:/var/run:ro --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --volume=/dev/disk/:/dev/disk:ro --publish=8090:8080 --name=cadvisor google/cadvisor:latest

Mistake: publishing 8090 for both the host port and the container port (8090:8090) does not work,

because the process inside the container listens on port 8080;

(2) Metrics exposed by cAdvisor: http://localhost:8090/metrics

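14. Starting Grafana to display the data collected by Prometheus:

Official documentation: https://grafana.com/tutorials/grafana-fundamentals/?utm_source=grafana_gettingstarted#1

https://grafana.com/docs/grafana/latest/getting-started/getting-started/

Grafana is dedicated to displaying visual dashboards;

docker run -d --name grafana -p 3000:3000 grafana/grafana

(1) By default, the Grafana login user name is admin and the password is admin

(2) Add a data source;

(3) Add a dashboard (the Grafana dashboard library can be used): https://grafana.com/grafana/dashboards

Docker monitoring dashboard template ID: 193

It can be imported directly in Grafana; the location is shown in the figure below;

(4) Final result;

(5) Problems:

Question: how do you find the ID of a template?

Import the dashboard template:

If Prometheus has no Docker data (from cAdvisor), check whether the State of the docker target (which reflects the status of the monitored server) is UP. Here it was not UP; the cause was that localhost in the target address had to be replaced with the host IP (inside the Prometheus container, localhost refers to the container itself, not to the host):

After the change, the metrics collected by cAdvisor became visible in Prometheus: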

15. When Grafana shows no graph, there are roughly 3 cases:

(1) there is no data

(2) the time range is wrong

(3) the PromQL statement is wrong

16. Monitoring hosts with Prometheus:

Monitoring a host requires downloading the corresponding exporter;

17. Installing Grafana on k8s:

Reference: https://www.qikqiak.com/k8s-book/docs/56.Grafana%E7%9A%84%E5%AE%89%E8%A3%85%E4%BD%BF%E7%94%A8.html

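18. Prometheus notes:

1). Start Prometheus in the background: /bin/prometheus --config.file=/etc/prometheus/prometheus.yml &

2). The command above does not prove that Prometheus started successfully; to check, see whether the Prometheus port is listening, using lsof -i:9090

3). View the Prometheus startup log: docker logs <container-id>

4). A Prometheus target has the form http://ip:port/<path>; this is the endpoint Prometheus scrapes, and it returns monitoring data that follows the Prometheus data model;

5). Install the node_exporter component (https://prometheus.io/download/) on the monitored host. It collects the monitoring data Prometheus needs, so that Prometheus can see the metrics of the node; it listens on port 9100 by default and collects information about the Linux host. The IP and port of node_exporter are configured as a target in Prometheus, which then pulls the monitoring data of the Linux host;

nohup xxxx &  # nohup keeps the process running after the terminal is closed (the process ignores the hangup signal); & runs it in the background

Also check whether the port is already in use;

Question: what is the difference between a process being closed together with the terminal and a process running in the background?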

6). Prometheus-related documents:

Analysis of common K8S metrics: https://yasongxu.gitbook.io/container-monitor/yi-.-kai-yuan-fang-an/di-2-zhang-prometheus/metric

Prometheus: getting started and practice https://developer.ibm.com/zh/depmodels/cloud/articles/cl-lo-prometheus-getting-started-and-practice/

https://docs.signalfx.com/en/latest/integrations/agent/monitors/cadvisor.html

Container monitoring: cAdvisor https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/exporter/commonly-eporter-usage/use-prometheus-monitor-container

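19. Using PromQL:

1). For the same service on the same node, CPU time is measured and recorded separately for each CPU. To get the total CPU usage of the container service, the values of all CPUs therefore have to be summed:

sum(rate(container_cpu_usage_seconds_total{name=~"xxxxx"}[5m]))

See: https://blog.csdn.net/palet/article/details/82941402

Question: the CPU metric is a counter, which is why the Grafana metrics query uses the rate function

container_cpu_usage_seconds_total is the cumulative CPU time used by the container; dividing its rate of increase by the total CPU time available gives the container's CPU usage; see: https://www.lijiaocn.com/%E6%8A%80%E5%B7%A7/2018/09/14/prometheus-compute-kubernetes-container-cpu-usage.html

PromQL also has a large number of built-in functions; for example, rate() computes how fast the sample data changes per unit of time, i.e. the growth rate. https://www.cnblogs.com/flytor/p/11440759.html

These metrics are all of counter type, so rate has to be applied to them to get a usage figure; https://yasongxu.gitbook.io/container-monitor/yi-.-kai-yuan-fang-an/di-2-zhang-prometheus/metric

by (xxx) means "per xxx": it shows the metric separately for each value of xxx;

2). The query sum(rate(container_cpu_usage_seconds_total[5m])) by (grafana) can be tried locally (with Prometheus deployed locally in Docker);

3). rate gives the per-second average over the time window

sum adds up those per-second averages across the selected series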
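The arithmetic behind sum(rate(...)) can be sketched with plain numbers. This is a simplified model, not how Prometheus implements rate(): it takes only the first and last sample in the window and ignores counter resets and extrapolation. The per-CPU sample values are invented for illustration.

```python
def simple_rate(samples):
    """Per-second average increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no counter-reset handling, no extrapolation.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Invented cumulative CPU seconds for two CPUs over a 5-minute window.
per_cpu_samples = {
    "cpu00": [(0, 100.0), (300, 130.0)],
    "cpu01": [(0, 50.0), (300, 65.0)],
}

per_cpu_rate = {cpu: simple_rate(s) for cpu, s in per_cpu_samples.items()}
total_rate = sum(per_cpu_rate.values())  # the sum(...) step

print(per_cpu_rate)
print(total_rate)
```

The per-CPU rates are in CPU seconds per second, so the total can be read as the number of cores the container kept busy on average during the window.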

4). https://www.qikqiak.com/k8strain/monitor/prometheus/

CPU quota (the container's CPU allowance) = number of CPUs × 100,000 (the default CFS period is 100,000 microseconds)

e.g. with a limit of 2, the CPU quota is 200,000;

A CPU request of 100m is equivalent to 0.1 CPU

Container CPU limit: container_spec_cpu_quota / container_spec_cpu_period (see: https://www.cnblogs.com/JetpropelledSnake/p/10097977.html)

https://stackoverflow.com/questions/40327062/how-to-calculate-containers-cpu-usage-in-kubernetes-with-prometheus-as-monitori

rate: https://songjiayang.gitbooks.io/prometheus/content/promql/summary.html

sum(rate: https://www.lijiaocn.com/%E6%8A%80%E5%B7%A7/2018/09/14/prometheus-compute-kubernetes-container-cpu-usage.html

Container CPU usage = per-second rate of the container's cumulative CPU time / number of CPUs allocated to the container (i.e. the total CPU time available per second)

My own understanding, which may or may not be right (key="value" stands for whatever label matcher is used): container CPU usage = sum(rate(container_cpu_usage_seconds_total{key="value"}[5m])) by (pod) /

sum(container_spec_cpu_quota{key="value"}/container_spec_cpu_period{key="value"}) by (pod)

Reference for the individual metrics: https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md
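The quota, period, and millicore arithmetic above can be checked with a short sketch. The function names are hypothetical, and the 0.5 cores of usage is an invented input standing in for the result of the rate query.

```python
DEFAULT_PERIOD_US = 100_000  # default CFS period in microseconds


def cpu_limit_cores(quota, period=DEFAULT_PERIOD_US):
    """container_spec_cpu_quota / container_spec_cpu_period."""
    return quota / period


def millicores_to_cores(millicores):
    """Convert a Kubernetes CPU value such as 100m into cores."""
    return millicores / 1000


def cpu_usage_ratio(usage_cores, quota, period=DEFAULT_PERIOD_US):
    """Fraction of the CPU limit that is in use."""
    return usage_cores / cpu_limit_cores(quota, period)


print(cpu_limit_cores(200_000))        # limit of 2 CPUs
print(millicores_to_cores(100))        # request of 100m
print(cpu_usage_ratio(0.5, 200_000))   # invented usage of 0.5 cores
```

Multiplying the ratio by 100 gives the percentage that a Grafana panel would typically display.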

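Note: this chapter was written while following the course by 阿良 (k8s evangelist): https://www.bilibili.com/video/BV1T54y1Q7VY?from=search&seid=2586591700088843890

Monitoring the envoy container on k8s: https://cloud.tencent.com/developer/article/1472420

Before doing anything, check whether something already exists that meets the current need, to avoid reinventing the wheel;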