Deploying Prometheus + Grafana + Node_exporter: Monitoring Servers and Microservices with Email Alerting
Deployment Environment
IP | Component |
---|---|
192.168.146.18 | Prometheus & Grafana & node_exporter |
192.168.146.19 | node_exporter |
192.168.146.17 | node_exporter |
Deploying Prometheus
1. Download
https://prometheus.io/download/
2. Install
Upload prometheus-2.33.1.linux-amd64.tar.gz to /home/software
tar -xvf prometheus-2.33.1.linux-amd64.tar.gz
mv prometheus-2.33.1.linux-amd64 prometheus # rename the extracted directory
Verify the installation:
./prometheus --version
Add a group:
groupadd prometheus
Create a user:
useradd -g prometheus prometheus
Change ownership:
chown -R prometheus:prometheus /home/software/prometheus/
Add a systemd service:
vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target
[Service]
# With Type=notify the service keeps restarting, so use simple
Type=simple
User=prometheus
# --storage.tsdb.path is optional; by default data is stored in ./data under the working directory
ExecStart=/home/software/prometheus/prometheus --config.file=/home/software/prometheus/prometheus.yml --storage.tsdb.path=/home/software/prometheus
Restart=on-failure
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable prometheus.service (enable start on boot)
systemctl start prometheus.service
Deploying Node_exporter
1. Download node_exporter
From the Prometheus download page (https://prometheus.io/download/)
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64 node_exporter
2. Install node_exporter
(If node_exporter and Prometheus are not on the same machine, create the prometheus user first, otherwise the service will not start.)
Add a group:
groupadd prometheus
Create a user:
useradd -g prometheus prometheus
Change ownership:
chown -R prometheus:prometheus /home/software/node_exporter/
Note: when changing ownership, check the parent directory's owner; if it is owned by a non-root user, move the directory under one whose parent is owned by root.
Add a systemd service:
vim /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/home/software/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable node_exporter
3. Prometheus configuration
vim /home/software/prometheus/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s # scrape (pull) interval; the default is 1m
  evaluation_interval: 15s # rule evaluation interval; the default is 1m
  # scrape_timeout is set to the global default (10s).
# Alerting configuration; not used yet, defaults kept
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Rule files, loaded and evaluated periodically at the interval set above; not used yet, defaults kept
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape (pull) targets, i.e. what to monitor
# By default only the Prometheus host itself is monitored
scrape_configs:
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # added:
  - job_name: 'linux'
    static_configs:
      - targets: ['10.200.114.162:9100'] # IP of the monitored server
Restart:
systemctl restart prometheus.service
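Optionally, the edited file can be checked for syntax errors before the restart with promtool, which ships in the Prometheus tarball:

```shell
./promtool check config /home/software/prometheus/prometheus.yml
```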
Grafana Deployment and Usage
1. Download and install
# change to the download directory
cd /home/software
wget https://dl.grafana.com/oss/release/grafana-6.6.0-1.x86_64.rpm
# install with yum
yum localinstall grafana-6.6.0-1.x86_64.rpm
# start & enable start on boot
systemctl start grafana-server
systemctl enable grafana-server
Open http://192.168.146.18:3000/ in a browser.
2. Configure the data source and dashboards
Configure the data source:
Open the configuration menu, click Add data source, and select Prometheus.
Fill in the URL, then click Save & Test.
Configure a dashboard:
Enter the dashboard ID, click Load, select Prometheus as the data source, and click Import.
Recommended dashboard IDs: 8919 for servers, 10280 for microservices.
Microservice Monitoring
1. Get all services from the registry
Dependencies:
<dependency>
<groupId>net.dreamlu</groupId>
<artifactId>mica-prometheus</artifactId>
<version>2.6.3</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
Usage: https://gitee.com/596392912/mica/tree/master/mica-prometheus
A controller that exposes the discovered services as a target list:
import com.talkweb.twiot.device.pojo.grafana.TargetGroup;
import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.discovery.DiscoveryClient;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import java.util.*;
/**
 * @program: tw-iot
 * @description: prometheus
 * @author: LiuZhuzheng
 * @create: 2022-02-11 16:47
 **/
@RestController
@RequestMapping("actuator/prometheus")
public class PrometheusApi {

    private final DiscoveryClient discoveryClient;
    private final ApplicationEventPublisher eventPublisher;

    public PrometheusApi(DiscoveryClient discoveryClient, ApplicationEventPublisher eventPublisher) {
        this.discoveryClient = discoveryClient;
        this.eventPublisher = eventPublisher;
    }

    @GetMapping("sd")
    public List<TargetGroup> getList() {
        List<String> serviceIdList = discoveryClient.getServices();
        if (serviceIdList == null || serviceIdList.isEmpty()) {
            return Collections.emptyList();
        }
        List<TargetGroup> targetGroupList = new ArrayList<>();
        for (String serviceId : serviceIdList) {
            List<ServiceInstance> instanceList = discoveryClient.getInstances(serviceId);
            List<String> targets = new ArrayList<>();
            for (ServiceInstance instance : instanceList) {
                targets.add(String.format("%s:%d", instance.getHost(), instance.getPort()));
            }
            Map<String, String> labels = new HashMap<>(2);
            labels.put("__meta_prometheus_job", serviceId);
            targetGroupList.add(new TargetGroup(targets, labels));
        }
        return targetGroupList;
    }
}
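For reference (not shown in the original post), Prometheus's http_sd_configs expects the endpoint above to return JSON of the following shape; the service name and address here are made up:

```json
[
  {
    "targets": ["192.168.214.210:48201"],
    "labels": { "__meta_prometheus_job": "device-service" }
  }
]
```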
TargetGroup bean class:
package com.talkweb.twiot.device.pojo.grafana;

import java.util.List;
import java.util.Map;

/**
 * @program: tw-iot
 * @description: grafana
 * @author: LiuZhuzheng
 * @create: 2022-02-11 16:45
 **/
/**
 * Prometheus http sd model
 *
 * @author L.cm
 */
public class TargetGroup {

    private final List<String> targets;
    private final Map<String, String> labels;

    public TargetGroup(List<String> targets, Map<String, String> labels) {
        this.targets = targets;
        this.labels = labels;
    }

    public List<String> getTargets() {
        return targets;
    }

    public Map<String, String> getLabels() {
        return labels;
    }
}
Configure the application yml file:
management:
  endpoints:
    web:
      exposure:
        include: "*"
  endpoint:
    health:
      show-details: always
    prometheus:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
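With this in place, GET /actuator/prometheus on the service returns metrics in the Prometheus text exposition format, roughly like the sketch below (the application label value comes from the metrics.tags setting above; the metric value is made up):

```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{application="device-service",area="heap",id="PS Old Gen"} 1.2345678E7
```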
Configure Prometheus:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'linux'
    static_configs:
      - targets: ['192.168.146.17:9101','192.168.146.18:9101','192.168.146.19:9101','192.168.141.77:9101'] # IPs of the monitored servers
  # added:
  - job_name: 'micax-cloud'
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /actuator/prometheus
    scheme: http
    http_sd_configs:
      - url: 'http://192.168.214.209:48200/actuator/prometheus/sd'
    static_configs:
      - targets: ['192.168.214.209:48200']
After starting, all services are listed, but until each microservice is configured (next step) their status shows as down.
2. Service registration
Dependencies:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Configure the yml file the same way as above.
Configure Prometheus:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'linux'
    static_configs:
      - targets: ['192.168.146.17:9101','192.168.146.18:9101','192.168.146.19:9101','192.168.141.77:9101'] # IPs of the monitored servers
  # added:
  - job_name: 'micax-cloud'
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /actuator/prometheus
    scheme: http
    # http_sd_configs:
    #   - url: 'http://192.168.214.209:48200/actuator/prometheus/sd'
    static_configs:
      - targets: ['192.168.214.209:48200']
The service status now shows as up.
Docker Container Monitoring (Prometheus pulls metrics directly)
1. Pull the image
cAdvisor can monitor Docker and Kubernetes containers.
Pull the image:
docker pull google/cadvisor:latest
Note: on ARM, use this image instead:
docker pull budry/cadvisor-arm
Export the image:
docker save google/cadvisor:latest > cadvisor_latest_x86_x64.tar
Import the image:
docker load -i cadvisor_latest_arm64.tar
2. Configuration options
--global_housekeeping_interval=1m0s: Interval between global housekeepings
--housekeeping_interval=1s: Interval between container housekeepings
--max_housekeeping_interval=1m0s: Largest interval to allow between container housekeepings (default 1m0s)
Full option reference: https://github.com/google/cadvisor/tree/master/docs
3. Run
docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=18381:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest \
--housekeeping_interval=10s
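Once Prometheus scrapes cAdvisor (next step), per-container metrics become queryable. As a sketch, a per-container CPU usage query looks like this; the name label value depends on your own containers:

```
rate(container_cpu_usage_seconds_total{name="cadvisor"}[5m])
```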
4. Update the Prometheus configuration
vim prometheus.yml
Add the following job:
- job_name: 'edge-docker'
  static_configs:
    - targets: ['192.168.253.4:18381']
Restart Prometheus:
systemctl restart prometheus
Docker Container Monitoring (via Pushgateway)
Compared with Prometheus pulling node data directly, this approach suits scenarios where Prometheus cannot reach the nodes: each node periodically pushes its data to a Pushgateway, and Prometheus only has to pull from the Pushgateway.
1. Download and install Pushgateway
Download: https://github.com/prometheus/pushgateway/releases
Install: extract the package for your architecture to /opt/pushgateway.
Create the service: vi /etc/systemd/system/pushgateway.service
[Unit]
Description=pushgateway
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/pushgateway/pushgateway
Restart=on-failure
[Install]
WantedBy=multi-user.target
Then start it:
systemctl daemon-reload
### enable start on boot
systemctl enable pushgateway.service
### start
systemctl start pushgateway.service
### check the status
systemctl status pushgateway.service
View the UI in a browser: http://ip:9091
2. Push node data to the Pushgateway (collecting edge cAdvisor data)
1. Deploy cAdvisor on the edge node
See the previous section (Docker Container Monitoring (Prometheus pulls metrics directly)).
2. Write the push script
vim /opt/prometheus/push_node_exporter_metrics.sh
#!/bin/bash
## address of the Pushgateway in the cloud
PUSHGATEWAY_SERVER=http://ip:9091
NODE_NAME=$(hostname)
curl -s localhost:18381/metrics | curl --data-binary @- $PUSHGATEWAY_SERVER/metrics/job/edge_linux/instance/$NODE_NAME
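As a sketch of what the script sends: the Pushgateway groups pushed metrics by the job and instance labels encoded in the URL path, so a later push with the same job/instance pair replaces the previous data for that group. The server address below is a made-up example:

```shell
# Build the push URL the same way the script above does
# (PUSHGATEWAY_SERVER is a hypothetical example address).
PUSHGATEWAY_SERVER=http://192.168.146.18:9091
NODE_NAME=$(hostname)
PUSH_URL="$PUSHGATEWAY_SERVER/metrics/job/edge_linux/instance/$NODE_NAME"
echo "$PUSH_URL"
```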
3. Set up a cron job
$ sudo crontab -e
Add:
# push node data to the Prometheus Pushgateway every minute
*/1 * * * * /opt/prometheus/push_node_exporter_metrics.sh &> /dev/null
3. Update the Prometheus configuration to pull from the Pushgateway
Add a job:
- job_name: 'edge-linux'
  honor_labels: true
  static_configs:
    - targets: ['127.0.0.1:9091']
Restart:
systemctl restart prometheus
Alerting
1. Download and deploy Alertmanager
From the Prometheus download page.
Extract:
tar -zxvf alertmanager-0.23.0.tar.gz
Create a service:
vim /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target
[Service]
Restart=on-failure
ExecStart=/home/software/alertmanager-0.23.0/alertmanager --config.file=/home/software/alertmanager-0.23.0/alertmanager.yml
[Install]
WantedBy=multi-user.target
Enable start on boot:
systemctl daemon-reload
systemctl enable alertmanager.service
Configure Alertmanager to send email:
vim alertmanager.yml
global: # global configuration
  resolve_timeout: 10s # how long before an alert with no updates is marked resolved; the default is 5m
  smtp_smarthost: 'smtp.qq.com:465' # SMTP server; if ports 465 and 25 fail, try 587; port 25 requires smtp_require_tls: false
  smtp_from: '253237****@qq.com' # sender address
  smtp_auth_username: '253237****@qq.com' # account
  smtp_auth_password: 'uolonajbvvro****' # password or authorization code
  smtp_require_tls: false # whether to require TLS
templates: # HTML email templates
  - 'template/*.tmpl'
route: # alert routing / dispatch policy
  group_by: ['alertname']
  group_wait: 10s # wait 10s after an alert first fires so alerts in the same group are sent together
  group_interval: 10s # interval between notification batches for the same group
  repeat_interval: 1h # interval before a still-firing alert is re-sent; reduces duplicate mail
  receiver: 'email' # notification channel
receivers: # notification channels; route->receiver matches receivers->name
  - name: 'email'
    email_configs:
      - to: 'liu****@****.com.cn' # recipient
        # html: '{{ template "test.html" . }}' # email body template; the default template is used if unset
        headers: { Subject: "Alert mail" } # email subject
        send_resolved: true # also notify when the alert is resolved
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://192.168.146.18:5001/'
inhibit_rules: # alert inhibition rules
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
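The amtool binary shipped in the Alertmanager tarball can validate this file before restarting (run from the Alertmanager directory):

```
./amtool check-config alertmanager.yml
```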
Download the HTML alert template:
mkdir template
cd template
wget https://raw.githubusercontent.com/prometheus/alertmanager/master/template/default.tmpl
2. Configure Prometheus
vim prometheus-2.33.1/prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.146.18:9093']
        # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# alert rule files
rule_files:
  - 'rules/*.yml'
  # - "first_rules.yml"
  # - "second_rules.yml"
3. Create alert rules
Create a rules directory under the Prometheus directory:
mkdir rules
cd rules
vim rules.yml
groups:
  - name: node_down # alert group name
    rules:
      - alert: NodeDown # alert name
        expr: up == 0 # alert condition; see the Prometheus query documentation
        for: 10s # how long the condition must hold before the alert is sent to Alertmanager
        labels: # extra labels
          severity: "critical"
        annotations:
          summary: "Instance {{ $labels.instance }} is down" # put the key alert details in summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 seconds." # alert description
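Further rules follow the same pattern. As a sketch (not from the original post), a memory alert built from standard node_exporter metrics might look like:

```yaml
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 2m
        labels:
          severity: "warning"
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage above 90%"
```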
4. Customize the alert email template
Update the Prometheus service configuration first:
expose the server address to Prometheus via --web.external-url so that the Source links in alerts resolve.
vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
# --web.external-url=http://192.168.146.18:9090 is the newly added flag
ExecStart=/home/software/prometheus-2.33.1/prometheus --config.file=/home/software/prometheus-2.33.1/prometheus.yml --storage.tsdb.path=/home/software/prometheus-2.33.1/data --web.external-url=http://192.168.146.18:9090
Restart=on-failure
[Install]
WantedBy=multi-user.target
Edit the default template:
vim template/default.tmpl
{{ define "__alertmanager" }}Alertmanager{{ end }}
{{ define "__alertmanagerURL" }}http://192.168.146.18:9093/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }} # change this to your server address; note the port is Alertmanager's (9093)
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__description" }}{{ end }}
{{ define "__text_alert_list" }}{{ range . }}Labels:
{{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Annotations:
{{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}Source: {{ .GeneratorURL }} # with the --web.external-url change to the Prometheus service above, this link now works
{{ end }}{{ end }}
{{ define "slack.default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "slack.default.username" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "slack.default.fallback" }}{{ template "slack.default.title" . }} | {{ template "slack.default.titlelink" . }}{{ end }}
{{ define "slack.default.callbackid" }}{{ end }}
{{ define "slack.default.pretext" }}{{ end }}
{{ define "slack.default.titlelink" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "slack.default.iconemoji" }}{{ end }}
{{ define "slack.default.iconurl" }}{{ end }}
{{ define "slack.default.text" }}{{ end }}
{{ define "slack.default.footer" }}{{ end }}
{{ define "pagerduty.default.description" }}{{ template "__subject" . }}{{ end }}
{{ define "pagerduty.default.client" }}{{ template "__alertmanager" . }}{{ end }}
{{ define "pagerduty.default.clientURL" }}{{ template "__alertmanagerURL" . }}{{ end }}
{{ define "pagerduty.default.instances" }}{{ template "__text_alert_list" . }}{{ end }}
5. Test
Restart Prometheus and Alertmanager:
systemctl restart prometheus
systemctl restart alertmanager
Stop one node_exporter:
systemctl stop node_exporter.service
Results:
Start the node again:
systemctl start node_exporter.service
The previous alert is gone.
In the first email all three alerts are firing; in the second, the last one is resolved.
Grafana log collection setup: https://blog.csdn.net/qq_43801592/article/details/123984041?spm=1001.2014.3001.5502
Grafana OAuth2 integration: https://blog.csdn.net/qq_43801592/article/details/123062161?spm=1001.2014.3001.5502