Environment

Virtual machine: VirtualBox 6.1.14, single-core CPU, 4 GB of memory
Guest OS: CentOS Linux release 7.7.1908 (Core)
Software installed:
prometheus-2.21.0.linux-amd64.tar.gz
node_exporter-1.0.1.linux-amd64.tar.gz
alertmanager-0.21.0.linux-amd64.tar.gz
redis_exporter-v1.12.0.linux-amd64.tar.gz
rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz
grafana-7.2.1-1.x86_64.rpm
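User: root
To keep this document short, only the essential deployment steps are recorded. Each component is shown with two ways of starting it (directly, and as a systemd service). For the underlying principles and the meaning of each parameter, see the links in the References section.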

I. Installing Prometheus Server on Bare Metal

1. Download and install

# curl -OL https://github.com/prometheus/prometheus/releases/download/v2.21.0/prometheus-2.21.0.linux-amd64.tar.gz
# tar -zxvf prometheus-2.21.0.linux-amd64.tar.gz
# mkdir -p /opt/prometheus/prometheus-server
# mkdir -p /opt/logs/   ## create the log directory
# mv prometheus-2.21.0.linux-amd64/* /opt/prometheus/prometheus-server
# cd /opt/prometheus/prometheus-server/
# mkdir data

2. Configure alerting rules (a simple example; requires Node Exporter)

# mkdir /opt/prometheus/prometheus-server/rules/
# vim /opt/prometheus/prometheus-server/prometheus.yml

Add the following to prometheus.yml:

rule_files:
  - "/opt/prometheus/prometheus-server/rules/*.yml"

Create the alert rule file hoststats-alert.yml in the directory /opt/prometheus/prometheus-server/rules/:

# cat > /opt/prometheus/prometheus-server/rules/hoststats-alert.yml << 'EOF'
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
EOF

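Note: when writing the file with cat and a heredoc, quote the delimiter ('EOF') as above. With an unquoted EOF the shell expands $labels and $value, and the template variables are lost.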
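
The difference between the two heredoc forms can be checked with a small sketch. The variable `value` and the temporary files are made up for the demonstration and are not part of the deployment.

```shell
#!/bin/bash
# Demonstrates why the rule file is written with a quoted 'EOF'.
# "value" is a throwaway shell variable used only for this demo.
value="EXPANDED"
tmpdir=$(mktemp -d)

# Unquoted delimiter: the shell substitutes $value before writing.
cat > "$tmpdir/unquoted.yml" <<EOF
description: "current value: {{ $value }}"
EOF

# Quoted delimiter: the text is written literally, so Prometheus
# would receive the {{ $value }} template untouched.
cat > "$tmpdir/quoted.yml" <<'EOF'
description: "current value: {{ $value }}"
EOF

unquoted=$(cat "$tmpdir/unquoted.yml")
quoted=$(cat "$tmpdir/quoted.yml")
echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -rf "$tmpdir"
```

Only the quoted form keeps the `$value` placeholder in the file; in the unquoted form it is replaced by whatever the shell variable holds (or by nothing if the variable is unset).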

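For other configuration options, see: https://blog.csdn.net/haohaifeng002/article/details/109223574

3. Start the service

Two ways of starting are described below: directly, and as a service.
1) Start directly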
# nohup ./prometheus > /opt/logs/prometheus-9090.log 2>&1 &
The local data storage path can be changed with the --storage.tsdb.path flag (default: data/).
To change the port, start Prometheus with:
# nohup ./prometheus --web.listen-address=:9091 > /opt/logs/prometheus-9091.log 2>&1 &

2) Start as a service

a. Create the shell script
# vim /opt/prometheus/prometheus-server/prometheus.sh

#!/bin/bash
/opt/prometheus/prometheus-server/prometheus --web.enable-lifecycle --config.file=/opt/prometheus/prometheus-server/prometheus.yml --web.listen-address=:9091 &>> /opt/logs/prometheus-9091.log

b. Make the script executable
# chmod +x /opt/prometheus/prometheus-server/prometheus.sh

c. Configure the systemd unit

# cat > /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=prometheus
After=network.target 

[Service]
ExecStart=/opt/prometheus/prometheus-server/prometheus.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

d. Start the service, enable it at boot, and check its status

#  systemctl daemon-reload
#  systemctl enable prometheus
#  systemctl start prometheus
#  systemctl status prometheus
# tail -f /var/log/messages
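
Right after `systemctl start`, Prometheus may need a moment before it answers requests. The following is a minimal sketch of a polling helper, assuming `curl` is installed and using the `/-/ready` endpoint that Prometheus 2.x exposes. The function name `wait_for_ready` and the retry count are illustrative choices, not part of Prometheus.

```shell
#!/bin/bash
# Poll a URL once per second until it returns a successful HTTP status.
# Usage: wait_for_ready URL [ATTEMPTS]
wait_for_ready() {
    local url=$1 attempts=${2:-30} i
    for ((i = 1; i <= attempts; i++)); do
        # -s: silent, -f: treat HTTP errors as failure, body is discarded
        if curl -sf -o /dev/null "$url"; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep 1
    done
    echo "not ready after $attempts attempts" >&2
    return 1
}

# Example for the service defined above, which listens on port 9091:
#   wait_for_ready http://127.0.0.1:9091/-/ready
```

The same helper can be pointed at any of the exporters in this guide, since it only checks for a successful HTTP response.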

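4. Access: http://192.168.56.101:9090/graph (use port 9091 if Prometheus was started with --web.listen-address=:9091)

[Figure: the Prometheus UI after Node Exporter has been installed]
[Figure: the Alerts page]

II. Installing Node Exporter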

1. Download Node Exporter

# curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
# tar -xzf node_exporter-1.0.1.linux-amd64.tar.gz
# cd node_exporter-1.0.1.linux-amd64/
# mv node_exporter /opt/prometheus/
# cd /opt/prometheus/

2. Run Node Exporter
It can be run directly, or started as a service that is enabled at boot.

1) Run directly
# touch /opt/logs/node_exporter-9100.log
# nohup ./node_exporter > /opt/logs/node_exporter-9100.log 2>&1 &
# netstat -anplt|grep 9100

To use a different port, start it as follows:
# nohup ./node_exporter --web.listen-address=:9900 > /opt/logs/node_exporter-9900.log 2>&1 &

2) Start as a service and enable it at boot

# cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=node_exporter
After=network.target 

[Service]
ExecStart=/opt/prometheus/node_exporter --web.listen-address=:9900
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service, enable it at boot, and check its status

#  systemctl daemon-reload
#  systemctl enable node_exporter
#  systemctl start node_exporter
#  systemctl status node_exporter

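3. Access: http://192.168.56.101:9100/ (port 9900 if the exporter was started with --web.listen-address=:9900)

4. Connect Prometheus to Node Exporter
Edit prometheus.yml, the Prometheus Server configuration file. The target port must match the port Node Exporter listens on: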

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
  # scrape metrics from node exporter
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.56.101:9100']

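5. Restart Prometheus Server
Open http://192.168.56.101:9090 to reach Prometheus Server. Enter "up" in the expression box and click Execute; the result should look like this:

up{instance="127.0.0.1:9090",job="prometheus"} 1
up{instance="192.168.56.101:9100",job="node"} 1

A value of 1 means the target is healthy; 0 means it is down.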
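
The same check can be scripted against the Prometheus HTTP API (`/api/v1/query`). The sketch below counts how many targets report `up` as 0. It assumes the compact JSON that Prometheus returns and relies on `grep` rather than a JSON parser, so treat it as a quick check only; the helper name `count_down_targets` is made up for this example.

```shell
#!/bin/bash
# Count targets whose `up` sample is "0" in an instant-query response.
# Reads the JSON body of /api/v1/query?query=up on stdin.
count_down_targets() {
    # Each sample looks like "value":[<timestamp>,"<0 or 1>"]
    grep -o '"value":\[[^]]*,"0"]' | wc -l | tr -d ' '
}

# Example (address as used in this guide):
#   curl -s 'http://192.168.56.101:9090/api/v1/query?query=up' | count_down_targets
```

A result of 0 means every scraped target was reachable at query time.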

III. Installing Alertmanager on Bare Metal

1. Download and install

# curl -LO https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
# tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz
# mkdir -p /opt/prometheus/alertmanager/data/
# mv alertmanager-0.21.0.linux-amd64/*  /opt/prometheus/alertmanager/
# cd /opt/prometheus/alertmanager/

2. Enable email notifications
# vim /opt/prometheus/alertmanager/alertmanager.yml
Use the following configuration:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'        # Alibaba Cloud enterprise mail: port 25 is also offered, but in this setup only the TLS-encrypted port 465 worked
  smtp_from: 'xxx@xxx.com'
  smtp_auth_username:  'xxx@xxx.com'
  smtp_auth_password: xxxxxx
  smtp_require_tls: false                           # required in this setup with Alibaba Cloud mail

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mail-receiver'

receivers:
  - name: 'mail-receiver'
    email_configs:
      - to: xxx@xxx.com
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

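Settings can also be changed with flags when starting Alertmanager: --config.file specifies the path of the Alertmanager configuration file, and --storage.path specifies the data storage path.

Two ways of starting are recorded below: directly, and as a system service.
1) Start directly:
# nohup ./alertmanager > /opt/logs/alertmanager-9093.log 2>&1 &
To start on a different port: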
# nohup ./alertmanager --web.listen-address=:9039 > /opt/logs/alertmanager-9039.log 2>&1 &

2) Start as a system service

# cat > /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager
After=network.target 

[Service]
ExecStart=/bin/sh -c 'exec /opt/prometheus/alertmanager/alertmanager --web.listen-address=:9039 --config.file=/opt/prometheus/alertmanager/alertmanager.yml >> /opt/logs/alertmanager-9039.log 2>&1'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service, enable it at boot, and check its status

#  systemctl daemon-reload
#  systemctl enable alertmanager
#  systemctl start alertmanager
#  systemctl status alertmanager

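3) Access: http://192.168.56.101:9093 (port 9039 if Alertmanager was started with --web.listen-address=:9039 as above)

3. Connect Prometheus to Alertmanager
Edit the Prometheus configuration file prometheus.yml and add the following. The target is given as host:port, without a URL scheme: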

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.56.101:9093']

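Restart the Prometheus service. Once it is up, check at http://192.168.56.101:9090/config that the alerting configuration has taken effect.

Now drive the system CPU usage up by hand again (on a multi-core CPU, run the command from several sessions):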
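
As an alternative to a full restart, the configuration can be validated and then reloaded in place. This is a sketch under two assumptions: `promtool` is the binary shipped in the Prometheus tarball (so it sits in /opt/prometheus/prometheus-server/ after the install steps above), and Prometheus was started with --web.enable-lifecycle, as prometheus.sh does. The function name `check_and_reload` and the three variables are illustrative.

```shell
#!/bin/bash
# Paths and address follow this guide; override them via the environment
# if your layout differs.
PROMTOOL=${PROMTOOL:-/opt/prometheus/prometheus-server/promtool}
PROM_CONFIG=${PROM_CONFIG:-/opt/prometheus/prometheus-server/prometheus.yml}
PROM_URL=${PROM_URL:-http://127.0.0.1:9091}

# Validate the configuration first; reload only if validation passes.
check_and_reload() {
    if ! "$PROMTOOL" check config "$PROM_CONFIG"; then
        echo "config check failed, not reloading" >&2
        return 1
    fi
    # POST /-/reload is only accepted when --web.enable-lifecycle is set.
    curl -sf -X POST "$PROM_URL/-/reload"
}

# Example:
#   check_and_reload && echo "reloaded"
```

Checking first avoids replacing a working configuration with one that Prometheus would reject. If Prometheus was started directly with nohup and without the lifecycle flag, the reload request is refused and a restart is still needed.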
# cat /dev/zero>/dev/null
Wait for the Prometheus alert to enter the firing state.
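
On a multi-core machine, one script can replace the several manual sessions. The sketch below starts one busy loop per core and stops them after a fixed time. It assumes `timeout` and `nproc` from coreutils are available (they are on a stock CentOS 7); `burn_cpu` is a hypothetical helper name.

```shell
#!/bin/bash
# Keep N workers busy for a fixed number of seconds, then stop them.
# Usage: burn_cpu SECONDS [WORKERS]   (WORKERS defaults to the core count)
burn_cpu() {
    local seconds=$1
    local workers=${2:-$(nproc)}
    local i
    for ((i = 0; i < workers; i++)); do
        # timeout ends each reader after $seconds, so nothing is left running
        timeout "$seconds" cat /dev/zero > /dev/null &
    done
    # Block until every background worker has exited
    wait
}

# Example: load all cores for three minutes
#   burn_cpu 180
```

A few minutes of load should be enough for the CPU rule above, which uses a 5m rate window and `for: 1m`, but the exact time until the alert fires depends on the scrape and evaluation intervals.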

IV. Installing Grafana

1. Bare-metal installation

# wget https://dl.grafana.com/oss/release/grafana-7.2.1-1.x86_64.rpm
# yum install grafana-7.2.1-1.x86_64.rpm

With Docker: # docker run -d -p 3000:3000 grafana/grafana

2. Start the Grafana service, enable it at boot, and check its status

# sudo systemctl daemon-reload
# sudo systemctl start grafana-server
# sudo systemctl status grafana-server
# sudo systemctl enable grafana-server

To change the default port 3000:
# vim /etc/grafana/grafana.ini
Set http_port = 3030
Restart the service: # systemctl restart grafana-server
Check that it started: # systemctl status grafana-server

3. Connect Prometheus to Grafana
Edit prometheus.yml and add the grafana job under scrape_configs:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
  # scrape metrics from node exporter
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.56.101:9100']
  # scrape metrics from grafana
  - job_name: 'grafana'
    static_configs:
      - targets: ['192.168.56.101:3000']

Restart Prometheus:

# cd /opt/prometheus/prometheus-server/
# ps aux|grep prometheus
# kill -9 [PID of prometheus]
# nohup ./prometheus > /opt/logs/prometheus-9090.log 2>&1 &

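4. Access: http://192.168.56.101:3000 and log in with admin/admin

5. Open-source Node Exporter dashboard templates can be downloaded for reference:
https://grafana.com/grafana/dashboards?dataSource=prometheus
For example:
https://grafana.com/grafana/dashboards/8919
https://grafana.com/grafana/dashboards/11559

6. Configure the Grafana data source and other settings

V. Deploying Redis Exporter to Monitor the Redis Master on Port 6379

1. Download and install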

# curl -OL https://github.com/oliver006/redis_exporter/releases/download/v1.12.0/redis_exporter-v1.12.0.linux-amd64.tar.gz
# tar -xvf redis_exporter-v1.12.0.linux-amd64.tar.gz
# cd redis_exporter-v1.12.0.linux-amd64
# mv redis_exporter /opt/prometheus/
# cd /opt/prometheus/

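The exporter can be started directly, or configured as a systemd service that starts at boot. Redis Exporter listens on port 9121 by default.

(1) Start directly
# ./redis_exporter -redis.addr 127.0.0.1:6379 -redis.password 123456
(2) Start as a service

# cat > /etc/systemd/system/redis_exporter.service <<EOF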
[Unit]
Description=redis_exporter
After=network.target 

[Service]
ExecStart=/opt/prometheus/redis_exporter -redis.addr 127.0.0.1:6379 -redis.password 123456
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service, enable it at boot, and check its status

# systemctl daemon-reload
# systemctl start redis_exporter
# systemctl status redis_exporter
# systemctl enable redis_exporter
# systemctl list-units --type=service|grep redis

3. Configure prometheus.yml

  - job_name: redis
    static_configs:
      - targets: ['127.0.0.1:9121']
        labels:
          instance: redis_sentinel

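4. Restart Prometheus

5. Import the dashboard into Grafana, either by downloading its JSON from the page below or by entering dashboard ID 11835 in Grafana's import dialog:
https://grafana.com/grafana/dashboards/11835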

VI. Installing rabbitmq_exporter

1. Install

# curl -OL https://github.com/kbudde/rabbitmq_exporter/releases/download/v1.0.0-RC7/rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz
# tar zxvf rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz
# cd rabbitmq_exporter-1.0.0-RC7.linux-amd64
# mv rabbitmq_exporter /opt/prometheus/
# cd /opt/prometheus/

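Two ways of starting are described below: directly, and as a system service.
First create the configuration file rabbitmq_exporter_config.json in the same directory as rabbitmq_exporter.
Its content (based on https://github.com/kbudde/rabbitmq_exporter/blob/master/config.example.json) is as follows: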

{
    "rabbit_url": "http://127.0.0.1:15672",
    "rabbit_user": "guest",
    "rabbit_pass": "123456",
    "publish_port": "9099",
    "publish_addr": "",
    "output_format": "TTY",
    "ca_file": "ca.pem",
    "cert_file": "client-cert.pem",
    "key_file": "client-key.pem",
    "insecure_skip_verify": false,
    "exlude_metrics": [],
    "include_queues": ".*",
    "skip_queues": "^$",
    "skip_vhost": "^$",
    "include_vhost": ".*",
    "rabbit_capabilities": "no_sort,bert",
    "enabled_exporters": [
            "exchange",
            "node",
            "overview",
            "queue"
    ],
    "timeout": 30,
    "max_queues": 0
}

1) Start directly, passing the settings either as environment variables:
# RABBIT_USER=guest RABBIT_PASSWORD=123456 OUTPUT_FORMAT=JSON PUBLISH_PORT=9099 RABBIT_URL=http://127.0.0.1:15672 nohup ./rabbitmq_exporter > /opt/logs/rabbitmq-9099.log 2>&1 &

or through the configuration file:
# nohup ./rabbitmq_exporter -config-file rabbitmq_exporter_config.json > /opt/logs/rabbitmq-9099.log 2>&1 &

2) Start as a system service:

# cat > /etc/systemd/system/rabbitmq_exporter.service <<EOF
[Unit]
Description=rabbitmq_exporter
After=network.target 

[Service]
ExecStart=/opt/prometheus/rabbitmq_exporter -config-file /opt/prometheus/rabbitmq_exporter_config.json
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service, enable it at boot, and check its status

# systemctl daemon-reload
# systemctl start rabbitmq_exporter
# systemctl status rabbitmq_exporter
# systemctl enable rabbitmq_exporter

Check the startup logs:
# tail -f /var/log/messages
# systemctl status rabbitmq_exporter

Verify that the exporter is serving metrics:
# curl -i 127.0.0.1:9099/metrics
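
To check a single metric instead of reading the whole page, the text output can be filtered. The sketch below assumes the exporter publishes a `rabbitmq_up` gauge (1 when RabbitMQ is reachable), which is how the project's README describes it; `metric_value` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the value of the first sample of a metric from Prometheus text
# exposition format on stdin. Matches both "name 1" and "name{labels} 1".
# Usage: metric_value NAME
metric_value() {
    awk -v name="$1" '$1 == name || index($1, name "{") == 1 { print $NF; exit }'
}

# Example (port as configured in this guide):
#   curl -s 127.0.0.1:9099/metrics | metric_value rabbitmq_up
```

The comment lines beginning with `# HELP` and `# TYPE` are skipped automatically because their first field is `#`.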

2. Configure prometheus.yml

  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['127.0.0.1:9099']
        labels:
          instance: rabbitmq

3. Restart Prometheus

# ps aux|grep prometheus
# kill -9 [pid]
# nohup ./prometheus > /opt/logs/prometheus-9090.log 2>&1 &

4. Download the dashboard JSON and import it into Grafana
https://grafana.com/dashboards/2121
https://grafana.com/grafana/dashboards/4371

VII. Common Errors

Errors encountered while configuring email alerts are recorded below.

Configuration:
smtp.winchannel.net:25
Error:
level=error ts=2020-04-08T06:02:44.036Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.winchannel.net" context_err="context deadline exceeded"
level=error ts=2020-04-08T06:02:44.036Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.winchannel.net"

Configuration:
smtp.winchannel.net
smtp_require_tls: false
Error:
level=warn ts=2020-10-12T10:34:11.780Z caller=notify.go:674 component=dispatcher receiver=mail-receiver integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="*smtp.plainAuth auth: unencrypted connection"
level=error ts=2020-10-12T10:34:21.581Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail-receiver/email[0]: notify retry canceled after 7 attempts: *smtp.plainAuth auth: unencrypted connection"

Configuration:
smtp.qiye.aliyun.com:465
Error:
level=warn ts=2020-10-12T11:36:41.779Z caller=notify.go:674 component=dispatcher receiver=mail-receiver integration=email[0] msg="Notify attempt failed, will retry later" attempts=1 err="'require_tls' is true (default) but \"smtp.qiye.aliyun.com:465\" does not advertise the STARTTLS extension"
level=error ts=2020-10-12T11:36:51.578Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail-receiver/email[0]: notify retry canceled after 8 attempts: 'require_tls' is true (default) but \"smtp.qiye.aliyun.com:465\" does not advertise the STARTTLS extension"
tail: /opt/logs/alertmanager-9093.log: file truncated

With the following two settings, email is sent successfully:
smtp.qiye.aliyun.com:465
smtp_require_tls: false
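
The errors above are consistent with how the two ports are conventionally used: port 465 expects TLS from the first byte, while port 25 starts in plain text and upgrades with STARTTLS, which is what smtp_require_tls controls. Which mode a server accepts can be probed with `openssl s_client`, as in the sketch below. It assumes the `openssl` command is installed; `probe_smtp_tls` is a hypothetical helper name.

```shell
#!/bin/bash
# Open a TLS connection to an SMTP server and print the handshake details.
# Usage: probe_smtp_tls HOST PORT [implicit|starttls]
probe_smtp_tls() {
    local host=$1 port=$2 mode=${3:-implicit}
    if [ "$mode" = starttls ]; then
        # Plain-text greeting first, then upgrade via the STARTTLS command
        openssl s_client -starttls smtp -connect "$host:$port" < /dev/null
    else
        # TLS handshake immediately after connecting (typical for port 465)
        openssl s_client -connect "$host:$port" < /dev/null
    fi
}

# Examples with the hosts from this section:
#   probe_smtp_tls smtp.qiye.aliyun.com 465
#   probe_smtp_tls smtp.winchannel.net 25 starttls
```

The certificate subject printed by the handshake also shows which host names the server's certificate covers, which helps explain the x509 error in the first case.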

References:
[1]https://www.prometheus.wang/quickstart/why-monitor.html
[2]https://prometheus.io/
[3]https://github.com/prometheus/
[4]https://www.cnblogs.com/gered/p/13535212.html
[6]https://github.com/free/sql_exporter/releases/tag/0.5
[7]https://github.com/oliver006/redis_exporter/releases/tag/v1.12.0
[8]https://my.oschina.net/yugj/blog/3056695
[9]https://juejin.im/post/6844903793977458695
[10]https://github.com/a4sh3u/eureka_exporter
[11]https://github.com/Mautu/eureka_exporter
[12]https://developer.ibm.com/zh/depmodels/cloud/articles/cl-lo-prometheus-getting-started-and-practice/
[13]https://github.com/kbudde/rabbitmq_exporter
[14]https://blog.csdn.net/yaomingyang/article/details/104037083
[15]https://github.com/prometheus/alertmanager
[16]https://grafana.com/grafana/dashboards
[17]https://grafana.com/grafana/download?platform=linux
[18]https://grafana.com/docs/grafana/latest/administration/configure-docker/
[19]https://grafana.com/docs/grafana/latest/installation/rpm/#2-start-the-server
[20]https://blog.csdn.net/weixin_44723434/article/details/89237202
[21]https://grafana.com/docs/loki/latest/installation/local/
[22]https://grafana.com/docs/loki/latest/getting-started/get-logs-into-loki/
[23]https://github.com/grafana/loki/issues/2736
[24]https://grafana.com/docs/loki/latest/getting-started/troubleshooting/
[25]https://www.cnblogs.com/shhnwangjian/p/6879683.html
[26]https://blog.csdn.net/li4528503/article/details/106709682
