Nightingale Deployment Guide
Background
Monitoring and alerting have always been a top priority in operations work. Nothing has a fixed form, and technology selection follows the same principle: weighing our company's actual situation, Nightingale, a popular domestically developed monitoring system, naturally became the best choice.
Most of the company's services currently run in IDC server rooms spread across multiple data centers, alongside some cloud services. Deployments are equally varied: some services run directly on hosts, some in Docker, and some in K8S clusters.
Deployment Environment
OS: CentOS 7.9
MySQL: 8
Redis: 6.2
Nightingale: 6.5.0
Nginx: 1.25.0
VictoriaMetrics: v1.79.12
Categraf: v0.3.38
ibex: v0.5.0
Server-side data path: /data
Categraf deployment path: /usr/local/categraf
Architecture
Data-center A is the central node: it runs the main Nightingale program together with its MySQL and Redis dependencies, Categraf is the agent on every host, and VictoriaMetrics is the unified time-series store. Categraf instances in data-center A upload metrics directly through the main Nightingale program's API.
Data-center B is an edge node using a categraf + n9e-edge + VictoriaMetrics stack. It can alert independently, and its alerts are aggregated at the central node.
Central Node (Data-Center A)
categraf: collects metrics and uploads them to n9e; the write path is categraf --> Nginx proxy --> n9e --> victoriametrics. The ibex agent has also been merged into categraf, so categraf supports remote script execution as well; the execution path is n9e --> ibex server --> categraf.
n9e: the main Nightingale program, which serves the web UI; its core dependencies are MySQL and Redis.
victoriametrics: time-series storage for the metric data; Prometheus-compatible, with considerably better performance than Prometheus.
ibex: provides remote script execution, exposed as an API.
Nginx: reverse proxy in front of n9e, so clients reach it by domain name.
Core Service Deployment List
| IP | Specs | Service | Version | Deployment | Restart on failure |
| --- | --- | --- | --- | --- | --- |
| 10.20.18.5 | 32C/64G/500G | MySQL | 8 | Docker | Yes |
| 10.20.18.5 | 32C/64G/500G | Redis | 6.2 | Docker | Yes |
| 10.20.18.5 | 32C/64G/500G | ibex | v0.5.0 | Docker | Yes |
| 10.20.18.5 | 32C/64G/500G | nightingale | 6.5.0 | Docker | Yes |
| 10.20.18.5 | 32C/64G/500G | victoriametrics | v1.79.12 | Docker | Yes |
| 10.20.18.6 | 8C/16G/500G | Nginx | 1.25.0-alpine | Docker | Yes |
Edge Node (Data-Center B)
n9e-edge: communicates with the central n9e program; mainly handles alerting and data upload.
Installation and Deployment
Server
Central node docker-compose.yaml
version: "3.3"
services:
  mysql:
    image: "mysql:8"
    container_name: mysql
    hostname: mysql
    restart: always
    environment:
      TZ: Asia/Shanghai
      MYSQL_ROOT_PASSWORD: 672ANVJf
    volumes:
      - /data/mysql/data:/var/lib/mysql/
      - /data/mysql/conf/my.cnf:/etc/my.cnf
    network_mode: host
  redis:
    image: "redis:6.2"
    container_name: redis
    hostname: redis
    restart: always
    volumes:
      - /data/redis/data:/data
    environment:
      TZ: Asia/Shanghai
    command: redis-server --appendonly yes --requirepass 672ANVJf
    network_mode: host
  n9e:
    image: flashcatcloud/nightingale:6.5.0
    container_name: n9e
    hostname: n9e
    restart: always
    environment:
      GIN_MODE: release
      TZ: Asia/Shanghai
      WAIT_HOSTS: 10.20.18.5:3306, 10.20.18.5:6379
    volumes:
      - /data/n9e/conf:/app/etc
    network_mode: host
    depends_on:
      - mysql
      - redis
    command: >
      sh -c "/app/n9e"
  ibex:
    image: flashcatcloud/ibex:v0.5.0
    container_name: ibex
    hostname: ibex
    restart: always
    environment:
      GIN_MODE: release
      TZ: Asia/Shanghai
      WAIT_HOSTS: 10.20.18.5:3306
    volumes:
      - /data/ibex:/app/etc
    network_mode: host
    depends_on:
      - mysql
    command: >
      sh -c "/app/ibex server"
VictoriaMetrics docker-compose.yaml
version: "3.3"
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.79.12
    container_name: victoriametrics
    hostname: victoriametrics
    restart: always
    volumes:
      - /data/victoriametrics/data:/victoria-metrics-data
    environment:
      TZ: Asia/Shanghai
    network_mode: host
    command:
      - "--loggerTimezone=Asia/Shanghai"
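If disk growth is a concern, retention can be capped with VictoriaMetrics' `--retentionPeriod` flag (the default unit is months; the default is 1 month). The 12-month value below is an illustrative choice, not part of the original setup:

```yaml
    command:
      - "--loggerTimezone=Asia/Shanghai"
      # Example: keep 12 months of metric data before old blocks are dropped.
      - "--retentionPeriod=12"
```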
Nginx
Nginx docker-compose.yaml
version: '3.3'
services:
  nginx:
    image: nginx:1.25.0-alpine
    restart: always
    hostname: nginx
    container_name: nginx
    privileged: true
    ports:
      - 80:80
      - 443:443
    volumes:
      - /data/nginx/conf/:/etc/nginx/     # config files
      - /data/nginx/cert/:/etc/cert/      # SSL certificates
      - /data/nginx/logs/:/var/log/nginx/ # log files
n9e.conf configuration
upstream n9e {
    server 10.20.18.5:17000;
}

server {
    listen 80;
    server_name n9e.5i5j.com;

    location / {
        proxy_pass http://n9e;
        access_log /var/log/nginx/n9e.5i5j.com.access.log;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
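The compose file above publishes port 443 and mounts certificates at /etc/cert/, but the configuration shown only listens on 80. A matching TLS server block could look like the following sketch; the certificate file names are placeholders, not taken from the original setup:

```nginx
server {
    listen 443 ssl;
    server_name n9e.5i5j.com;

    # Paths follow the /data/nginx/cert/ -> /etc/cert/ volume mount above;
    # the file names themselves are hypothetical.
    ssl_certificate     /etc/cert/n9e.5i5j.com.pem;
    ssl_certificate_key /etc/cert/n9e.5i5j.com.key;

    location / {
        proxy_pass http://n9e;
        access_log /var/log/nginx/n9e.5i5j.com.access.log;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```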
n9e-edge
There is no official n9e-edge Docker image, so you need to download the release package and build one yourself (e.g. `docker build -t n9e-edge:6.5.0 .` from a directory containing the Dockerfile below alongside the n9e-edge binary and its edge config directory).
n9e-edge Dockerfile
FROM ubuntu:21.04
WORKDIR /app
ADD n9e-edge /app
ADD edge /app/edge
RUN chmod +x n9e-edge
EXPOSE 19000
CMD ["./n9e-edge","--configs","edge"]
n9e-edge docker-compose.yaml
version: "3.3"
services:
  n9e-edge:
    image: n9e-edge:6.5.0
    container_name: n9e-edge
    hostname: n9e-edge
    restart: always
    volumes:
      - /data/n9e-edge/edge:/app/edge
    environment:
      TZ: Asia/Shanghai
    network_mode: host
Categraf
Installation package
categraf-v0.3.38-linux-amd64.tar.gz
Deployment path
/usr/local/categraf
config.toml
[global]
# whether print configs
print_configs = false
# add label(agent_hostname) to series
# "" -> auto detect hostname
# "xx" -> use specified string xx
# "$hostname" -> auto detect hostname
# "$ip" -> auto detect ip
# "$hostname-$ip" -> auto detect hostname and ip to replace the vars
hostname = ""
# will not add label(agent_hostname) if true
omit_hostname = false
# global collect interval, unit: second
interval = 15
# input provider settings; optional: local / http
providers = ["local"]
# The concurrency setting controls the number of concurrent tasks spawned for each input.
# By default, it is set to runtime.NumCPU() * 10. This setting is particularly useful when dealing
# with configurations that involve extensive instances of input like ping, net_response, or http_response.
# As multiple goroutines run simultaneously, the "ResponseTime" metric might appear larger than expected.
# However, utilizing the concurrency setting can help mitigate this issue and optimize the response time.
concurrency = -1
[global.labels]
#region = "北京-QH"
# env = "localhost"
[log]
# file_name is the file to write logs to
file_name = "stdout"
# the options below will not work when file_name is stdout or stderr
# max_size is the maximum size in megabytes of the log file before it gets rotated. It defaults to 100 megabytes.
max_size = 100
# max_age is the maximum number of days to retain old log files based on the timestamp encoded in their filename.
max_age = 1
# max_backups is the maximum number of old log files to retain.
max_backups = 1
# local_time determines if the time used for formatting the timestamps in backup files is the computer's local time.
local_time = true
# Compress determines if the rotated log files should be compressed using gzip.
compress = false
[writer_opt]
batch = 1000
chan_size = 1000000
[[writers]]
url = "http://n9e.5i5j.com/prometheus/v1/write"
# Basic auth username
basic_auth_user = ""
# Basic auth password
basic_auth_pass = ""
## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]
# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"
[ibex]
enable = true
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["10.20.18.5:20090"]
## temp script dir
meta_dir = "./meta"
[heartbeat]
enable = true
# report os version cpu.util mem.util metadata
url = "http://n9e.5i5j.com/v1/n9e/heartbeat"
# interval, unit: s
interval = 10
# Basic auth username
basic_auth_user = ""
# Basic auth password
basic_auth_pass = ""
## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]
# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
[prometheus]
enable = false
scrape_config_file = "/path/to/in_cluster_scrape.yaml"
## log level, debug warn info error
log_level = "info"
## wal file storage path, default ./data-agent
# wal_storage_path = "/path/to/storage"
## wal reserve time duration, default value is 2 hour
# wal_min_duration = 2
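Before rolling the agent config out across a fleet, it is worth sanity-checking that the writer and heartbeat endpoints point at the same n9e domain behind Nginx. A minimal sketch, run here against an embedded slice of the config above so it works anywhere; in a real check, point it at /usr/local/categraf/conf/config.toml:

```shell
# Extract every endpoint origin from the config and deduplicate; a healthy
# config for this deployment should yield exactly one origin.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[[writers]]
url = "http://n9e.5i5j.com/prometheus/v1/write"

[heartbeat]
enable = true
url = "http://n9e.5i5j.com/v1/n9e/heartbeat"
EOF

grep -o 'http://[^/"]*' "$conf" | sort -u   # expect one line: http://n9e.5i5j.com
rm -f "$conf"
```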
Startup commands
# initialize (install categraf as a system service)
/usr/local/bin/categraf --install
# start
/usr/local/bin/categraf --start
# stop
/usr/local/bin/categraf --stop