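Background

The monitoring and alerting system has always been a top priority in operations work. No single tool fits every situation, and technology choices are no exception. Given the company's current circumstances, Nightingale, a popular open-source monitoring system developed in China, was the natural choice.

Most of the company's services currently run in IDC data centers, of which there are several, and some run on cloud services. Services are deployed on bare hosts, in Docker, and in Kubernetes (K8S) clusters, so the environment is complex and varied.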

Deployment environment

Operating system: CentOS 7.9

MySQL: 8

Redis: 6.2

Nightingale: 6.5.0

Nginx: 1.25.0

VictoriaMetrics: v1.79.12

Categraf: v0.3.38

ibex: v0.5.0

Server-side data path: /data

Categraf deployment path: /usr/local/categraf

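Architecture diagram

Data center A is the central node. It hosts the Nightingale main program and the MySQL and Redis instances it depends on. Categraf is used as the agent, and VictoriaMetrics is used for storage throughout. Categraf instances in data center A upload monitoring data directly through the Nightingale main program's API.

Data center B is an edge node built from categraf + n9e + VictoriaMetrics. It can alert independently, and its alert information is aggregated at the central node.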

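Central node - data center A

categraf: collects metrics and uploads them to n9e. The write path is categraf --> Nginx proxy --> n9e --> victoriametrics. The ibex agent has also been merged into categraf, so categraf supports remote script execution as well; the execution request path is n9e --> ibex server --> categraf.

n9e: the Nightingale main program. It serves the web UI, and its core dependencies are mysql and redis.

victoriametrics: time-series storage for the monitoring metrics. It is Prometheus-compatible and performs considerably better than Prometheus.

ibex: provides remote script execution, exposed as an API.

Nginx: reverse proxy for n9e; clients reach n9e through a domain name.

Core service deployment list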

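| IP address | Spec | Service | Version | Deployment | Restart on failure |
| 10.20.18.5 | 32C/64G/500G | Mysql | 8 | Docker | Supported |
| 10.20.18.5 | 32C/64G/500G | Redis | 6.2 | Docker | Supported |
| 10.20.18.5 | 32C/64G/500G | ibex | v0.5.0 | Docker | Supported |
| 10.20.18.5 | 32C/64G/500G | nightingale | 6.5.0 | Docker | Supported |
| 10.20.18.5 | 32C/64G/500G | victoriametrics | v1.79.12 | Docker | Supported |
| 10.20.18.6 | 8C/16G/500G | Nginx | 1.25.0-alpine | Docker | Supported |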
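With this many services spread over two hosts, a quick reachability check can be handy after deployment. The following is a minimal sketch, not part of the original setup: the ports for MySQL, Redis, n9e and ibex are the ones that appear in the configs in this post, while 8428 for VictoriaMetrics is its default HTTP port and is an assumption here. The names `CORE_SERVICES`, `tcp_probe` and `check_services` are made up for illustration.

```python
import socket
from typing import Callable, Dict, Tuple

# Host/port pairs as they appear in this post's configs.
# Assumption: VictoriaMetrics listens on its default HTTP port 8428.
CORE_SERVICES: Dict[str, Tuple[str, int]] = {
    "mysql": ("10.20.18.5", 3306),
    "redis": ("10.20.18.5", 6379),
    "n9e": ("10.20.18.5", 17000),
    "ibex": ("10.20.18.5", 20090),
    "victoriametrics": ("10.20.18.5", 8428),  # assumed default port
    "nginx": ("10.20.18.6", 80),
}


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(
    services: Dict[str, Tuple[str, int]],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Dict[str, bool]:
    """Probe every service and return a name -> reachable mapping."""
    return {name: probe(host, port) for name, (host, port) in services.items()}
```

Calling `check_services(CORE_SERVICES)` from a machine inside the data center network returns a dictionary such as `{"mysql": True, ...}`. The `probe` parameter exists so that the check can be swapped out, for example in tests.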

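Edge node - data center B

n9e-edge: communicates with the main n9e program and mainly handles alerting and data upload.

Installation and deployment

Server

Central node docker-compose.yaml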

version: "3.3"

services:
  mysql:
    image: "mysql:8"
    container_name: mysql
    hostname: mysql
    restart: always
    environment:
      TZ: Asia/Shanghai
      MYSQL_ROOT_PASSWORD: 672ANVJf
    volumes:
      - /data/mysql/data:/var/lib/mysql/
      - /data/mysql/conf/my.cnf:/etc/my.cnf
    network_mode: host

  redis:
    image: "redis:6.2"
    container_name: redis
    hostname: redis
    restart: always
    volumes:
      - /data/redis/data:/data
    environment:
      TZ: Asia/Shanghai
    command: redis-server --appendonly yes --requirepass 672ANVJf
    network_mode: host

  n9e:
    image: flashcatcloud/nightingale:6.5.0
    container_name: n9e
    hostname: n9e
    restart: always
    environment:
      GIN_MODE: release
      TZ: Asia/Shanghai
      WAIT_HOSTS: 10.20.18.5:3306, 10.20.18.5:6379
    volumes:
      - /data/n9e/conf:/app/etc
    network_mode: host
    depends_on:
      - mysql
      - redis
    command: >
      sh -c "/app/n9e"
  ibex:
    image: flashcatcloud/ibex:v0.5.0
    container_name: ibex
    hostname: ibex
    restart: always
    environment:
      GIN_MODE: release
      TZ: Asia/Shanghai
      WAIT_HOSTS: 10.20.18.5:3306
    volumes:
      - /data/ibex:/app/etc
    network_mode: host
    depends_on:
      - mysql
    command: >
      sh -c "/app/ibex server"
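Note that `depends_on` in this short form only controls the order in which containers are started; it does not wait for MySQL or Redis to accept connections. `WAIT_HOSTS` lists the addresses that should be reachable first, but whether the image acts on it is not shown in this post. As a rough illustration of what such a wait step does, here is a minimal sketch; `parse_wait_hosts` and `wait_for_hosts` are hypothetical helper names, not part of Nightingale.

```python
import socket
import time
from typing import List, Tuple


def parse_wait_hosts(value: str) -> List[Tuple[str, int]]:
    """Split a 'host:port, host:port' string into (host, port) tuples."""
    targets = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        targets.append((host, int(port)))
    return targets


def wait_for_hosts(value: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll all targets until every one accepts a TCP connection.

    Returns True when all targets are reachable, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    pending = parse_wait_hosts(value)
    while pending:
        still_down = []
        for host, port in pending:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    pass
            except OSError:
                still_down.append((host, port))
        pending = still_down
        if pending:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
    return True
```

For the n9e service above, `wait_for_hosts("10.20.18.5:3306, 10.20.18.5:6379")` would block until both MySQL and Redis accept connections or 60 seconds pass.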

Victoriametrics docker-compose.yaml

version: "3.3"

services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.79.12
    container_name: victoriametrics
    hostname: victoriametrics
    restart: always
    volumes:
      - /data/victoriametrics/data:/victoria-metrics-data
    environment:
      TZ: Asia/Shanghai
    network_mode: host
    command:
      - "--loggerTimezone=Asia/Shanghai"
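Because VictoriaMetrics is Prometheus-compatible, the stored metrics can be read back through the Prometheus query API, which is a simple way to confirm that the write path works end to end. The sketch below assumes VictoriaMetrics listens on its default port 8428 on 10.20.18.5 (the compose file above does not override it); `VM_BASE_URL`, `build_query_url` and `instant_query` are illustrative names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: VictoriaMetrics on 10.20.18.5 with its default HTTP port 8428.
VM_BASE_URL = "http://10.20.18.5:8428"


def build_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run an instant query and return the list of result series."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, `instant_query(VM_BASE_URL, "cpu_usage_active")` should return one series per reporting host, assuming Categraf's cpu input is enabled and has already written data.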

Nginx

Nginx docker-compose.yaml

version: '3.3'
services:
  nginx:
    image: nginx:1.25.0-alpine
    restart: always
    hostname: nginx
    container_name: nginx
    privileged: true
    ports:
      - 80:80
      - 443:443
    volumes:
      - /data/nginx/conf/:/etc/nginx/                   # configuration files
      - /data/nginx/cert/:/etc/cert/                    # SSL certificates
      - /data/nginx/logs/:/var/log/nginx/               # log files

n9e.conf configuration

upstream n9e {
  server 10.20.18.5:17000;
}

server {
    listen 80;
    server_name n9e.5i5j.com;
    location / {
        proxy_pass http://n9e;
        access_log /var/log/nginx/n9e.5i5j.com.access.log;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
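The two `proxy_set_header` lines let n9e see the real client address instead of the proxy's. `$remote_addr` is the address of the peer that connected to Nginx, and `$proxy_add_x_forwarded_for` is the incoming `X-Forwarded-For` header with `$remote_addr` appended (or just `$remote_addr` if the header is absent). The following sketch mimics that behavior in Python purely for illustration; the function names and sample addresses are hypothetical.

```python
from typing import Dict, Optional


def proxy_add_x_forwarded_for(incoming: Optional[str], remote_addr: str) -> str:
    """Mimic Nginx's $proxy_add_x_forwarded_for variable."""
    if not incoming:
        # No X-Forwarded-For on the request: the value is just the peer address.
        return remote_addr
    # Otherwise the peer address is appended, separated by a comma and a space.
    return f"{incoming}, {remote_addr}"


def forwarded_headers(incoming_xff: Optional[str], remote_addr: str) -> Dict[str, str]:
    """Headers that the location block above would pass on to n9e."""
    return {
        "X-Real-IP": remote_addr,
        "X-Forwarded-For": proxy_add_x_forwarded_for(incoming_xff, remote_addr),
    }
```

A Categraf host at 10.20.30.40 that connects directly would therefore reach n9e with both headers set to `10.20.30.40`.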

N9e-edge

There is no official Docker image for n9e-edge, so you need to download the release package and build the image yourself.

n9e-edge Dockerfile

FROM ubuntu:21.04
WORKDIR /app
ADD n9e-edge /app
ADD edge /app/edge
RUN chmod +x n9e-edge
EXPOSE  19000
CMD ["./n9e-edge","--configs","edge"]

n9e-edge docker-compose.yaml

version: "3.3"

services:
  n9e-edge:
    image: n9e-edge:6.5.0
    container_name: n9e-edge
    hostname: n9e-edge
    restart: always
    volumes:
      - /data/n9e-edge/edge:/app/edge
    environment:
      TZ: Asia/Shanghai
    network_mode: host

Categraf

Installation package

categraf-v0.3.38-linux-amd64.tar.gz

Deployment path

/usr/local/categraf

config.toml

[global]
# whether print configs
print_configs = false

# add label(agent_hostname) to series
# "" -> auto detect hostname
# "xx" -> use specified string xx
# "$hostname" -> auto detect hostname
# "$ip" -> auto detect ip
# "$hostname-$ip" -> auto detect hostname and ip to replace the vars
hostname = ""

# will not add label(agent_hostname) if true
omit_hostname = false

# global collect interval, unit: second
interval = 15

# input provider settings; optional: local / http
providers = ["local"]

# The concurrency setting controls the number of concurrent tasks spawned for each input. 
# By default, it is set to runtime.NumCPU() * 10. This setting is particularly useful when dealing
# with configurations that involve extensive instances of input like ping, net_response, or http_response.
# As multiple goroutines run simultaneously, the "ResponseTime" metric might appear larger than expected. 
# However, utilizing the concurrency setting can help mitigate this issue and optimize the response time.
concurrency = -1

[global.labels]
#region = "北京-QH"
# env = "localhost"

[log]
# file_name is the file to write logs to
file_name = "stdout"

# options below will not be work when file_name is stdout or stderr
# max_size is the maximum size in megabytes of the log file before it gets rotated. It defaults to 100 megabytes.
max_size = 100
# max_age is the maximum number of days to retain old log files based on the timestamp encoded in their filename.  
max_age = 1
# max_backups is the maximum number of old log files to retain.  
max_backups = 1
# local_time determines if the time used for formatting the timestamps in backup files is the computer's local time.  
local_time = true
# Compress determines if the rotated log files should be compressed using gzip. 
compress = false

[writer_opt]
batch = 1000
chan_size = 1000000

[[writers]]
url = "http://n9e.5i5j.com/prometheus/v1/write"

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"

[ibex]
enable = true
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["10.20.18.5:20090"]
## temp script dir
meta_dir = "./meta"

[heartbeat]
enable = true

# report os version cpu.util mem.util metadata
url = "http://n9e.5i5j.com/v1/n9e/heartbeat"

# interval, unit: s
interval = 10

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

[prometheus]
enable = false
scrape_config_file = "/path/to/in_cluster_scrape.yaml"
## log level, debug warn info error
log_level = "info"
## wal file storage path ,default ./data-agent
# wal_storage_path = "/path/to/storage"
## wal reserve time duration, default value is 2 hour
# wal_min_duration = 2
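The `hostname` option in `[global]` decides which identifier each host reports under, so it is worth being clear about how the placeholders expand. The sketch below restates the rules from the config comments in Python. It is an illustration of the documented behavior, not Categraf's actual code, and `render_hostname` and the sample values are hypothetical.

```python
def render_hostname(template: str, hostname: str, ip: str) -> str:
    """Expand the `hostname` setting as described in the config comments.

    ""               -> the detected hostname
    "xx"             -> the literal string
    "$hostname"      -> the detected hostname
    "$ip"            -> the detected IP
    "$hostname-$ip"  -> both placeholders replaced
    """
    if template == "":
        return hostname
    return template.replace("$hostname", hostname).replace("$ip", ip)
```

With a detected hostname of `web01` and IP `10.20.18.7`, the setting `"$hostname-$ip"` would yield `web01-10.20.18.7`, which can help when several data centers reuse the same hostnames.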

Startup commands

# Initialize (install the service)
/usr/local/bin/categraf --install

# Start
/usr/local/bin/categraf --start

# Stop
/usr/local/bin/categraf --stop