Integrating a GPU Monitoring Stack in Docker
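The result screenshot comes first; the rest will be updated once the documentation is organized.
[Screenshot of the resulting dashboard]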
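Components installed in the container and their roles
datacenter-gpu-manager (DCGM): collects GPU monitoring statistics
dcgm-exporter: exposes the GPU monitoring data to external consumers
Prometheus: scrapes (collects) the GPU monitoring data
Grafana: displays the GPU monitoring data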
1. Preparation
Download datacenter-gpu-manager-2.0.13-1-x86_64.rpm (requires an NVIDIA developer account; download from https://developer.nvidia.com/dcgm)
Download grafana-7.3.3-1.x86_64.rpm from https://grafana.com/grafana/download
Download prometheus-2.8.1.linux-amd64.tar.gz from https://prometheus.io/download/
git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
Install the Go toolchain, the GPU and its driver, and nvidia-docker
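Before moving on, it can help to confirm that the prerequisites are actually on the PATH. The following is a minimal sketch, assuming the tools are invoked as go, git, make, docker and nvidia-smi; the helper name missing_tools is made up for illustration.

```shell
#!/bin/bash
# missing_tools is a hypothetical helper: it prints every command name
# from its arguments that cannot be found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools this guide relies on; an empty result means all were found.
missing=$(missing_tools go git make docker nvidia-smi)
if [ -n "$missing" ]; then
  echo "Missing prerequisites:" $missing
else
  echo "All prerequisites found"
fi
```

This only checks that the commands exist; it does not verify versions or that the GPU driver is working.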
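2. Extract and build the packages
Extract prometheus-2.8.1.linux-amd64.tar.gz
Copy the two binaries prometheus and promtool into the Dockerfile build directory
Configure a Go module proxy that is reachable from mainland China
go env -w GOPROXY=https://goproxy.cn
Enter the repository and build
cd gpu-monitoring-tools/
make binary
make install
Copy the dcgm-exporter binary from /usr/bin/ into the Dockerfile build directory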
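3. Configuration files
3.1 dcgm-exporter configuration files
Copy dcp-metrics-included.csv and default-counters.csv from /etc/dcgm-exporter into etc/dcgm-exporter/ under the Dockerfile build directory
3.2 Prometheus configuration file
Create prometheus.yml in etc/prometheus under the Dockerfile build directory with the following content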
# global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's the dcgm-exporter endpoint.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'gpu'
    static_configs:
    - targets: ['localhost:30112']
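The targets entry (highlighted in red in the original post) tells Prometheus to scrape metrics from port 30112 on localhost, which is where dcgm-exporter listens.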
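Once dcgm-exporter is running, its /metrics endpoint on port 30112 should list the DCGM_FI_DEV_* series that Prometheus scrapes. The sketch below is one way to see which DCGM metrics are exposed. It assumes curl is available for the live check; the helper name gpu_metric_names is hypothetical, and the sample lines are made up for the offline demonstration, not real exporter output.

```shell
#!/bin/bash
# gpu_metric_names is a hypothetical helper: it reads Prometheus text
# exposition format on stdin and prints the distinct DCGM metric names.
gpu_metric_names() {
  grep -o '^DCGM_FI_[A-Z0-9_]*' | sort -u
}

# Against a running exporter (port 30112 as configured in this guide):
#   curl -s http://localhost:30112/metrics | gpu_metric_names

# Offline demonstration with made-up sample lines (not real exporter output):
printf '%s\n' \
  '# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).' \
  'DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41' \
  'DCGM_FI_DEV_GPU_TEMP{gpu="1"} 43' \
  'DCGM_FI_DEV_POWER_USAGE{gpu="0"} 58.2' | gpu_metric_names
```

If a metric used by the dashboard is absent from the live output, the counters CSV files copied in section 3.1 are the first place to look.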
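3.3 Grafana dashboard configuration files
After installing Grafana locally, copy the files under /etc/grafana/ into etc/grafana/ under the Dockerfile build directory
Edit etc/grafana/provisioning/dashboards/sample.yaml as follows. This makes Grafana load the dashboard at startup instead of requiring a manual import; path is the directory from which the JSON files are loaded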
# # config file version
apiVersion: 1

providers:
 - name: 'default'
   orgId: 1
   folder: ''
   folderUid: ''
   type: file
   options:
     path: /var/lib/grafana/dashboards
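In the image build directory, create GPU_Dashboard.json (the file name the Dockerfile copies). This file defines the dashboard's data source and panels; its content is as follows: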
{ "annotations": { "list": [ { "$$hashKey": "object:192", "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "description": "This dashboard is to display the metrics from DCGM Exporter on a Kubernetes (1.13+) cluster", "editable": true, "gnetId": 12239, "graphTooltip": 0, "iteration": 1606283938013, "links": [], "panels": [ { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 20, "panels": [], "title": "GPU仪表盘", "type": "row" }, { "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "custom": {}, "links": [], "mappings": [ { "from": "", "id": 1, "text": "", "to": "", "type": 1 } ], "max": 100, "min": 0, "thresholds": { "mode": "percentage", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 80 }, { "color": "red", "value": 95 } ] }, "unit": "celsius" }, "overrides": [] }, "gridPos": { "h": 9, "w": 4, "x": 0, "y": 1 }, "id": 12, "options": { "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": false, "showThresholdMarkers": true }, "pluginVersion": "7.3.3", "targets": [ { "expr": "DCGM_FI_DEV_GPU_TEMP", "instant": true, "interval": "", "legendFormat": "GPU {{gpu}}", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "GPU核心温度", "type": "gauge" }, { "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "max": 105, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 85 }, { "color": "red", "value": 95 } ] }, "unit": "celsius" }, "overrides": [] }, "gridPos": { "h": 9, "w": 4, "x": 4, "y": 1 }, "id": 26, "options": { "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": false, "showThresholdMarkers": true }, "pluginVersion": "7.3.3", 
"targets": [ { "expr": "DCGM_FI_DEV_MEMORY_TEMP", "interval": "", "legendFormat": "", "queryType": "randomWalk", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "GPU显存温度", "type": "gauge" }, { "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "links": [], "mappings": [], "max": 100, "min": 1, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 80 }, { "color": "red", "value": 90 } ] }, "unit": "%" }, "overrides": [] }, "gridPos": { "h": 9, "w": 4, "x": 8, "y": 1 }, "id": 6, "options": { "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": false, "showThresholdMarkers": true }, "pluginVersion": "7.3.3", "targets": [ { "expr": "DCGM_FI_DEV_GPU_UTIL", "interval": "", "legendFormat": "GPU {{gpu}}", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "GPU平均使用率", "type": "gauge" }, { "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "links": [], "mappings": [], "max": 300, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 200 }, { "color": "red", "value": 240 } ] }, "unit": "watt" }, "overrides": [] }, "gridPos": { "h": 9, "w": 4, "x": 12, "y": 1 }, "id": 10, "options": { "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": false, "showThresholdMarkers": true }, "pluginVersion": "7.3.3", "targets": [ { "expr": "DCGM_FI_DEV_POWER_USAGE", "interval": "", "legendFormat": "GPU {{gpu}}", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "GPU平均功耗", "type": "gauge" }, { "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "links": [], "mappings": [], "max": 2000, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 1300 }, { "color": "red", "value": 1400 } ] }, "unit": 
"MHz" }, "overrides": [] }, "gridPos": { "h": 9, "w": 4, "x": 16, "y": 1 }, "id": 2, "interval": "", "options": { "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": false, "showThresholdMarkers": true }, "pluginVersion": "7.3.3", "targets": [ { "expr": "DCGM_FI_DEV_SM_CLOCK", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "GPU {{gpu}}", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "GPU核心频率", "type": "gauge" }, { "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "mappings": [], "max": 5000, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "#EAB839", "value": 3000 }, { "color": "red", "value": 4000 } ] }, "unit": "MHz" }, "overrides": [] }, "gridPos": { "h": 9, "w": 4, "x": 20, "y": 1 }, "id": 22, "options": { "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "showThresholdLabels": false, "showThresholdMarkers": true }, "pluginVersion": "7.3.3", "targets": [ { "expr": "DCGM_FI_DEV_MEM_CLOCK", "interval": "", "legendFormat": "", "queryType": "randomWalk", "refId": "A" } ], "timeFrom": null, "timeShift": null, "title": "GPU显存频率", "type": "gauge" }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "GPUPrometheus", "decimals": null, "fieldConfig": { "defaults": { "custom": {}, "unit": "%" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 20, "w": 8, "x": 0, "y": 10 }, "hiddenSeries": false, "id": 24, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": false, "total": false, "values": false }, "lines": true, "linewidth": 1, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.3.3", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": 
"DCGM_FI_DEV_GPU_UTIL", "interval": "", "legendFormat": "", "queryType": "randomWalk", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "GPU使用率", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:82", "format": "%", "label": null, "logBase": 1, "max": "100", "min": "0", "show": true }, { "$$hashKey": "object:83", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "links": [], "unit": "watt" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 20, "w": 8, "x": 8, "y": 10 }, "hiddenSeries": false, "id": 4, "legend": { "alignAsTable": true, "avg": true, "current": true, "max": true, "min": false, "rightSide": true, "show": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.3.3", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "DCGM_FI_DEV_POWER_USAGE", "interval": "", "legendFormat": "GPU {{gpu}}", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "GPU功耗", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:159", "format": "watt", "label": null, "logBase": 1, "max": "300", "min": "0", "show": true }, { "$$hashKey": "object:160", "format": "short", "label": null, "logBase": 1, 
"max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } }, { "aliasColors": {}, "bars": false, "dashLength": 10, "dashes": false, "datasource": "GPUPrometheus", "fieldConfig": { "defaults": { "custom": {}, "links": [], "unit": "mbytes" }, "overrides": [] }, "fill": 1, "fillGradient": 0, "gridPos": { "h": 20, "w": 8, "x": 16, "y": 10 }, "hiddenSeries": false, "id": 18, "legend": { "avg": true, "current": false, "max": true, "min": false, "rightSide": true, "show": true, "total": false, "values": true }, "lines": true, "linewidth": 2, "nullPointMode": "null", "options": { "alertThreshold": true }, "percentage": false, "pluginVersion": "7.3.3", "pointradius": 2, "points": false, "renderer": "flot", "seriesOverrides": [], "spaceLength": 10, "stack": false, "steppedLine": false, "targets": [ { "expr": "DCGM_FI_DEV_FB_USED", "interval": "", "legendFormat": "GPU {{gpu}}", "refId": "A" } ], "thresholds": [], "timeFrom": null, "timeRegions": [], "timeShift": null, "title": "GPU显存使用大小", "tooltip": { "shared": true, "sort": 0, "value_type": "individual" }, "type": "graph", "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] }, "yaxes": [ { "$$hashKey": "object:72", "format": "mbytes", "label": null, "logBase": 1, "max": "16000", "min": "0", "show": true }, { "$$hashKey": "object:73", "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true } ], "yaxis": { "align": false, "alignLevel": null } } ], "refresh": "30s", "schemaVersion": 26, "style": "dark", "tags": [], "templating": { "list": [ { "allValue": null, "current": { "selected": false, "text": "0", "value": "0" }, "datasource": "GPUPrometheus", "definition": "label_values(gpu)", "error": null, "hide": 0, "includeAll": false, "label": null, "multi": false, "name": "gpu", "options": [], "query": "label_values(gpu)", "refresh": 1, "regex": "", "skipUrlSync": false, "sort": 1, "tagValuesQuery": "", "tags": [], "tagsQuery": "", 
"type": "query", "useTags": false } ] }, "time": { "from": "now-15m", "to": "now" }, "timepicker": { "refresh_intervals": [ "5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d" ] }, "timezone": "", "title": "GPU_Dashboard", "uid": "Oxdasd", "version": 1 }
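3.4 Grafana datasource configuration file
Edit etc/grafana/provisioning/datasources/sample.yaml as follows. This makes Grafana load the data source at startup instead of requiring manual setup; name is the data source name, type is the data source type, and url is the data source address.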
# # config file version
apiVersion: 1

# # on what's available in the database
datasources:
  # # <string, required> name of the datasource. Required
  - name: GPUPrometheus
    type: prometheus
    access: proxy
    orgId: 1
    url: http://localhost:30113
    version: 1
    editable: false
4. Building the Docker image
4.1 Container startup script
Write the startup script entrypoint.sh with the following content
#!/bin/bash
# Start dcgm-exporter in the background, listening on port 30112
dcgm-exporter -a :30112 &
# Start Prometheus in the background with its config file and listen port
prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address=:30113 &
# Start Grafana in the foreground with its config file, PID path, log path and startup options.
# exec replaces the shell, so Grafana becomes the container's main process.
exec /usr/sbin/grafana-server \
  --config="${CONF_FILE}" \
  --pidfile="${PID_FILE_DIR}/grafana-server.pid" \
  --packaging=rpm \
  --homepath "/usr/share/grafana" \
  cfg:default.paths.logs="${LOG_DIR}" \
  cfg:default.paths.data="${DATA_DIR}" \
  cfg:default.paths.plugins="${PLUGINS_DIR}" \
  cfg:default.paths.provisioning="${PROVISIONING_CFG_DIR}"
4.2 Dockerfile
The Dockerfile content is as follows
FROM centos:centos7.4.1708
WORKDIR /opt/
# Copy the RPM packages and the startup script
COPY entrypoint.sh datacenter-gpu-manager-2.0.13-1-x86_64.rpm grafana-7.3.3-1.x86_64.rpm /opt/
# Copy the binaries
COPY prometheus promtool dcgm-exporter /bin/
# Install the RPMs, remove the RPM files, and make sure the startup script is executable
RUN yum install datacenter-gpu-manager-2.0.13-1-x86_64.rpm grafana-7.3.3-1.x86_64.rpm -y && \
    rm -rf datacenter-gpu-manager-2.0.13-1-x86_64.rpm grafana-7.3.3-1.x86_64.rpm && \
    mkdir -p /var/lib/grafana/dashboards && \
    chmod +x /opt/entrypoint.sh
# Set environment variables so that Grafana allows viewing dashboards without a password
ENV GRAFANA_USER=grafana \
    GRAFANA_GROUP=grafana \
    GRAFANA_HOME=/usr/share/grafana \
    LOG_DIR=/var/log/grafana \
    DATA_DIR=/var/lib/grafana \
    MAX_OPEN_FILES=10000 \
    CONF_DIR=/etc/grafana \
    CONF_FILE=/etc/grafana/grafana.ini \
    RESTART_ON_UPGRADE=true \
    PLUGINS_DIR=/var/lib/grafana/plugins \
    PROVISIONING_CFG_DIR=/etc/grafana/provisioning \
    PID_FILE_DIR=/var/run/grafana \
    GF_AUTH_PROXY_ENABLED=true \
    GF_AUTH_ANONYMOUS_ENABLED=true \
    GF_SERVER_HTTP_PORT=30114 \
    GF_USERS_DISABLE_LOGIN_FORM=true
# Copy the configuration files
COPY etc/ /etc/
# Copy the dashboard definition
COPY GPU_Dashboard.json /var/lib/grafana/dashboards
ENTRYPOINT ["/opt/entrypoint.sh"]
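At this point all configuration files and software are ready. The image build directory must contain everything the Dockerfile copies: the Dockerfile itself, entrypoint.sh, the two RPM packages, the prometheus, promtool and dcgm-exporter binaries, GPU_Dashboard.json, and the etc/ directory.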
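A missing file only shows up as a failed COPY step partway through the build, so it can be worth checking the build directory first. The following is a minimal sketch; the helper name missing_from_context is hypothetical, and the path list is assembled from the files named in the earlier sections.

```shell
#!/bin/bash
# missing_from_context is a hypothetical helper: given a build directory,
# it prints each path used by the Dockerfile in this guide that is absent.
missing_from_context() {
  local dir="$1" path
  for path in \
    Dockerfile entrypoint.sh \
    datacenter-gpu-manager-2.0.13-1-x86_64.rpm grafana-7.3.3-1.x86_64.rpm \
    prometheus promtool dcgm-exporter \
    GPU_Dashboard.json \
    etc/prometheus/prometheus.yml \
    etc/dcgm-exporter/dcp-metrics-included.csv \
    etc/dcgm-exporter/default-counters.csv \
    etc/grafana/grafana.ini \
    etc/grafana/provisioning/dashboards/sample.yaml
  do
    [ -e "$dir/$path" ] || echo "$path"
  done
}

# Check the current directory; no output means nothing is missing.
missing_from_context .
```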
Run the image build command
docker build -t gpu-monitoring:v1.0 .
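Once the image is built, the container needs GPU access and the three ports published: 30112 for dcgm-exporter, 30113 for Prometheus and 30114 for Grafana. The sketch below only prints a candidate command rather than running it. It assumes Docker 19.03 or later with the NVIDIA container toolkit, so that --gpus all is available; with the older nvidia-docker2 runtime, --runtime=nvidia would take its place. The helper name run_command and the container name gpu-monitoring are arbitrary choices for illustration.

```shell
#!/bin/bash
# run_command is a hypothetical helper: it prints (does not execute) a
# docker run command that publishes the three ports used in this guide.
run_command() {
  local image="$1"
  echo docker run -d --name gpu-monitoring --gpus all \
    -p 30112:30112 -p 30113:30113 -p 30114:30114 \
    "$image"
}

run_command gpu-monitoring:v1.0
```

After reviewing the printed command, it can be pasted into a shell to start the container; Grafana should then be reachable on port 30114 of the host.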