Result preview

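Components installed in the container and their roles

datacenter-gpu-manager (DCGM): collects GPU monitoring statistics

dcgm-exporter: exposes the GPU metrics to external consumers

Prometheus: scrapes and stores the GPU metrics

Grafana: displays the GPU metrics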

1. Preparation

Download datacenter-gpu-manager-2.0.13-1-x86_64.rpm (requires an NVIDIA developer account) from https://developer.nvidia.com/dcgm

Download grafana-7.3.3-1.x86_64.rpm from https://grafana.com/grafana/download

Download prometheus-2.8.1.linux-amd64.tar.gz from https://prometheus.io/download/

git clone https://github.com/NVIDIA/gpu-monitoring-tools.git

Install the Go toolchain, and make sure the GPU, the GPU driver, and nvidia-docker are installed
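Before moving on, it can help to confirm that the prerequisites are actually on the PATH. The following is a minimal sketch, not part of the original setup; the helper name missing_tools and the exact tool list are assumptions for illustration.

```shell
#!/bin/sh
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools this walkthrough relies on; adjust the list to your environment.
missing=$(missing_tools git go make docker nvidia-smi)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are printed on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found."
fi
```

nvidia-smi is used here only as an indicator that the GPU driver is installed.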

2. Build and unpack the packages

Extract prometheus-2.8.1.linux-amd64.tar.gz

Copy the prometheus and promtool binaries into the Docker build folder (the folder that holds the Dockerfile)
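The extract-and-copy step can be sketched as a small shell function. This is an illustration under assumptions: the function name stage_prometheus and the build folder name in the example call are hypothetical, and the archive is assumed to unpack into a single prometheus-<version>.linux-amd64/ directory.

```shell
#!/bin/sh
# Extract a Prometheus release tarball and copy the two binaries into the
# Docker build folder. $1 = tarball path, $2 = build folder.
# The body runs in a subshell, so "exit" only ends this function.
stage_prometheus() (
    tarball=$1
    build_dir=$2
    work_dir=$(mktemp -d) || exit 1
    tar -xzf "$tarball" -C "$work_dir" &&
        mkdir -p "$build_dir" &&
        cp "$work_dir"/prometheus-*/prometheus \
           "$work_dir"/prometheus-*/promtool "$build_dir"/
    status=$?
    rm -rf "$work_dir"   # always remove the scratch directory
    exit "$status"
)

# Example call (gpu-monitoring-build is a hypothetical folder name):
# stage_prometheus prometheus-2.8.1.linux-amd64.tar.gz gpu-monitoring-build
```

Working in a temporary directory keeps the build folder free of the rest of the release archive.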

Configure a Go module proxy hosted in mainland China

go env -w GOPROXY=https://goproxy.cn

Enter the repository and build

cd gpu-monitoring-tools/

make binary

make install

Copy the dcgm-exporter binary from /usr/bin/ into the Docker build folder

3. Configuration files

3.1 dcgm-exporter configuration

Copy dcp-metrics-included.csv and default-counters.csv from /etc/dcgm-exporter into etc/dcgm-exporter/ under the Docker build folder

3.2 Prometheus configuration

Create prometheus.yml in etc/prometheus under the Docker build folder with the following content

# global config

global:

  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.

  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration

alerting:

  alertmanagers:

  - static_configs:

    - targets:

      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.

rule_files:

  # - "first_rules.yml"

  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:

# here it is the dcgm-exporter endpoint.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'gpu'

    static_configs:

    - targets: ['localhost:30112']

The targets entry ('localhost:30112') tells Prometheus to scrape the metrics that dcgm-exporter serves on local port 30112
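As a hedged sketch, the essential part of this file can also be generated from a script, which keeps the exporter port in one place. The function name write_prometheus_config is an assumption for illustration; the generated file omits the commented-out alerting and rule sections of the listing above.

```shell
#!/bin/sh
# Write a minimal prometheus.yml that scrapes one dcgm-exporter endpoint.
# $1 = output file, $2 = exporter port (30112 in this walkthrough).
write_prometheus_config() {
    cat > "$1" <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'gpu'
    static_configs:
      - targets: ['localhost:$2']
EOF
}

# Example (paths assumed to match the build folder layout):
# write_prometheus_config etc/prometheus/prometheus.yml 30112
# promtool check config etc/prometheus/prometheus.yml
```

promtool, copied from the Prometheus release earlier, can validate the file before it is baked into the image.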

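3.3 Grafana dashboard configuration

After installing Grafana locally, copy the files under /etc/grafana/ into etc/grafana/ under the Docker build folder

Edit etc/grafana/provisioning/dashboards/sample.yaml as follows so that Grafana loads the dashboard at startup instead of requiring a manual import; path is the folder from which the dashboard JSON files are loaded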

# # config file version

apiVersion: 1

providers:

 - name: 'default'

   orgId: 1

   folder: ''

   folderUid: ''

   type: file

   options:

     path: /var/lib/grafana/dashboards
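The path in this provider file has to match the folder the image later copies the dashboard JSON into. A minimal sketch of a check for that, assuming hypothetical helper names provider_path and check_provider_path and a plain text match rather than real YAML parsing:

```shell
#!/bin/sh
# Print the dashboard folder named by the "path:" key of a Grafana file
# provider YAML. This is a plain text match, not a YAML parser;
# commented-out "# path:" lines are ignored.
provider_path() {
    sed -n 's/^[[:space:]]*path:[[:space:]]*//p' "$1" | head -n 1
}

# Succeed only if the provider file points at the expected folder.
# $1 = provider YAML, $2 = folder the image copies dashboards into.
check_provider_path() {
    [ "$(provider_path "$1")" = "$2" ]
}

# Example (paths from this walkthrough):
# check_provider_path etc/grafana/provisioning/dashboards/sample.yaml /var/lib/grafana/dashboards
```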

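In the image build folder, create GPU_Dashboard.json (the file name the Dockerfile copies). It defines the dashboard's data source and panel layout; its content is as follows: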

{

  "annotations": {

    "list": [

      {

        "$$hashKey": "object:192",

        "builtIn": 1,

        "datasource": "-- Grafana --",

        "enable": true,

        "hide": true,

        "iconColor": "rgba(0, 211, 255, 1)",

        "name": "Annotations & Alerts",

        "type": "dashboard"

      }

    ]

  },

  "description": "This dashboard is to display the metrics from DCGM Exporter on a Kubernetes (1.13+) cluster",

  "editable": true,

  "gnetId": 12239,

  "graphTooltip": 0,

  "iteration": 1606283938013,

  "links": [],

  "panels": [

    {

      "collapsed": false,

      "datasource": null,

      "gridPos": {

        "h": 1,

        "w": 24,

        "x": 0,

        "y": 0

      },

      "id": 20,

      "panels": [],

      "title": "GPU Dashboard",

      "type": "row"

    },

    {

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "color": {

            "mode": "thresholds"

          },

          "custom": {},

          "links": [],

          "mappings": [

            {

              "from": "",

              "id": 1,

              "text": "",

              "to": "",

              "type": 1

            }

          ],

          "max": 100,

          "min": 0,

          "thresholds": {

            "mode": "percentage",

            "steps": [

              {

                "color": "green",

                "value": null

              },

              {

                "color": "#EAB839",

                "value": 80

              },

              {

                "color": "red",

                "value": 95

              }

            ]

          },

          "unit": "celsius"

        },

        "overrides": []

      },

      "gridPos": {

        "h": 9,

        "w": 4,

        "x": 0,

        "y": 1

      },

      "id": 12,

      "options": {

        "reduceOptions": {

          "calcs": [

            "mean"

          ],

          "fields": "",

          "values": false

        },

        "showThresholdLabels": false,

        "showThresholdMarkers": true

      },

      "pluginVersion": "7.3.3",

      "targets": [

        {

          "expr": "DCGM_FI_DEV_GPU_TEMP",

          "instant": true,

          "interval": "",

          "legendFormat": "GPU {{gpu}}",

          "refId": "A"

        }

      ],

      "timeFrom": null,

      "timeShift": null,

      "title": "GPU Core Temperature",

      "type": "gauge"

    },

    {

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "mappings": [],

          "max": 105,

          "min": 0,

          "thresholds": {

            "mode": "absolute",

            "steps": [

              {

                "color": "green",

                "value": null

              },

              {

                "color": "#EAB839",

                "value": 85

              },

              {

                "color": "red",

                "value": 95

              }

            ]

          },

          "unit": "celsius"

        },

        "overrides": []

      },

      "gridPos": {

        "h": 9,

        "w": 4,

        "x": 4,

        "y": 1

      },

      "id": 26,

      "options": {

        "reduceOptions": {

          "calcs": [

            "mean"

          ],

          "fields": "",

          "values": false

        },

        "showThresholdLabels": false,

        "showThresholdMarkers": true

      },

      "pluginVersion": "7.3.3",

      "targets": [

        {

          "expr": "DCGM_FI_DEV_MEMORY_TEMP",

          "interval": "",

          "legendFormat": "",

          "queryType": "randomWalk",

          "refId": "A"

        }

      ],

      "timeFrom": null,

      "timeShift": null,

      "title": "GPU Memory Temperature",

      "type": "gauge"

    },

    {

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "links": [],

          "mappings": [],

          "max": 100,

          "min": 1,

          "thresholds": {

            "mode": "absolute",

            "steps": [

              {

                "color": "green",

                "value": null

              },

              {

                "color": "#EAB839",

                "value": 80

              },

              {

                "color": "red",

                "value": 90

              }

            ]

          },

          "unit": "%"

        },

        "overrides": []

      },

      "gridPos": {

        "h": 9,

        "w": 4,

        "x": 8,

        "y": 1

      },

      "id": 6,

      "options": {

        "reduceOptions": {

          "calcs": [

            "mean"

          ],

          "fields": "",

          "values": false

        },

        "showThresholdLabels": false,

        "showThresholdMarkers": true

      },

      "pluginVersion": "7.3.3",

      "targets": [

        {

          "expr": "DCGM_FI_DEV_GPU_UTIL",

          "interval": "",

          "legendFormat": "GPU {{gpu}}",

          "refId": "A"

        }

      ],

      "timeFrom": null,

      "timeShift": null,

      "title": "Average GPU Utilization",

      "type": "gauge"

    },

    {

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "links": [],

          "mappings": [],

          "max": 300,

          "min": 0,

          "thresholds": {

            "mode": "absolute",

            "steps": [

              {

                "color": "green",

                "value": null

              },

              {

                "color": "#EAB839",

                "value": 200

              },

              {

                "color": "red",

                "value": 240

              }

            ]

          },

          "unit": "watt"

        },

        "overrides": []

      },

      "gridPos": {

        "h": 9,

        "w": 4,

        "x": 12,

        "y": 1

      },

      "id": 10,

      "options": {

        "reduceOptions": {

          "calcs": [

            "mean"

          ],

          "fields": "",

          "values": false

        },

        "showThresholdLabels": false,

        "showThresholdMarkers": true

      },

      "pluginVersion": "7.3.3",

      "targets": [

        {

          "expr": "DCGM_FI_DEV_POWER_USAGE",

          "interval": "",

          "legendFormat": "GPU {{gpu}}",

          "refId": "A"

        }

      ],

      "timeFrom": null,

      "timeShift": null,

      "title": "Average GPU Power Draw",

      "type": "gauge"

    },

    {

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "links": [],

          "mappings": [],

          "max": 2000,

          "min": 0,

          "thresholds": {

            "mode": "absolute",

            "steps": [

              {

                "color": "green",

                "value": null

              },

              {

                "color": "#EAB839",

                "value": 1300

              },

              {

                "color": "red",

                "value": 1400

              }

            ]

          },

          "unit": "MHz"

        },

        "overrides": []

      },

      "gridPos": {

        "h": 9,

        "w": 4,

        "x": 16,

        "y": 1

      },

      "id": 2,

      "interval": "",

      "options": {

        "reduceOptions": {

          "calcs": [

            "mean"

          ],

          "fields": "",

          "values": false

        },

        "showThresholdLabels": false,

        "showThresholdMarkers": true

      },

      "pluginVersion": "7.3.3",

      "targets": [

        {

          "expr": "DCGM_FI_DEV_SM_CLOCK",

          "format": "time_series",

          "interval": "",

          "intervalFactor": 1,

          "legendFormat": "GPU {{gpu}}",

          "refId": "A"

        }

      ],

      "timeFrom": null,

      "timeShift": null,

      "title": "GPU Core Clock",

      "type": "gauge"

    },

    {

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "mappings": [],

          "max": 5000,

          "min": 0,

          "thresholds": {

            "mode": "absolute",

            "steps": [

              {

                "color": "green",

                "value": null

              },

              {

                "color": "#EAB839",

                "value": 3000

              },

              {

                "color": "red",

                "value": 4000

              }

            ]

          },

          "unit": "MHz"

        },

        "overrides": []

      },

      "gridPos": {

        "h": 9,

        "w": 4,

        "x": 20,

        "y": 1

      },

      "id": 22,

      "options": {

        "reduceOptions": {

          "calcs": [

            "mean"

          ],

          "fields": "",

          "values": false

        },

        "showThresholdLabels": false,

        "showThresholdMarkers": true

      },

      "pluginVersion": "7.3.3",

      "targets": [

        {

          "expr": "DCGM_FI_DEV_MEM_CLOCK",

          "interval": "",

          "legendFormat": "",

          "queryType": "randomWalk",

          "refId": "A"

        }

      ],

      "timeFrom": null,

      "timeShift": null,

      "title": "GPU Memory Clock",

      "type": "gauge"

    },

    {

      "aliasColors": {},

      "bars": false,

      "dashLength": 10,

      "dashes": false,

      "datasource": "GPUPrometheus",

      "decimals": null,

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "unit": "%"

        },

        "overrides": []

      },

      "fill": 1,

      "fillGradient": 0,

      "gridPos": {

        "h": 20,

        "w": 8,

        "x": 0,

        "y": 10

      },

      "hiddenSeries": false,

      "id": 24,

      "legend": {

        "avg": false,

        "current": false,

        "max": false,

        "min": false,

        "show": false,

        "total": false,

        "values": false

      },

      "lines": true,

      "linewidth": 1,

      "nullPointMode": "null",

      "options": {

        "alertThreshold": true

      },

      "percentage": false,

      "pluginVersion": "7.3.3",

      "pointradius": 2,

      "points": false,

      "renderer": "flot",

      "seriesOverrides": [],

      "spaceLength": 10,

      "stack": false,

      "steppedLine": false,

      "targets": [

        {

          "expr": "DCGM_FI_DEV_GPU_UTIL",

          "interval": "",

          "legendFormat": "",

          "queryType": "randomWalk",

          "refId": "A"

        }

      ],

      "thresholds": [],

      "timeFrom": null,

      "timeRegions": [],

      "timeShift": null,

      "title": "GPU Utilization",

      "tooltip": {

        "shared": true,

        "sort": 0,

        "value_type": "individual"

      },

      "type": "graph",

      "xaxis": {

        "buckets": null,

        "mode": "time",

        "name": null,

        "show": true,

        "values": []

      },

      "yaxes": [

        {

          "$$hashKey": "object:82",

          "format": "%",

          "label": null,

          "logBase": 1,

          "max": "100",

          "min": "0",

          "show": true

        },

        {

          "$$hashKey": "object:83",

          "format": "short",

          "label": null,

          "logBase": 1,

          "max": null,

          "min": null,

          "show": true

        }

      ],

      "yaxis": {

        "align": false,

        "alignLevel": null

      }

    },

    {

      "aliasColors": {},

      "bars": false,

      "dashLength": 10,

      "dashes": false,

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "links": [],

          "unit": "watt"

        },

        "overrides": []

      },

      "fill": 1,

      "fillGradient": 0,

      "gridPos": {

        "h": 20,

        "w": 8,

        "x": 8,

        "y": 10

      },

      "hiddenSeries": false,

      "id": 4,

      "legend": {

        "alignAsTable": true,

        "avg": true,

        "current": true,

        "max": true,

        "min": false,

        "rightSide": true,

        "show": true,

        "total": false,

        "values": true

      },

      "lines": true,

      "linewidth": 2,

      "nullPointMode": "null",

      "options": {

        "alertThreshold": true

      },

      "percentage": false,

      "pluginVersion": "7.3.3",

      "pointradius": 2,

      "points": false,

      "renderer": "flot",

      "seriesOverrides": [],

      "spaceLength": 10,

      "stack": false,

      "steppedLine": false,

      "targets": [

        {

          "expr": "DCGM_FI_DEV_POWER_USAGE",

          "interval": "",

          "legendFormat": "GPU {{gpu}}",

          "refId": "A"

        }

      ],

      "thresholds": [],

      "timeFrom": null,

      "timeRegions": [],

      "timeShift": null,

      "title": "GPU Power Draw",

      "tooltip": {

        "shared": true,

        "sort": 0,

        "value_type": "individual"

      },

      "type": "graph",

      "xaxis": {

        "buckets": null,

        "mode": "time",

        "name": null,

        "show": true,

        "values": []

      },

      "yaxes": [

        {

          "$$hashKey": "object:159",

          "format": "watt",

          "label": null,

          "logBase": 1,

          "max": "300",

          "min": "0",

          "show": true

        },

        {

          "$$hashKey": "object:160",

          "format": "short",

          "label": null,

          "logBase": 1,

          "max": null,

          "min": null,

          "show": true

        }

      ],

      "yaxis": {

        "align": false,

        "alignLevel": null

      }

    },

    {

      "aliasColors": {},

      "bars": false,

      "dashLength": 10,

      "dashes": false,

      "datasource": "GPUPrometheus",

      "fieldConfig": {

        "defaults": {

          "custom": {},

          "links": [],

          "unit": "mbytes"

        },

        "overrides": []

      },

      "fill": 1,

      "fillGradient": 0,

      "gridPos": {

        "h": 20,

        "w": 8,

        "x": 16,

        "y": 10

      },

      "hiddenSeries": false,

      "id": 18,

      "legend": {

        "avg": true,

        "current": false,

        "max": true,

        "min": false,

        "rightSide": true,

        "show": true,

        "total": false,

        "values": true

      },

      "lines": true,

      "linewidth": 2,

      "nullPointMode": "null",

      "options": {

        "alertThreshold": true

      },

      "percentage": false,

      "pluginVersion": "7.3.3",

      "pointradius": 2,

      "points": false,

      "renderer": "flot",

      "seriesOverrides": [],

      "spaceLength": 10,

      "stack": false,

      "steppedLine": false,

      "targets": [

        {

          "expr": "DCGM_FI_DEV_FB_USED",

          "interval": "",

          "legendFormat": "GPU {{gpu}}",

          "refId": "A"

        }

      ],

      "thresholds": [],

      "timeFrom": null,

      "timeRegions": [],

      "timeShift": null,

      "title": "GPU Memory Used",

      "tooltip": {

        "shared": true,

        "sort": 0,

        "value_type": "individual"

      },

      "type": "graph",

      "xaxis": {

        "buckets": null,

        "mode": "time",

        "name": null,

        "show": true,

        "values": []

      },

      "yaxes": [

        {

          "$$hashKey": "object:72",

          "format": "mbytes",

          "label": null,

          "logBase": 1,

          "max": "16000",

          "min": "0",

          "show": true

        },

        {

          "$$hashKey": "object:73",

          "format": "short",

          "label": null,

          "logBase": 1,

          "max": null,

          "min": null,

          "show": true

        }

      ],

      "yaxis": {

        "align": false,

        "alignLevel": null

      }

    }

  ],

  "refresh": "30s",

  "schemaVersion": 26,

  "style": "dark",

  "tags": [],

  "templating": {

    "list": [

      {

        "allValue": null,

        "current": {

          "selected": false,

          "text": "0",

          "value": "0"

        },

        "datasource": "GPUPrometheus",

        "definition": "label_values(gpu)",

        "error": null,

        "hide": 0,

        "includeAll": false,

        "label": null,

        "multi": false,

        "name": "gpu",

        "options": [],

        "query": "label_values(gpu)",

        "refresh": 1,

        "regex": "",

        "skipUrlSync": false,

        "sort": 1,

        "tagValuesQuery": "",

        "tags": [],

        "tagsQuery": "",

        "type": "query",

        "useTags": false

      }

    ]

  },

  "time": {

    "from": "now-15m",

    "to": "now"

  },

  "timepicker": {

    "refresh_intervals": [

      "5s",

      "10s",

      "30s",

      "1m",

      "5m",

      "15m",

      "30m",

      "1h",

      "2h",

      "1d"

    ]

  },

  "timezone": "",

  "title": "GPU_Dashboard",

  "uid": "Oxdasd",

  "version": 1

}
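Every panel in this dashboard queries one DCGM metric through its "expr" key, so a panel stays empty if that metric is not enabled in the dcgm-exporter counters CSV. The following is a minimal sketch of a cross-check; the function names are hypothetical, and it assumes that the JSON has one key per line as in the listing above and that enabled CSV rows start with the metric name followed by a comma.

```shell
#!/bin/sh
# List the distinct PromQL expressions used by a Grafana dashboard JSON file.
# Plain text match on "expr" keys; assumes one key per line.
dashboard_exprs() {
    sed -n 's/^[[:space:]]*"expr":[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Print expressions that have no enabled row in a dcgm-exporter counters CSV.
# Only meaningful when every expression is a bare metric name.
# $1 = dashboard JSON, $2 = counters CSV.
missing_metrics() {
    dashboard_exprs "$1" | while IFS= read -r metric; do
        grep -q "^$metric," "$2" || printf '%s\n' "$metric"
    done
}

# Example (paths from this walkthrough):
# missing_metrics GPU_Dashboard.json etc/dcgm-exporter/default-counters.csv
```

An empty result means every metric the dashboard asks for has a matching row in the CSV.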

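3.4 Grafana data source configuration

Edit etc/grafana/provisioning/datasources/sample.yaml as follows so that Grafana loads the data source at startup instead of requiring manual setup. name is the data source name and must match the datasource value used by the dashboard panels (GPUPrometheus), type is the data source type, and url is the data source address.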

# # config file version

apiVersion: 1

# # list of datasources to insert/update depending on what's available in the database

datasources:

#   # <string, required> name of the datasource. Required

 - name: GPUPrometheus

   type: prometheus

   access: proxy

   orgId: 1

   url: http://localhost:30113

   version: 1

   editable: false
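The url above points Grafana at the Prometheus instance listening on port 30113. One way to confirm that this endpoint really returns GPU data is the Prometheus HTTP query API. A minimal sketch, assuming a hypothetical helper name prom_query_url and that curl is available where the check is run:

```shell
#!/bin/sh
# Build the Prometheus instant-query URL for one metric.
# $1 = base URL of Prometheus, $2 = metric name. No URL-encoding is done,
# so this only suits bare metric names such as DCGM_FI_DEV_GPU_TEMP.
prom_query_url() {
    printf '%s/api/v1/query?query=%s\n' "${1%/}" "$2"
}

# Inside the running container, a quick manual check could look like:
# curl -s "$(prom_query_url http://localhost:30113 DCGM_FI_DEV_GPU_TEMP)"
```

A JSON response with a non-empty result list indicates that Prometheus has scraped the metric from dcgm-exporter.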

4. Building the Docker image

4.1 Container startup script

Write the startup script entrypoint.sh with the following content

#!/bin/bash

# Start dcgm-exporter in the background, listening on port 30112
dcgm-exporter -a :30112 &

# Start Prometheus in the background with its config file and listen port
prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address=:30113 &

# Start Grafana in the foreground with its config file, PID path, log path and startup options
/usr/sbin/grafana-server \
  --config="${CONF_FILE}" \
  --pidfile="${PID_FILE_DIR}/grafana-server.pid" \
  --packaging=rpm \
  --homepath "/usr/share/grafana" \
  cfg:default.paths.logs="${LOG_DIR}" \
  cfg:default.paths.data="${DATA_DIR}" \
  cfg:default.paths.plugins="${PLUGINS_DIR}" \
  cfg:default.paths.provisioning="${PROVISIONING_CFG_DIR}"
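The script above starts two helpers in the background and keeps Grafana in the foreground, so nothing stops the helpers if Grafana exits. A minimal sketch of one way to track and stop them together follows; the names BG_PIDS, start_bg and stop_all are hypothetical, and this is an optional hardening idea rather than part of the original script.

```shell
#!/bin/sh
# Track background helper processes so they can be stopped together.
BG_PIDS=""

# Start a command in the background and remember its PID.
start_bg() {
    "$@" &
    BG_PIDS="$BG_PIDS $!"
}

# Stop every tracked process; ignore ones that already exited.
stop_all() {
    for pid in $BG_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    BG_PIDS=""
}

# In an entrypoint this could be wired up roughly as follows
# (commands taken from the startup script):
#   trap stop_all EXIT INT TERM
#   start_bg dcgm-exporter -a :30112
#   start_bg prometheus --config.file=/etc/prometheus/prometheus.yml --web.listen-address=:30113
#   /usr/sbin/grafana-server ...   # stays in the foreground
```

With the trap in place, the background processes are signalled whenever the script exits or receives INT or TERM.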

4.2 Dockerfile

The Dockerfile is as follows

FROM centos:centos7.4.1708

WORKDIR /opt/

# Copy the RPM packages and the startup script

COPY entrypoint.sh datacenter-gpu-manager-2.0.13-1-x86_64.rpm grafana-7.3.3-1.x86_64.rpm /opt/

# Copy the binaries

COPY prometheus promtool dcgm-exporter /bin/

# Install the RPM packages, then delete the RPM files

RUN yum install datacenter-gpu-manager-2.0.13-1-x86_64.rpm  grafana-7.3.3-1.x86_64.rpm -y && rm -rf datacenter-gpu-manager-2.0.13-1-x86_64.rpm grafana-7.3.3-1.x86_64.rpm && mkdir -p /var/lib/grafana/dashboards

# Set environment variables, including the GF_* options that let Grafana show the dashboard without a login

ENV GRAFANA_USER=grafana GRAFANA_GROUP=grafana GRAFANA_HOME=/usr/share/grafana LOG_DIR=/var/log/grafana DATA_DIR=/var/lib/grafana MAX_OPEN_FILES=10000 CONF_DIR=/etc/grafana CONF_FILE=/etc/grafana/grafana.ini RESTART_ON_UPGRADE=true PLUGINS_DIR=/var/lib/grafana/plugins PROVISIONING_CFG_DIR=/etc/grafana/provisioning PID_FILE_DIR=/var/run/grafana GF_AUTH_PROXY_ENABLED=true GF_AUTH_ANONYMOUS_ENABLED=true GF_SERVER_HTTP_PORT=30114 GF_USERS_DISABLE_LOGIN_FORM=true

# Copy the configuration files

COPY etc/ /etc/

# Copy the dashboard definition

COPY GPU_Dashboard.json /var/lib/grafana/dashboards

ENTRYPOINT ["/opt/entrypoint.sh"]

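At this point all configuration files and software are in place. The image build folder holds everything the Dockerfile copies: entrypoint.sh, the two RPM packages, the prometheus, promtool and dcgm-exporter binaries, the etc/ directory, GPU_Dashboard.json, and the Dockerfile itself.

Run the image build command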

docker build -t gpu-monitoring:v1.0 .
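Once the image is built, the container has to be started with GPU access and with the Grafana port published. The sketch below only prints a candidate command instead of running it. The function name run_command is hypothetical, and the --gpus all flag assumes Docker 19.03 or later with the NVIDIA container toolkit; older nvidia-docker2 setups would use --runtime=nvidia instead.

```shell
#!/bin/sh
# Print (not execute) a docker run command for the image built above.
# $1 = image tag, $2 = Grafana port (GF_SERVER_HTTP_PORT in the Dockerfile).
run_command() {
    printf 'docker run -d --gpus all -p %s:%s %s\n' "$2" "$2" "$1"
}

run_command gpu-monitoring:v1.0 30114
```

With the port published this way, the dashboard would be reachable on port 30114 of the host.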
