1. Install CUDA and the related components; see the CUDA installation prerequisite document.

    # Verify the installation
    nvidia-smi
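
    As an optional extra sanity check (a minimal sketch; the query flags below exist on recent nvidia-smi builds, and nvcc is only present if the CUDA toolkit itself is installed):

    # Print just the GPU model and driver version
    nvidia-smi --query-gpu=name,driver_version --format=csv
    # Check the CUDA toolkit version, if the toolkit is installed
    nvcc --version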
    
  2. Install the NVIDIA Container Toolkit

    2.1 Set up the NVIDIA Docker repository

    Add the repository:

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo rpm --import -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    

    2.2 Install nvidia-docker2 and restart Docker

    sudo yum install -y nvidia-docker2
    sudo systemctl restart docker
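
    Before moving on, you can confirm the package and the runtime binary actually landed on the node (a quick check, assuming the yum-based setup above):

    # Confirm the package is installed and the runtime shim is on PATH
    rpm -q nvidia-docker2
    which nvidia-container-runtime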
    
  3. Configure Docker to use the NVIDIA runtime

    Edit Docker's configuration file /etc/docker/daemon.json and make sure the NVIDIA runtime configuration below is present:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
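
    Alternatively, if the nvidia-ctk CLI that ships with recent versions of the NVIDIA Container Toolkit is present on the node, it can write the runtime entry for you instead of editing the file by hand (a sketch, assuming a recent toolkit version):

    # Adds the "nvidia" runtime entry to /etc/docker/daemon.json
    # (it does not set "default-runtime"; keep that line from the config above)
    sudo nvidia-ctk runtime configure --runtime=docker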
    
  4. Save the file and restart Docker

    sudo systemctl restart docker
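
    You can then confirm Docker picked up the new runtime (the exact wording varies by Docker version, but "nvidia" should be listed and set as the default):

    docker info | grep -i runtime
    # Expect lines similar to:
    #   Runtimes: ... nvidia runc ...
    #   Default Runtime: nvidia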
    
  5. Join the GPU node to the Kubernetes cluster and set a taint

    Join the node yourself; the exact steps differ slightly between clusters. The following command can then be used to taint the node:

    # Add the NoSchedule taint (appending a trailing "-" to this command removes the taint instead)
    kubectl taint nodes [your gpu hostname] nvidia.com/gpu:NoSchedule
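
    To confirm the taint actually took effect:

    kubectl describe node [your gpu hostname] | grep -i taints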
    
  6. Deploy the NVIDIA Device Plugin

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
    

    If the node cannot reach the internet to download that file, you can use the following manifest instead:

    # Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          containers:
          - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
            name: nvidia-device-plugin-ctr
            env:
              - name: FAIL_ON_INIT_ERROR
                value: "false"
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          volumes:
          - name: device-plugin
            hostPath:
              path: /var/lib/kubelet/device-plugins
    
    kubectl apply -f [your-yaml-file]
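
    After applying the manifest, you can wait for the DaemonSet rollout to finish before inspecting individual Pods:

    kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset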
    
  7. Check the plugin logs

    Run the following command; the result should look similar to this:

    > kubectl get pod -n kube-system -o wide| grep nvidia
    nvidia-device-plugin-daemonset-qlhkz       1/1     Running     0          4h37m   ###    <master>     <none>         <none>
    nvidia-device-plugin-daemonset-xh7h5       1/1     Running     0          4h37m   ###    <gpu-node>   <none>         <none>
    

    You can look at the logs of the two Pods separately (note that my master node has no GPU). First, the logs of the daemon Pod on the node where the GPU works:

    > kubectl logs -n kube-system nvidia-device-plugin-daemonset-xh7h5
    I0522 03:26:46.114150       1 main.go:178] Starting FS watcher.
    I0522 03:26:46.114296       1 main.go:185] Starting OS watcher.
    I0522 03:26:46.115304       1 main.go:200] Starting Plugins.
    I0522 03:26:46.115356       1 main.go:257] Loading configuration.
    I0522 03:26:46.116834       1 main.go:265] Updating config with default resource matching patterns.
    I0522 03:26:46.118077       1 main.go:276] 
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none",
        "failOnInitError": false,
        "mpsRoot": "",
        "nvidiaDriverRoot": "/",
        "gdsEnabled": false,
        "mofedEnabled": false,
        "useNodeFeatureAPI": null,
        "plugin": {
          "passDeviceSpecs": false,
          "deviceListStrategy": [
            "envvar"
          ],
          "deviceIDStrategy": "uuid",
          "cdiAnnotationPrefix": "cdi.k8s.io/",
          "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
          "containerDriverRoot": "/driver-root"
        }
      },
      "resources": {
        "gpus": [
          {
            "pattern": "*",
            "name": "nvidia.com/gpu"
          }
        ]
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    I0522 03:26:46.118102       1 main.go:279] Retrieving plugins.
    I0522 03:26:46.119598       1 factory.go:104] Detected NVML platform: found NVML library
    I0522 03:26:46.119674       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
    I0522 03:26:51.201381       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
    I0522 03:26:51.202815       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    I0522 03:26:51.211934       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
    

    You can see that it connected normally, and the final lines show the device plugin registered successfully with the kubelet.

    Now look at the logs on the master node, which has no GPU:

    > kubectl logs -n kube-system nvidia-device-plugin-daemonset-qlhkz
    I0522 03:26:20.612182       1 main.go:178] Starting FS watcher.
    I0522 03:26:20.612292       1 main.go:185] Starting OS watcher.
    I0522 03:26:20.612597       1 main.go:200] Starting Plugins.
    I0522 03:26:20.612617       1 main.go:257] Loading configuration.
    I0522 03:26:20.613028       1 main.go:265] Updating config with default resource matching patterns.
    I0522 03:26:20.613211       1 main.go:276] 
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none",
        "failOnInitError": false,
        "mpsRoot": "",
        "nvidiaDriverRoot": "/",
        "gdsEnabled": false,
        "mofedEnabled": false,
        "useNodeFeatureAPI": null,
        "plugin": {
          "passDeviceSpecs": false,
          "deviceListStrategy": [
            "envvar"
          ],
          "deviceIDStrategy": "uuid",
          "cdiAnnotationPrefix": "cdi.k8s.io/",
          "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
          "containerDriverRoot": "/driver-root"
        }
      },
      "resources": {
        "gpus": [
          {
            "pattern": "*",
            "name": "nvidia.com/gpu"
          }
        ]
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    I0522 03:26:20.613223       1 main.go:279] Retrieving plugins.
    W0522 03:26:20.613283       1 factory.go:31] No valid resources detected, creating a null CDI handler
    I0522 03:26:20.613320       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
    I0522 03:26:20.613345       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
    E0522 03:26:20.613349       1 factory.go:112] Incompatible platform detected
    E0522 03:26:20.613353       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
    E0522 03:26:20.613356       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    E0522 03:26:20.613359       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
    E0522 03:26:20.613362       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
    I0522 03:26:20.613367       1 main.go:308] No devices found. Waiting indefinitely.
    

    It correctly detected that there is no GPU on this node.
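
    As the plugin's own log message suggests, if you would rather not run the plugin on non-GPU nodes at all, you can label the GPU nodes and add a matching nodeSelector to the DaemonSet. A rough sketch (the label key nvidia.com/gpu.present is just an example, not something the plugin requires):

    # Label the GPU node (example label key/value)
    kubectl label nodes [your gpu hostname] nvidia.com/gpu.present=true
    # Restrict the device-plugin Pods to labeled nodes
    kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
      --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"nvidia.com/gpu.present":"true"}}}}}'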

  8. Check GPU resources

    Make sure the Kubernetes node exposes the GPU resources:

    kubectl describe node <your-node-name> 
    

    In the output you should see the GPUs listed under Allocatable, for example:

    ......
    Allocatable:
      cpu:                80
      ephemeral-storage:  48294789041
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             263504264Ki
      nvidia.com/gpu:     8
      pods:               110
    ......
    

    To save time, you can simply grep for gpu:

    kubectl describe node <your-node-name> | grep -i gpu
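
    If you want a cluster-wide view instead of checking one node at a time, a custom-columns query also works:

    kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"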
    
  9. Run a GPU workload to verify

    Run a Pod in the cluster with the following command. I have CUDA 12.1 installed, so swap in the image tag that matches your own CUDA version:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
      - name: gpu-container
        image: nvidia/cuda:12.1.1-runtime-ubuntu20.04
        command: ["nvidia-smi"]
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"  # 使用所有可用 GPU
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    EOF
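
    The Pod runs nvidia-smi once and then exits, so check that it reaches Completed and read its logs:

    kubectl get pod gpu-pod
    kubectl logs gpu-pod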
    

    If you see output like the following, everything is in place.

    Wed May 22 08:14:55 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.1     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 4090        Off |   00000000:1B:00.0 Off |                  Off |
    | 46%   30C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 4090        Off |   00000000:1C:00.0 Off |                  Off |
    | 44%   28C    P8             24W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA GeForce RTX 4090        Off |   00000000:1D:00.0 Off |                  Off |
    | 46%   30C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA GeForce RTX 4090        Off |   00000000:1E:00.0 Off |                  Off |
    | 46%   30C    P8             24W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   4  NVIDIA GeForce RTX 4090        Off |   00000000:3D:00.0 Off |                  Off |
    | 45%   28C    P8             12W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   5  NVIDIA GeForce RTX 4090        Off |   00000000:3F:00.0 Off |                  Off |
    | 46%   29C    P8             17W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   6  NVIDIA GeForce RTX 4090        Off |   00000000:40:00.0 Off |                  Off |
    | 46%   30C    P8             12W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   7  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
    | 46%   29C    P8              8W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    
  10. Some pitfalls and troubleshooting notes

    10.1 When installing the NVIDIA Container Toolkit, I followed the official documentation. After changing the container runtime configuration, you can run the following command directly on the GPU server to verify that containers can reach the host's GPUs:

    # pin an explicit tag (e.g. the same image used in step 9); the bare "nvidia/cuda" tag may not exist
    docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.1-runtime-ubuntu20.04 nvidia-smi
    

    The output should match the nvidia-smi output shown above.

    10.2 If all of the above looks fine, then look at the logs of the NVIDIA daemon, i.e. the containers started by the nvidia-device-plugin-daemonset described above. When I first deployed, I hit a problem where port 10250, which the kubelet communicates on, was already in use, so the NVIDIA daemon could never establish its gRPC connection; that led to restarting the cluster... and a series of other issues. Make sure the cluster is healthy before you start deploying, and refer to the list of commonly used Kubernetes ports when troubleshooting.
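
    For the port conflict specifically, a quick way to see what is listening on the kubelet port (10250 by default) is:

    # Show the process bound to 10250 and the kubelet's status
    ss -lntp | grep 10250
    systemctl status kubelet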
