1. Install CUDA and the related components; see the CUDA installation prerequisite document.

    # Verify the installation
    nvidia-smi
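
    As an optional extra sanity check (a minimal sketch; the query flags below exist on recent nvidia-smi builds, and nvcc is only present if the CUDA toolkit itself is installed):

    # Print just the GPU model and driver version
    nvidia-smi --query-gpu=name,driver_version --format=csv
    # Check the CUDA toolkit version, if the toolkit is installed
    nvcc --version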
    
  2. Install the NVIDIA Container Toolkit

    2.1 Set up the NVIDIA Docker repository

    Add the repository:

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo rpm --import -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    

    2.2 Install nvidia-docker2 and restart Docker

    sudo yum install -y nvidia-docker2
    sudo systemctl restart docker
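
    Before moving on, you can confirm the package and the runtime binary actually landed on the node (a quick check, assuming the yum-based setup above):

    # Confirm the package is installed and the runtime shim is on PATH
    rpm -q nvidia-docker2
    which nvidia-container-runtime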
    
  3. Configure Docker to use the NVIDIA runtime

    Edit Docker's configuration file /etc/docker/daemon.json and make sure the NVIDIA runtime configuration below is present:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
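
    Alternatively, if the nvidia-ctk CLI that ships with recent versions of the NVIDIA Container Toolkit is present on the node, it can write the runtime entry for you instead of editing the file by hand (a sketch, assuming a recent toolkit version):

    # Adds the "nvidia" runtime entry to /etc/docker/daemon.json
    # (it does not set "default-runtime"; keep that line from the config above)
    sudo nvidia-ctk runtime configure --runtime=docker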
    
  4. Save the file and restart Docker

    sudo systemctl restart docker
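
    You can then confirm Docker picked up the new runtime (the exact wording varies by Docker version, but "nvidia" should be listed and set as the default):

    docker info | grep -i runtime
    # Expect lines similar to:
    #   Runtimes: ... nvidia runc ...
    #   Default Runtime: nvidia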
    
  5. Join the GPU node to the Kubernetes cluster and set a taint

    Join the node yourself; the exact steps differ slightly between clusters. The following command can then be used to taint the node:

    # Add the NoSchedule taint (appending a trailing "-" to this command removes the taint instead)
    kubectl taint nodes [your gpu hostname] nvidia.com/gpu:NoSchedule
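
    To confirm the taint actually took effect:

    kubectl describe node [your gpu hostname] | grep -i taints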
    
  6. Deploy the NVIDIA Device Plugin

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
    

    If the node cannot reach the internet to download that file, you can use the following manifest instead:

    # Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          # Mark this pod as a critical add-on; when enabled, the critical add-on
          # scheduler reserves resources for critical add-on pods so that they can
          # be rescheduled after a failure.
          # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
          priorityClassName: "system-node-critical"
          containers:
          - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
            name: nvidia-device-plugin-ctr
            env:
              - name: FAIL_ON_INIT_ERROR
                value: "false"
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          volumes:
          - name: device-plugin
            hostPath:
              path: /var/lib/kubelet/device-plugins
    
    kubectl apply -f [your-yaml-file]
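
    After applying the manifest, you can wait for the DaemonSet rollout to finish before inspecting individual Pods:

    kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset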
    
  7. Check the plugin logs

    Run the following command; the result should look similar to this:

    > kubectl get pod -n kube-system -o wide| grep nvidia
    nvidia-device-plugin-daemonset-qlhkz       1/1     Running     0          4h37m   ###    <master>     <none>         <none>
    nvidia-device-plugin-daemonset-xh7h5       1/1     Running     0          4h37m   ###    <gpu-node>   <none>         <none>
    

    You can look at the logs of the two Pods separately (note that my master node has no GPU). First, the logs of the daemon Pod on the node where the GPU works:

    > kubectl logs -n kube-system nvidia-device-plugin-daemonset-xh7h5
    I0522 03:26:46.114150       1 main.go:178] Starting FS watcher.
    I0522 03:26:46.114296       1 main.go:185] Starting OS watcher.
    I0522 03:26:46.115304       1 main.go:200] Starting Plugins.
    I0522 03:26:46.115356       1 main.go:257] Loading configuration.
    I0522 03:26:46.116834       1 main.go:265] Updating config with default resource matching patterns.
    I0522 03:26:46.118077       1 main.go:276] 
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none",
        "failOnInitError": false,
        "mpsRoot": "",
        "nvidiaDriverRoot": "/",
        "gdsEnabled": false,
        "mofedEnabled": false,
        "useNodeFeatureAPI": null,
        "plugin": {
          "passDeviceSpecs": false,
          "deviceListStrategy": [
            "envvar"
          ],
          "deviceIDStrategy": "uuid",
          "cdiAnnotationPrefix": "cdi.k8s.io/",
          "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
          "containerDriverRoot": "/driver-root"
        }
      },
      "resources": {
        "gpus": [
          {
            "pattern": "*",
            "name": "nvidia.com/gpu"
          }
        ]
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    I0522 03:26:46.118102       1 main.go:279] Retrieving plugins.
    I0522 03:26:46.119598       1 factory.go:104] Detected NVML platform: found NVML library
    I0522 03:26:46.119674       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
    I0522 03:26:51.201381       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
    I0522 03:26:51.202815       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
    I0522 03:26:51.211934       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
    

    You can see that it connected normally, and the final lines show the device plugin registered successfully with the kubelet.

    Now look at the logs on the master node, which has no GPU:

    > kubectl logs -n kube-system nvidia-device-plugin-daemonset-qlhkz
    I0522 03:26:20.612182       1 main.go:178] Starting FS watcher.
    I0522 03:26:20.612292       1 main.go:185] Starting OS watcher.
    I0522 03:26:20.612597       1 main.go:200] Starting Plugins.
    I0522 03:26:20.612617       1 main.go:257] Loading configuration.
    I0522 03:26:20.613028       1 main.go:265] Updating config with default resource matching patterns.
    I0522 03:26:20.613211       1 main.go:276] 
    Running with config:
    {
      "version": "v1",
      "flags": {
        "migStrategy": "none",
        "failOnInitError": false,
        "mpsRoot": "",
        "nvidiaDriverRoot": "/",
        "gdsEnabled": false,
        "mofedEnabled": false,
        "useNodeFeatureAPI": null,
        "plugin": {
          "passDeviceSpecs": false,
          "deviceListStrategy": [
            "envvar"
          ],
          "deviceIDStrategy": "uuid",
          "cdiAnnotationPrefix": "cdi.k8s.io/",
          "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
          "containerDriverRoot": "/driver-root"
        }
      },
      "resources": {
        "gpus": [
          {
            "pattern": "*",
            "name": "nvidia.com/gpu"
          }
        ]
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
    I0522 03:26:20.613223       1 main.go:279] Retrieving plugins.
    W0522 03:26:20.613283       1 factory.go:31] No valid resources detected, creating a null CDI handler
    I0522 03:26:20.613320       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
    I0522 03:26:20.613345       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
    E0522 03:26:20.613349       1 factory.go:112] Incompatible platform detected
    E0522 03:26:20.613353       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
    E0522 03:26:20.613356       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    E0522 03:26:20.613359       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
    E0522 03:26:20.613362       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
    I0522 03:26:20.613367       1 main.go:308] No devices found. Waiting indefinitely.
    

    It correctly detected that there is no GPU on this node.
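
    As the plugin's own log message suggests, if you would rather not run the plugin on non-GPU nodes at all, you can label the GPU nodes and add a matching nodeSelector to the DaemonSet. A rough sketch (the label key nvidia.com/gpu.present is just an example, not something the plugin requires):

    # Label the GPU node (example label key/value)
    kubectl label nodes [your gpu hostname] nvidia.com/gpu.present=true
    # Restrict the device-plugin Pods to labeled nodes
    kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
      --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"nvidia.com/gpu.present":"true"}}}}}'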

  8. Check GPU resources

    Make sure the Kubernetes node exposes the GPU resources:

    kubectl describe node <your-node-name> 
    

    In the output you should see the GPUs listed under Allocatable, for example:

    ......
    Allocatable:
      cpu:                80
      ephemeral-storage:  48294789041
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             263504264Ki
      nvidia.com/gpu:     8
      pods:               110
    ......
    

    To save time, you can simply grep for gpu:

    kubectl describe node <your-node-name> | grep -i gpu
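
    If you want a cluster-wide view instead of checking one node at a time, a custom-columns query also works:

    kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"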
    
  9. Run a GPU workload to verify

    Run a Pod in the cluster with the following command. I have CUDA 12.1 installed, so swap in the image tag that matches your own CUDA version:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      containers:
      - name: gpu-container
        image: nvidia/cuda:12.1.1-runtime-ubuntu20.04
        command: ["nvidia-smi"]
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"  # 使用所有可用 GPU
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    EOF
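
    The Pod runs nvidia-smi once and then exits, so check that it reaches Completed and read its logs:

    kubectl get pod gpu-pod
    kubectl logs gpu-pod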
    

    If you see output like the following, everything is in place.

    Wed May 22 08:14:55 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.1     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 4090        Off |   00000000:1B:00.0 Off |                  Off |
    | 46%   30C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 4090        Off |   00000000:1C:00.0 Off |                  Off |
    | 44%   28C    P8             24W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA GeForce RTX 4090        Off |   00000000:1D:00.0 Off |                  Off |
    | 46%   30C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA GeForce RTX 4090        Off |   00000000:1E:00.0 Off |                  Off |
    | 46%   30C    P8             24W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   4  NVIDIA GeForce RTX 4090        Off |   00000000:3D:00.0 Off |                  Off |
    | 45%   28C    P8             12W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   5  NVIDIA GeForce RTX 4090        Off |   00000000:3F:00.0 Off |                  Off |
    | 46%   29C    P8             17W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   6  NVIDIA GeForce RTX 4090        Off |   00000000:40:00.0 Off |                  Off |
    | 46%   30C    P8             12W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   7  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
    | 46%   29C    P8              8W /  450W |       0MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    
  10. Some pitfalls and troubleshooting notes

    10.1 When installing the NVIDIA Container Toolkit, I followed the official documentation. After changing the container runtime configuration, you can run the following command directly on the GPU server to verify that containers can reach the host's GPUs:

    # pin an explicit tag (e.g. the same image used in step 9); the bare "nvidia/cuda" tag may not exist
    docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.1.1-runtime-ubuntu20.04 nvidia-smi
    

    The output should match the nvidia-smi output shown above.

    10.2 If all of the above looks fine, then look at the logs of the NVIDIA daemon, i.e. the containers started by the nvidia-device-plugin-daemonset described above. When I first deployed, I hit a problem where port 10250, which the kubelet communicates on, was already in use, so the NVIDIA daemon could never establish its gRPC connection; that led to restarting the cluster... and a series of other issues. Make sure the cluster is healthy before you start deploying, and refer to the list of commonly used Kubernetes ports when troubleshooting.
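
    For the port conflict specifically, a quick way to see what is listening on the kubelet port (10250 by default) is:

    # Show the process bound to 10250 and the kubelet's status
    ss -lntp | grep 10250
    systemctl status kubelet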
