【Adding GPU Node Scheduling to Kubernetes】
Add a GPU node to a k8s cluster and let services use the GPU.
1. Install CUDA and related components; refer to the CUDA installation prerequisites document.
# Verify the installation
nvidia-smi
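As an optional sanity check, nvidia-smi can also print the exact GPU model and driver version, which helps when choosing a matching CUDA image tag later in this guide:
# list GPU model, driver version, and memory in CSV form
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv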
2. Install the NVIDIA Container Toolkit
2.1 Set up the NVIDIA Docker repository
Add the repository:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo rpm --import -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
2.2 Install nvidia-docker2 and restart Docker
sudo yum install -y nvidia-docker2
sudo systemctl restart docker
3. Configure Docker to use the NVIDIA runtime
Edit Docker's configuration file /etc/docker/daemon.json and make sure the NVIDIA runtime configuration is present:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
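If you have a recent NVIDIA Container Toolkit installed, its nvidia-ctk helper can add the runtime entry for you instead of hand-editing the file; a sketch, assuming nvidia-ctk is available on the node (you should still confirm "default-runtime" is set as shown above):
# add the nvidia runtime entry to /etc/docker/daemon.json
sudo nvidia-ctk runtime configure --runtime=docker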
4. Save the file and restart Docker
sudo systemctl restart docker
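To confirm Docker picked up the new configuration after the restart, check the runtimes it reports; both lines should mention nvidia:
docker info | grep -i 'default runtime'
docker info | grep -i 'runtimes'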
5. Join the GPU node to the k8s cluster and set a taint
Join the node on your own; the exact steps differ slightly between clusters. The following command adds the taint (note that appending a trailing "-" to the taint would remove it instead):
kubectl taint nodes [your gpu hostname] nvidia.com/gpu:NoSchedule
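To confirm the taint took effect (the bracketed hostname is the same placeholder as above):
kubectl describe node [your gpu hostname] | grep -i taint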
6. Deploy the NVIDIA Device Plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
If the node cannot reach the public internet to download this file, you can use the following manifest instead:
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
kubectl apply -f [your yaml file]
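After applying the manifest, you can wait for the DaemonSet rollout to finish before digging into logs:
kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset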
7. Check the plugin logs
Run the following command; you should see output similar to this:
> kubectl get pod -n kube-system -o wide | grep nvidia
nvidia-device-plugin-daemonset-qlhkz   1/1   Running   0   4h37m   ###   <master>     <none>   <none>
nvidia-device-plugin-daemonset-xh7h5   1/1   Running   0   4h37m   ###   <gpu-node>   <none>   <none>
You can look at the logs of each pod separately (note that my master has no GPU). First, the log of the daemon pod on the working GPU node:
> kubectl logs -n kube-system nvidia-device-plugin-daemonset-xh7h5
I0522 03:26:46.114150       1 main.go:178] Starting FS watcher.
I0522 03:26:46.114296       1 main.go:185] Starting OS watcher.
I0522 03:26:46.115304       1 main.go:200] Starting Plugins.
I0522 03:26:46.115356       1 main.go:257] Loading configuration.
I0522 03:26:46.116834       1 main.go:265] Updating config with default resource matching patterns.
I0522 03:26:46.118077       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0522 03:26:46.118102       1 main.go:279] Retrieving plugins.
I0522 03:26:46.119598       1 factory.go:104] Detected NVML platform: found NVML library
I0522 03:26:46.119674       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0522 03:26:51.201381       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0522 03:26:51.202815       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0522 03:26:51.211934       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
You can see it connects normally, and the later lines show the plugin registered successfully with the kubelet.
Now look at the log on the master, which has no GPU:
> kubectl logs -n kube-system nvidia-device-plugin-daemonset-qlhkz
I0522 03:26:20.612182       1 main.go:178] Starting FS watcher.
I0522 03:26:20.612292       1 main.go:185] Starting OS watcher.
I0522 03:26:20.612597       1 main.go:200] Starting Plugins.
I0522 03:26:20.612617       1 main.go:257] Loading configuration.
I0522 03:26:20.613028       1 main.go:265] Updating config with default resource matching patterns.
I0522 03:26:20.613211       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0522 03:26:20.613223       1 main.go:279] Retrieving plugins.
W0522 03:26:20.613283       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0522 03:26:20.613320       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0522 03:26:20.613345       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0522 03:26:20.613349       1 factory.go:112] Incompatible platform detected
E0522 03:26:20.613353       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0522 03:26:20.613356       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0522 03:26:20.613359       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0522 03:26:20.613362       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0522 03:26:20.613367       1 main.go:308] No devices found. Waiting indefinitely.
It correctly detects that this node has no GPU.
8. Check GPU resources
Make sure the Kubernetes node reports the GPU resources:
kubectl describe node <your-node-name>
In the output you should see the GPUs listed under Allocatable:
......
Allocatable:
  cpu:                80
  ephemeral-storage:  48294789041
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263504264Ki
  nvidia.com/gpu:     8
  pods:               110
......
To save effort, you can grep for gpu directly:
kubectl describe node <your-node-name> | grep -i gpu
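If you run several nodes, a cluster-wide summary may be handier; a sketch using kubectl custom-columns (the backslash-escaped dots are needed because the resource name itself contains dots):
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"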
9. Run a GPU workload to verify
Run a pod in the cluster with the command below. Since I installed CUDA 12.1, replace the image tag with a version that matches your CUDA installation:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:12.1.1-runtime-ubuntu20.04
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"   # use all available GPUs
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
Finally, if the pod logs show the following output, everything is working (a sketch of a pod that requests GPUs explicitly follows after the output).
Wed May 22 08:14:55 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.1     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:1B:00.0 Off |                  Off |
| 46%   30C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:1C:00.0 Off |                  Off |
| 44%   28C    P8             24W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:1D:00.0 Off |                  Off |
| 46%   30C    P8             15W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off |   00000000:1E:00.0 Off |                  Off |
| 46%   30C    P8             24W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off |   00000000:3D:00.0 Off |                  Off |
| 45%   28C    P8             12W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off |   00000000:3F:00.0 Off |                  Off |
| 46%   29C    P8             17W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off |   00000000:40:00.0 Off |                  Off |
| 46%   30C    P8             12W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
| 46%   29C    P8              8W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
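For real services you would normally let the Kubernetes scheduler account for GPUs by requesting the nvidia.com/gpu resource advertised by the device plugin, rather than exposing all devices via NVIDIA_VISIBLE_DEVICES. A minimal sketch (the pod name is illustrative; the image tag matches the one used above):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-pod        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.1.1-runtime-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1    # the scheduler assigns exactly one GPU to this pod
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule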
10. Pitfalls and troubleshooting notes
10.1 When installing the NVIDIA Container Toolkit I followed the official documentation. After changing the container runtime configuration, you can run the following command directly on the GPU server to verify that a container can use the host's GPUs:
docker run --runtime=nvidia --gpus all nvidia/cuda:12.1.1-runtime-ubuntu20.04 nvidia-smi
The output should match the nvidia-smi result shown above.
10.2 If all of the above is fine, check the logs of the NVIDIA daemon pods, i.e. the containers brought up by the nvidia-device-plugin-daemonset described earlier. Early in my deployment, port 10250 (used for kubelet communication) was occupied by another process, so the device plugin could never establish its gRPC communication, and I ended up restarting the cluster, among other issues. Make sure the cluster itself is healthy before you start deploying; the list of common k8s ports is a useful reference, and a quick check of that port is sketched below.
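A minimal sketch of that port check, assuming the ss tool from iproute2 is available on the node:
# show which process is listening on the kubelet port (10250)
sudo ss -lntp | grep ':10250'
# the kubelet should normally be the only listener here; if another process holds
# the port, the kubelet (and plugins that register with it) cannot work properly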