Guide to Configuring GPU Resources for K8S (Original)



飞哥传说大号 · Published 2023-11-14 17:09:59

1. Install nvidia-container-toolkit on the GPU nodes

Install the NVIDIA driver, Docker, and nvidia-container-toolkit on every GPU node. From Docker 19.03 onward, nvidia-container-toolkit is sufficient and nvidia-docker does not need to be installed.
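As a quick sanity check of these prerequisites, the sketch below reports which of the expected commands are on the PATH. The helper name check_cmd is hypothetical, and the command list (nvidia-smi for the driver, docker, and nvidia-container-runtime for the toolkit) is an assumption about a typical install. It only checks presence, not versions.

```shell
#!/bin/sh
# check_cmd is a hypothetical helper: report whether a command is on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumed command names for the driver, Docker, and the container toolkit.
missing=0
for cmd in nvidia-smi docker nvidia-container-runtime; do
    check_cmd "$cmd" || missing=$((missing + 1))
done
echo "missing prerequisites: $missing"
```

A count of 0 on a GPU node suggests all three pieces are installed. Any "missing" line names the package to look at first.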

2. Modify daemon.json

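Register the NVIDIA runtime in Docker's daemon.json (by default /etc/docker/daemon.json). Setting it as the default runtime as well, as the k8s-device-plugin documentation does, lets containers started by the kubelet use it without an explicit --runtime flag. Restart Docker after saving the file.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}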
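Before moving on, it can help to confirm that the runtime entry is actually in the file Docker reads. The following is a minimal sketch, assuming the default path /etc/docker/daemon.json. The helper has_nvidia_runtime and the DOCKER_DAEMON_JSON override variable are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# has_nvidia_runtime is a hypothetical helper: it succeeds if the file
# contains the runtime path from the snippet above. This is a plain text
# match, not a full JSON parse.
has_nvidia_runtime() {
    grep -q '"path"[[:space:]]*:[[:space:]]*"/usr/bin/nvidia-container-runtime"' "$1"
}

# Assumed default location; DOCKER_DAEMON_JSON is a hypothetical override.
config="${DOCKER_DAEMON_JSON:-/etc/docker/daemon.json}"
if [ -f "$config" ] && has_nvidia_runtime "$config"; then
    echo "nvidia runtime entry found in $config"
else
    echo "nvidia runtime entry not found in $config"
fi

# Once Docker has been restarted, the daemon should list the runtime itself.
if command -v docker >/dev/null 2>&1; then
    docker info --format '{{json .Runtimes}}' || echo "could not query the Docker daemon"
fi
```

The text match only shows that the entry exists. The docker info output is the more reliable signal, because it reflects what the running daemon has loaded.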

3. Verify that the runtime works

Run the official test command below:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

If the GPU information is printed, the runtime is working.
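For a scripted version of the same check, the sketch below counts the devices that nvidia-smi -L lists inside the container. It reuses the image and flags from the test command. count_gpus is a hypothetical helper, and it assumes the usual "GPU <index>: ..." line format of nvidia-smi -L.

```shell
#!/bin/sh
# count_gpus is a hypothetical helper: it counts the "GPU <n>: ..." lines
# that `nvidia-smi -L` prints, one per device. `|| true` keeps the exit
# status at 0 when grep finds no matching line.
count_gpus() {
    grep -c '^GPU [0-9]' || true
}

if command -v docker >/dev/null 2>&1; then
    # Same image and flags as the test command, listing devices only.
    n=$(docker run --rm --runtime=nvidia --gpus all \
        nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi -L 2>/dev/null | count_gpus)
    echo "GPUs visible inside the container: $n"
else
    echo "docker not found; skipping container check"
fi
```

A count of 0 means the container could not see any device, which usually points back to the driver or the runtime configuration.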

4. Install the nvidia-device-plugin

Download the manifest and apply it:

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml

kubectl create -f nvidia-device-plugin.yml
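After the manifest is created, one way to confirm that the plugin registered the GPUs is to wait for the DaemonSet and then list each node's allocatable nvidia.com/gpu count. This is a sketch under the assumption that kubectl is configured for the cluster. gpu_nodes is a hypothetical filter, and the 120s timeout is an arbitrary choice.

```shell
#!/bin/sh
# gpu_nodes is a hypothetical filter: given "NAME GPU" rows, keep only
# nodes whose allocatable GPU count is a positive integer.
gpu_nodes() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}

if command -v kubectl >/dev/null 2>&1; then
    # Wait for the DaemonSet pods to become ready.
    kubectl -n kube-system rollout status \
        daemonset/nvidia-device-plugin-daemonset --timeout=120s \
        || echo "device plugin rollout not complete"
    # List each node with its allocatable nvidia.com/gpu count.
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
        | gpu_nodes
else
    echo "kubectl not found; skipping cluster check"
fi
```

Nodes without GPUs show `<none>` in the GPU column and are filtered out, so an empty result means no node is advertising the resource yet.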

Contents of nvidia-device-plugin.yml:

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See "Guaranteed Scheduling For Critical Add-On Pods" in the Kubernetes documentation.
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
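With the plugin running, a pod can request a GPU through the nvidia.com/gpu resource limit. The sketch below writes a minimal test pod manifest and applies it if kubectl is available. The pod name gpu-test and the file name gpu-test-pod.yaml are hypothetical, and the image is simply the one used in the Docker test earlier.

```shell
#!/bin/sh
# Write a minimal test pod that requests one GPU and runs nvidia-smi.
# The pod name and file name are hypothetical.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

if command -v kubectl >/dev/null 2>&1; then
    kubectl apply -f gpu-test-pod.yaml || echo "could not apply the test pod"
    # After the pod has completed, its log should show the nvidia-smi table:
    #   kubectl logs gpu-test
else
    echo "kubectl not found; manifest written to gpu-test-pod.yaml"
fi
```

If the pod stays in Pending, kubectl describe pod gpu-test typically shows whether the scheduler found a node with a free nvidia.com/gpu.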