1. Install nvidia-container-toolkit on the GPU nodes

Install the NVIDIA driver, Docker, and nvidia-container-toolkit on every GPU node. (With Docker 19.03 and later, nvidia-container-toolkit is sufficient; the older nvidia-docker package is not required.)

2. Edit daemon.json

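Register the NVIDIA runtime in /etc/docker/daemon.json. The k8s-device-plugin README additionally asks for nvidia to be the default runtime, so that containers started by the kubelet use it; hence the default-runtime key. Restart Docker after saving the file (for example, sudo systemctl restart docker).

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}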
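Before restarting Docker it can be worth confirming that the runtime entry really is in the file. The following is a minimal sketch, not a Docker feature: has_nvidia_runtime is a hypothetical helper name, it takes the file path as its first argument, and it does a plain-text match rather than real JSON parsing, so it will not catch syntax errors.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker): succeed if the given daemon.json
# contains a runtime named "nvidia" whose path ends in nvidia-container-runtime.
# This is a text match on the file, not a JSON parser.
has_nvidia_runtime() {
  file="$1"
  # Exit status 2 means the file could not be read at all.
  [ -r "$file" ] || return 2
  # Strip whitespace so the match does not depend on indentation.
  tr -d ' \t\n' < "$file" |
    grep -q '"nvidia":{[^}]*"path":"[^"]*nvidia-container-runtime"'
}
```

A possible use is `has_nvidia_runtime /etc/docker/daemon.json && sudo systemctl restart docker`, which restarts Docker only when the entry is present. After the restart, `docker info` should list nvidia among its runtimes.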

3. Verify that the runtime works

Use the official test command:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

If the GPU information is printed, the runtime is working.
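For a scripted check, the same container can run `nvidia-smi -L`, which prints one line per device in the form "GPU <index>: <name> (UUID: ...)". The sketch below counts those lines; count_gpus is a hypothetical helper name, and it reads the command output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: count the GPUs listed by `nvidia-smi -L` (read from stdin).
count_gpus() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function's own status at success.
  grep -Ec '^GPU [0-9]+:' || true
}

# Example (needs Docker and a GPU, so it is left commented out):
# docker run --rm --runtime=nvidia --gpus all \
#   nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi -L | count_gpus
```

A count of 0 suggests that the container did not see any GPU, which usually points back at the driver or at the runtime configuration.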

4. Install the nvidia-device-plugin

Download the manifest and apply it:

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.0/nvidia-device-plugin.yml

kubectl create -f nvidia-device-plugin.yml
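If the plugin version needs to be pinned in a script, the download URL can be built from the tag. This is only a sketch: plugin_manifest_url is a hypothetical helper, and only the v0.12.0 URL comes from this article. Whether other tags keep the manifest at the same path is an assumption to verify against the repository.

```shell
#!/bin/sh
# Hypothetical helper: print the raw manifest URL for a given plugin tag.
# Defaults to v0.12.0, the version used in this article. Other tags may
# store the manifest at a different path (assumption, not verified here).
plugin_manifest_url() {
  version="${1:-v0.12.0}"
  printf 'https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/%s/nvidia-device-plugin.yml\n' "$version"
}

# Example (needs network access and a cluster, so it is left commented out):
# wget "$(plugin_manifest_url v0.12.0)"
# kubectl create -f nvidia-device-plugin.yml
```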

Contents of the yml file:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See Guaranteed Scheduling For Critical Add-On Pods | Kubernetes
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
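Before applying a downloaded manifest, it can help to confirm which container image it will pull. The sketch below prints the value of every image key it finds; manifest_images is a hypothetical helper that splits lines on whitespace and does not parse YAML, so treat it as a quick sanity check only.

```shell
#!/bin/sh
# Hypothetical helper: print the value that follows each "image:" key in a
# manifest read from stdin. Works on whitespace-separated fields, so it
# handles both "image: x" and "- image: x", but it is not a YAML parser.
manifest_images() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "image:") print $(i + 1) }'
}

# Example: manifest_images < nvidia-device-plugin.yml
```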

5. Check the status of nvidia-device-plugin-daemonset

kubectl get pod -n kube-system

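Note: only the nvidia-device-plugin-daemonset pods on GPU nodes work normally. On nodes without a GPU card, the pod's log (kubectl logs -n kube-system <pod-name>) reports an error instead.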
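The kube-system namespace usually holds many unrelated pods, so it can be convenient to narrow the listing to the plugin's pods. The sketch below keeps the header row plus any row whose pod name starts with the DaemonSet name from the manifest; plugin_pods is a hypothetical helper that reads the kubectl output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: from `kubectl get pod` output on stdin, keep the header
# line and the rows for pods created by nvidia-device-plugin-daemonset.
plugin_pods() {
  awk 'NR == 1 || $1 ~ /^nvidia-device-plugin-daemonset-/'
}

# Example (needs cluster access, so it is left commented out):
# kubectl get pod -n kube-system -o wide | plugin_pods
```

With `-o wide` the NODE column shows which node each pod runs on, which makes it easier to match a pod to a GPU or non-GPU node before reading its log.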

6. Check the GPU resources on the node

kubectl describe node zf-ai-gpu02

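Total GPU resources: shown as nvidia.com/gpu under Capacity and Allocatable.

Allocated GPU resources: shown as nvidia.com/gpu under Allocated resources.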
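The describe output is long, so the GPU figures can be pulled out with a small filter. This is a sketch under the assumption that the output keeps its usual layout, with unindented section headings such as Capacity, Allocatable and Allocated resources followed by indented resource rows. gpu_summary is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: from `kubectl describe node` output on stdin, print the
# nvidia.com/gpu figure found in each section, prefixed by the section name.
gpu_summary() {
  awk '
    # An unindented "Name:" style line starts a new section; remember its name.
    /^[A-Za-z][A-Za-z ]*:/ { section = $0; sub(/:.*/, "", section) }
    # Resource rows are indented; the first field is the resource name
    # (with a trailing colon under Capacity/Allocatable, without one in the
    # Allocated resources table), and the second field is the value.
    $1 ~ /^nvidia\.com\/gpu:?$/ { print section ": " $2 }
  '
}

# Example (needs cluster access, so it is left commented out):
# kubectl describe node zf-ai-gpu02 | gpu_summary
```

For the Allocated resources table the printed value is the Requests column. If no nvidia.com/gpu line appears at all, the device plugin on that node has probably not registered any GPU.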
