Introduction

 The NVIDIA device plugin is deployed to a k8s cluster as a DaemonSet. Once deployed, it can:

  • Expose the number of GPUs on each node in the cluster
  • Keep track of GPU health
  • Run GPU containers in the k8s cluster (see the Pod sketch below)
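
As an illustration of the last point, a minimal Pod that requests one GPU through the nvidia.com/gpu resource might look like the sketch below (the CUDA image tag is only a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder CUDA base image
    command: ["nvidia-smi"]                               # print visible GPUs and exit
    resources:
      limits:
        nvidia.com/gpu: 1                                 # one GPU from the device plugin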

Prerequisites

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10
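
Before moving on, the driver and cluster versions can be sanity-checked on a GPU node; the commands below are only a quick verification sketch:

# Driver version reported by the installed NVIDIA driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Kubernetes version
kubectl version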

Quick Start

Preparing the GPU Nodes

1. Install nvidia-container-toolkit on every GPU node
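
On Debian/Ubuntu nodes this typically amounts to the following (assuming NVIDIA's package repository has already been configured per the install guide linked below):

# Install the toolkit from NVIDIA's repository
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Confirm the driver and GPUs are visible on the node
nvidia-smi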

2. Set nvidia-container-runtime as the default container runtime (Docker example below)

# cat /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "data-root": "/data/docker",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
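
After editing daemon.json, restart Docker so the new default runtime takes effect; the check below is a sketch that assumes Docker is the container runtime:

sudo systemctl restart docker
# The default runtime should now be reported as "nvidia"
docker info | grep -i "default runtime"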

Installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Enabling GPU Support in K8S

1. Deploy the nvidia-device-plugin DaemonSet

  • nvidia-device-plugin-daemonset resource manifest
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - command:
        - /usr/bin/nvidia-device-plugin
        - --config-file
        - /etc/nvidia-device-plugin/nvidia-device-plugin-config.yaml
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        - name: MIG_STRATEGY
          value: mixed
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /etc/nvidia-device-plugin
          name: nvidia-device-plugin-config
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.present: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - configMap:
          defaultMode: 420
          name: nvidia-device-plugin-config
        name: nvidia-device-plugin-config
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  • ConfigMap for the plugin (CUDA Time-Slicing)
apiVersion: v1
data:
  nvidia-device-plugin-config.yaml: |
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 30
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
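
With both manifests saved locally (the file names below are placeholders), apply the ConfigMap first and then the DaemonSet, and check that a plugin pod is running on each GPU node:

kubectl apply -f nvidia-device-plugin-config.yaml
kubectl apply -f nvidia-device-plugin-daemonset.yaml
# One plugin pod should be Running on every GPU node
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide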

After deployment, describing the node shows that a node with 4 physical GPUs now exposes 120 nvidia.com/gpu resources (4 GPUs × 30 time-slicing replicas):

# kubectl describe node/10.x.x.x | egrep -A 7 "Capacity" | egrep "Capacity|nvidia.com/gpu"
Capacity:
  nvidia.com/gpu:     120
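
To illustrate what time-slicing enables (the names and image below are placeholders), a Deployment with more GPU-requesting replicas than physical GPUs can now be scheduled onto the same node, because each replica consumes only one of the 120 shared nvidia.com/gpu slots:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-demo
spec:
  replicas: 8                       # more replicas than physical GPUs on the node
  selector:
    matchLabels:
      app: time-slicing-demo
  template:
    metadata:
      labels:
        app: time-slicing-demo
    spec:
      containers:
      - name: cuda
        image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1       # one shared GPU slot per replica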
