GPU virtualization with nvidia-device-plugin
An introduction to a GPU virtualization solution for Kubernetes: nvidia-device-plugin
Overview
The NVIDIA device plugin is deployed to a Kubernetes cluster as a DaemonSet. Once deployed, it:
- Exposes the number of GPUs on each node in the cluster
- Keeps track of the health of those GPUs
- Allows GPU-enabled containers to be run in the cluster
Prerequisites
- NVIDIA drivers ~= 384.81
- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
- nvidia-container-runtime configured as the default low-level runtime
- Kubernetes version >= 1.10
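A quick sanity check of these prerequisites on a GPU node might look like the following sketch (it assumes nvidia-smi, nvidia-container-cli, and kubectl are already installed and on the PATH):
# Installed NVIDIA driver version (must be ~= 384.81 or newer)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Version of the container toolkit / libnvidia-container stack
nvidia-container-cli --version
# Kubernetes version (must be >= 1.10)
kubectl version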
Quick start
Preparing the GPU nodes
1. Install nvidia-container-toolkit on every GPU node
2. Configure nvidia-container-runtime as the default container runtime (for Docker, via /etc/docker/daemon.json):
#cat /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "data-root": "/data/docker",
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
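After changing daemon.json, Docker has to be restarted for the new default runtime to take effect. A simple way to verify the node setup is to run nvidia-smi inside a container (a sketch; the CUDA image tag below is only an example):
# Restart Docker so the nvidia runtime becomes the default
sudo systemctl restart docker
# The container should list the node's GPUs
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi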
Enabling GPU support in Kubernetes
1. Deploy the nvidia-device-plugin DaemonSet
- DaemonSet manifest for nvidia-device-plugin-daemonset:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - command:
        - /usr/bin/nvidia-device-plugin
        - --config-file
        - /etc/nvidia-device-plugin/nvidia-device-plugin-config.yaml
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        - name: MIG_STRATEGY
          value: mixed
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /etc/nvidia-device-plugin
          name: nvidia-device-plugin-config
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.present: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - configMap:
          defaultMode: 420
          name: nvidia-device-plugin-config
        name: nvidia-device-plugin-config
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
- ConfigMap (CUDA Time-Slicing):
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  nvidia-device-plugin-config.yaml: |
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 30
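Assuming the two manifests above are saved locally (the file names below are only examples), they can be applied and checked like this:
# Create the time-slicing ConfigMap first, then the DaemonSet that mounts it
kubectl apply -f nvidia-device-plugin-config.yaml
kubectl apply -f nvidia-device-plugin-daemonset.yaml
# One plugin pod should be running on every GPU node
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide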
After the plugin is deployed, describing a node shows that a node with 4 physical GPUs now advertises 120 nvidia.com/gpu resources (4 GPUs × 30 time-slicing replicas):
#kubectl describe node/10.x.x.x |egrep -A 7 "Capacity" |egrep "Capacity|nvidia.com/gpu"
Capacity:
nvidia.com/gpu: 120
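With time-slicing enabled, workloads request the shared GPUs through the usual nvidia.com/gpu resource. A minimal test pod might look like this (the pod name and image tag are only examples):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # one time-sliced replica, not an exclusive physical GPU
Because replicas: 30 is configured, each physical GPU is advertised 30 times, so a limit of 1 here consumes one time-slice of a GPU rather than a whole card.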