【Blog 499】Fine-grained GPU allocation in Kubernetes with GPU slicing

Scenario:
Sometimes a pod needs a GPU but not an entire card. With the native scheme, where a whole GPU is allocated one-to-one to a pod, utilization is too low.

Solution:
Use the GPU device plugin's timeSlicing feature to share a GPU via time slices.
GPU time slicing
https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing

Example: NVIDIA GPU timeSlicing
When deploying the NVIDIA GPU device plugin, split each GPU into 100 shares:
kubectl get cm nvidia-config -n kube-system -o yaml
apiVersion: v1
data:
  config: |
    {
      "version": "v1",
      "sharing": {
        "timeSlicing": {
          "resources": [
            {
              "name": "nvidia.com/gpu",
              "replicas": 100
            }
          ]
        }
      }
    }
kind: ConfigMap
metadata:
  name: nvidia-config
  namespace: kube-system
kubectl get ds nvidia-device-plugin-daemonset -n kube-system -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: nvidia-device-plugin-ds
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/nvidia-gpu
                operator: Exists
      containers:
      - args:
        - --config-file=/etc/nvidia/config
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        image: nvcr.io/nvidia/k8s-device-plugin:v0.12.2
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /etc/nvidia
          name: config
      dnsPolicy: ClusterFirst
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - configMap:
          defaultMode: 420
          name: nvidia-config
        name: config
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
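With replicas set to 100, each allocatable nvidia.com/gpu unit now represents roughly 1/100 of a physical card's time rather than a whole card. A pod requests slices through the usual resource limits; a minimal sketch (the pod name, container name, and image are placeholders, not from the original post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-slice-demo            # placeholder name
spec:
  containers:
  - name: cuda-app                # placeholder name
    image: nvcr.io/nvidia/cuda:11.8.0-base-ubuntu22.04   # example CUDA base image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1         # one time slice, not a whole card
```

After the plugin restarts with the new config, the node's Allocatable should show 100 nvidia.com/gpu units per physical card (e.g. via `kubectl describe node <gpu-node>`).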
Where this NVIDIA scheme does and does not fit
Suitable for: nodes with a single GPU card
Not suitable for: nodes with multiple GPU cards, where a pod may end up seeing several cards
Workarounds:
1. Use failRequestsGreaterThanOne; see:
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
https://github.com/NVIDIA/k8s-device-plugin
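The failRequestsGreaterThanOne option makes the plugin reject any pod that requests more than one slice, which sidesteps the multi-card confusion by forcing one-slice requests. A sketch of the ConfigMap data with that flag enabled, based on the plugin's documented time-slicing options:

```yaml
config: |
  {
    "version": "v1",
    "sharing": {
      "timeSlicing": {
        "failRequestsGreaterThanOne": true,
        "resources": [
          {
            "name": "nvidia.com/gpu",
            "replicas": 100
          }
        ]
      }
    }
  }
```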
2. Use Alibaba Cloud's open-source GPU-sharing scheduler; see:
https://github.com/AliyunContainerService/gpushare-scheduler-extender
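With the gpushare scheduler extender, pods share a card by requesting GPU memory (in GiB) through the aliyun.com/gpu-mem extended resource instead of whole GPUs. A minimal sketch, assuming the extender and its device plugin are installed (pod name, container name, and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-demo             # placeholder name
spec:
  containers:
  - name: app                     # placeholder name
    image: your-cuda-app:latest   # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 3     # request 3 GiB of GPU memory on a shared card
```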