Deploying gpushare-scheduler-extender on Rancher
gpushare-scheduler-extender is Alibaba Cloud's GPU sharing (virtualization) solution for Kubernetes.
First, following https://blog.csdn.net/vah101/article/details/108098827 ("Configuring GPU nodes in a k8s cluster"), install k8s-device-plugin and set /etc/docker/daemon.json to:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
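Docker only picks up daemon.json changes after a restart. A quick check (assuming a systemd-managed host) that the default runtime took effect:
sudo systemctl restart docker
docker info | grep -i 'default runtime'   # should report: Default Runtime: nvidia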
1. Modify the kube-scheduler configuration of the existing cluster
Rancher differs slightly from vanilla Kubernetes: its kube-scheduler is not a standalone executable but runs as a Docker container. gpushare-scheduler-extender requires a change to the kube-scheduler configuration, so we first have to edit the RKE configuration and add the new settings to the cluster's RKE YAML.
In Rancher's cluster list, select a cluster and choose "Upgrade" from the menu button on its right.
Click the "Edit YAML" button and add the following under services:
scheduler:
  extra_args:
    address: 0.0.0.0
    kubeconfig: /etc/kubernetes/ssl/kubecfg-kube-scheduler.yaml
    leader-elect: 'true'
    policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json
    profiling: 'false'
    v: '2'
Note: the existing arguments can be found by running docker inspect kube-scheduler on a master node; on top of those, you only need to add policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json.
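For reference, this is one way to dump the running scheduler container's launch arguments on a master node:
# print the arguments the kube-scheduler container was started with
docker inspect kube-scheduler --format '{{json .Args}}' | python3 -m json.tool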
2. Copy scheduler-policy-config.json into /etc/kubernetes/ssl/ (run this on every master node)
cd /etc/kubernetes/ssl/
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json
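Since kube-scheduler will fail to start if this file is missing or malformed, it is worth validating the download on each master node:
# confirm the policy file is present and parses as JSON
python3 -m json.tool < /etc/kubernetes/ssl/scheduler-policy-config.json > /dev/null && echo OK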
3. Deploy gpushare-schd-extender
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
kubectl create -f gpushare-schd-extender.yaml
Note that with the default gpushare-schd-extender.yaml, gpushare-schd-extender does not get started on the master node. If your GPUs happen to be on a master node, delete the NoSchedule/nodeSelector entries that reference node-role.kubernetes.io/master, i.e.:
nodeSelector:
  node-role.kubernetes.io/master: ""
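After applying the yaml, verify the extender pod came up (the grep pattern simply matches the name used in the upstream yaml):
kubectl get pods -n kube-system | grep gpushare-schd-extender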
4. Deploy gpushare-device-plugin
Note: if nvidia-device-plugin was installed earlier, delete it first:
kubectl delete ds -n kube-system nvidia-device-plugin-daemonset
Then:
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
kubectl create -f device-plugin-ds.yaml
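The plugin runs as a DaemonSet in kube-system; it will only spawn pods on nodes carrying the gpushare=true label added in the next step, but you can already confirm the DaemonSet exists:
kubectl get ds -n kube-system | grep gpushare-device-plugin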
5. Label the GPU nodes
To get GPU workloads scheduled onto servers that actually have GPUs, label those nodes with gpushare=true.
From the command line:
# general form: kubectl label node <target_node> gpushare=true
# e.g., if our GPU server's hostname is GPU_NODE:
kubectl label node GPU_NODE gpushare=true
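Once the label is set, the device-plugin DaemonSet should start a pod on that node, and the node should begin advertising the aliyun.com/gpu-mem resource:
kubectl get pods -n kube-system -o wide | grep gpushare-device-plugin
kubectl describe node GPU_NODE | grep -i gpu-mem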
6. Extend the kubectl executable with the inspect-gpushare plugin:
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
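kubectl picks up any kubectl-* binary on the PATH as a plugin, so the tool can now be run as a subcommand to show per-node GPU memory allocation:
kubectl inspect gpushare        # summary: allocated vs. total GPU memory per node
kubectl inspect gpushare -d     # per-pod detail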
7. Fetch the sample programs:
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/1.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/2.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/3.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/4.yaml
Run whichever samples you need with kubectl create -f.
A PyTorch example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch
  labels:
    app: pytorch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch
  template:
    metadata:
      labels:
        app: pytorch
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch
        args: [/bin/sh, -c, 'while true; do sleep 1000; done']
        resources:
          limits:
            aliyun.com/gpu-mem: 2 # GPU memory to allocate (GiB by default)
        volumeMounts:
        - name: workspace
          mountPath: /workspace # path inside the container image
          readOnly: false # allow read-write
      volumes:
      - name: workspace
        hostPath:
          path: /var/pytorch # corresponding path on the host
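Once the pod is running you can peek at what the device plugin injected into the container; the exact variable names below are an assumption, so just inspect the full env if they don't match:
# the device plugin is expected to expose the allocation via env vars
# such as ALIYUN_COM_GPU_MEM_CONTAINER / ALIYUN_COM_GPU_MEM_DEV (assumption)
kubectl exec deploy/pytorch -- env | grep -i gpu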
A TensorFlow example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorflow-gpu-jupyter
  template:
    metadata:
      labels:
        app: tensorflow-gpu-jupyter
    spec:
      containers:
      - name: tensorflow-gpu-jupyter
        image: tensorflow/tensorflow:latest-gpu-jupyter
        resources:
          limits:
            aliyun.com/gpu-mem: 3
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  type: NodePort
  ports:
  - port: 8888
    targetPort: 8888
    nodePort: 30567
  selector:
    app: tensorflow-gpu-jupyter
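The Service publishes Jupyter on NodePort 30567, so the notebook is reachable at http://<node_ip>:30567; the image prints its access token to stdout, which can be pulled from the pod logs:
# fetch the Jupyter login token from the container logs
kubectl logs deploy/tensorflow-gpu-jupyter | grep token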