gpushare-scheduler-extender is Alibaba Cloud's GPU-sharing (virtualization) solution for the Kubernetes platform.

First, following the earlier post on configuring GPU nodes in a k8s cluster (vah101's CSDN column), install the NVIDIA k8s-device-plugin prerequisites and configure /etc/docker/daemon.json as:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
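After changing daemon.json, Docker has to be restarted for the new default runtime to take effect. As a quick sanity check (a sketch, assuming a systemd-based host):

systemctl restart docker
# "nvidia" should appear in the runtime list and as the default runtime
docker info | grep -i runtime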

1. Modify the kube-scheduler configuration of the existing cluster

Rancher differs slightly from vanilla Kubernetes: its kube-scheduler is not a standalone executable but a Docker image. gpushare-scheduler-extender needs changes to the kube-scheduler configuration, so the RKE configuration has to be modified first by adding the new settings to the RKE cluster YAML.

In Rancher's cluster list, pick a cluster and click the "Upgrade" button in the menu on its right.

Click "Edit YAML" and add the following under services:

    scheduler:
      extra_args:
        address: 0.0.0.0
        kubeconfig: /etc/kubernetes/ssl/kubecfg-kube-scheduler.yaml
        leader-elect: 'true'
        policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json
        profiling: 'false'
        v: '2'

Note: these values can be read off a master node with docker inspect kube-scheduler; starting from what is already there, simply add policy-config-file: /etc/kubernetes/ssl/scheduler-policy-config.json.
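If you want to double-check the existing flags before editing, one way (assuming the container is named kube-scheduler, as RKE names it) is to dump the scheduler container's start-up arguments on a master node:

# Print the arguments the kube-scheduler container was started with
docker inspect kube-scheduler --format '{{json .Args}}'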

2. Copy scheduler-policy-config.json into /etc/kubernetes/ssl/; do this on every master node

cd /etc/kubernetes/ssl/
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json
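For orientation, the downloaded policy file registers an HTTP scheduler extender that handles filter and bind for pods requesting aliyun.com/gpu-mem. Its structure is roughly the following sketch; treat the downloaded copy as authoritative:

{
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
        {
            "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
            "filterVerb": "filter",
            "bindVerb": "bind",
            "enableHttps": false,
            "nodeCacheCapable": true,
            "managedResources": [
                { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false }
            ],
            "ignorable": false
        }
    ]
}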

3. Deploy gpushare-schd-extender

curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
kubectl create -f gpushare-schd-extender.yaml

Note: with the default gpushare-schd-extender.yaml, gpushare-schd-extender is not started on the master node. If your GPU happens to sit on a master node, remove the node-role.kubernetes.io/master-related NoSchedule/nodeSelector entries, such as:

      nodeSelector:
         node-role.kubernetes.io/master: ""
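Once the YAML is applied, verify that the extender pod is up before moving on (the pod name is assumed to start with gpushare-schd-extender, as in the upstream manifest):

kubectl get pods -n kube-system -o wide | grep gpushare-schd-extender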

4. Deploy gpushare-device-plugin

Note: if nvidia-device-plugin was installed earlier, delete it first

kubectl delete ds -n kube-system nvidia-device-plugin-daemonset

Then:

wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f device-plugin-rbac.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
kubectl create -f device-plugin-ds.yaml
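The upstream daemonset carries a gpushare=true nodeSelector, so its pods will only appear once the nodes are labeled in the next step. To check that the objects were created:

kubectl get ds -n kube-system | grep gpushare
kubectl get pods -n kube-system | grep gpushare-device-plugin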

5. Label the GPU nodes

To get GPU workloads scheduled onto servers that actually have GPUs, label those nodes with gpushare=true.

From the command line:

# kubectl label node <target_node> gpushare=true
# e.g., if the GPU server's hostname is GPU_NODE
kubectl label node GPU_NODE gpushare=true
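To confirm the label took effect and that the device plugin is now advertising shared GPU memory on that node (GPU_NODE as above):

kubectl get nodes -l gpushare=true
# once the device-plugin pod is running there, the node should report aliyun.com/gpu-mem
kubectl describe node GPU_NODE | grep gpu-mem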

6. Install the kubectl-inspect-gpushare plugin binary:

cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
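kubectl picks the binary up as a plugin, so per-node GPU memory allocation can now be queried with:

kubectl inspect gpushare
# add -d for per-pod details
kubectl inspect gpushare -d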

7. Fetch the sample programs:

wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/1.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/2.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/3.yaml
wget https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/samples/4.yaml

Run the samples as needed with kubectl create -f.

A PyTorch example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch
  labels:
    app: pytorch
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: pytorch
  template: 
    metadata:
      labels:
        app: pytorch
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch
        args: ["/bin/sh", "-c", "while true; do sleep 1000; done"]
        resources:
          limits:
            aliyun.com/gpu-mem: 2
        volumeMounts:
        - name: workspace
          mountPath: /workspace        # path inside the container image
          readOnly: false              # allow read-write access
      volumes:
      - name: workspace
        hostPath:
          path: /var/pytorch          # corresponding path on the host
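With the default device-plugin settings the aliyun.com/gpu-mem unit is GiB, so the limit above requests 2 GiB of GPU memory. Assuming the manifest is saved as pytorch.yaml (a hypothetical filename), deploy and check the allocation with:

kubectl create -f pytorch.yaml
kubectl inspect gpushare      # the pod's 2 GiB should show up against the node's total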

A TensorFlow example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: tensorflow-gpu-jupyter
  template: 
    metadata:
      labels:
        app: tensorflow-gpu-jupyter
    spec:
      containers:
      - name: tensorflow-gpu-jupyter
        image: tensorflow/tensorflow:latest-gpu-jupyter
        resources:
          limits:
            aliyun.com/gpu-mem: 3
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  type: NodePort
  ports:
  - port: 8888
    targetPort: 8888
    nodePort: 30567
  selector:
    app: tensorflow-gpu-jupyter
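The Service exposes Jupyter on NodePort 30567, so the notebook should be reachable at http://<node-ip>:30567. The login token can be fished out of the container log, for example:

kubectl logs deployment/tensorflow-gpu-jupyter | grep token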
