Kubernetes GPU Environment Setup

夏修理 · Published 2020-07-07 12:15:38

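K8S GPU Resource Scheduling

Applicable Scenarios

K8S clusters at v1.13.0 or later that need to schedule GPU compute resources for AI applications built on frameworks such as TensorFlow, Caffe and PyTorch, for example video transcoding, face recognition, invoice recognition and content moderation.

Prerequisites

Install a k8s cluster of v1.13.0 or later on CentOS 7.5+ machines that have NVIDIA GPUs;
Install the NVIDIA GPU driver and make sure the driver version matches the version of the NVIDIA libraries.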
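
The driver/library consistency requirement can be checked from the shell. The following is a minimal sketch, not part of the original procedure. It assumes the loaded kernel module reports its version in /proc/driver/nvidia/version and that the user-space NVML library is installed as /usr/lib64/libnvidia-ml.so.<version> (the library path is an assumption and may differ on your hosts). The helper names extract_version and versions_match are hypothetical.

```shell
# Hypothetical helpers for the prerequisite check; names are illustrative only.

# Print the first dotted version number found on stdin, e.g. "440.33.01".
extract_version() {
  grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed only when both version strings are non-empty and identical.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage on a GPU node (not run here; the library path is an assumption):
#   kmod=$(extract_version < /proc/driver/nvidia/version)
#   lib=$(ls /usr/lib64/libnvidia-ml.so.* | extract_version)
#   versions_match "$kmod" "$lib" || echo "driver/library version mismatch"
```

When the two versions differ, nvidia-smi typically fails with "Failed to initialize NVML: Driver/library version mismatch", so a plain nvidia-smi run on the host is a quick alternative check.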

Environment Setup

Docker runtime

Update the repo source

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
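
To see which repository path the command above will request, it helps to print the distribution string first. A small sketch, assuming a standard os-release file; the function name distribution_from is hypothetical. On CentOS 7, where ID="centos" and VERSION_ID="7", the result is centos7.

```shell
# Print the "$ID$VERSION_ID" string for a given os-release file.
# The subshell keeps the sourced variables out of the caller's environment.
distribution_from() {
  ( . "$1" && printf '%s\n' "$ID$VERSION_ID" )
}

# Usage (not run here):
#   distribution_from /etc/os-release
```

If the printed value is not a directory published under https://nvidia.github.io/nvidia-container-runtime/, the curl step would save an error page instead of a valid repo file, so it is worth inspecting /etc/yum.repos.d/nvidia-container-runtime.repo afterwards.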

Install the GPU runtime

yum install nvidia-container-runtime

Modify the Docker configuration

vim /etc/docker/daemon.json

// Example daemon.json:
{
  "graph": "/data/docker/runtime",
  "insecure-registries": ["hub.xxx.xxx.com"],
  "default-runtime": "nvidia",
  "max-concurrent-downloads": 10,
  "max-concurrent-uploads": 5,
  "tls": false,
  "log-level": "info",
  "exec-root": "/data/docker/exec",
  "storage-driver": "overlay2",
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
      }
  },
  "hosts": ["unix:///var/run/docker.sock","tcp://0.0.0.0:6372"]
}
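
Before restarting anything, it can be useful to confirm that the edited file really sets nvidia as the default runtime. The sketch below uses only text tools and does not validate the JSON as a whole; the function name default_runtime_of is hypothetical.

```shell
# Print the value of "default-runtime" from a daemon.json-style file.
# Text-based extraction only; it does not check that the file is valid JSON.
default_runtime_of() {
  grep -oE '"default-runtime"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" |
    sed -E 's/.*:[[:space:]]*"([^"]*)"/\1/'
}

# Usage on the Docker host (not run here):
#   [ "$(default_runtime_of /etc/docker/daemon.json)" = "nvidia" ] \
#     || echo "default runtime is not nvidia"
#   docker info --format '{{.DefaultRuntime}}'   # once docker has been restarted
```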

Restart components

systemctl restart docker kubelet kube-proxy

Test the GPU

// Inside the container, run nvidia-smi; if it reports no errors, the GPU dependencies are working
docker run -it registry.cn-hangzhou.aliyuncs.com/docker_learning_aliyun/caffe:v1 /bin/bash
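
The same check can be done without an interactive shell. A minimal sketch, assuming nvidia-smi is available inside the container once the nvidia runtime is the default; the function name gpu_smoke_test is hypothetical.

```shell
# Run nvidia-smi in a throwaway container and report the outcome.
# $1: image name. Relies on "default-runtime": "nvidia" in daemon.json.
gpu_smoke_test() {
  if docker run --rm "$1" nvidia-smi >/dev/null 2>&1; then
    echo "GPU OK: $1"
  else
    echo "GPU check failed: $1"
    return 1
  fi
}

# Usage on the Docker host (not run here):
#   gpu_smoke_test registry.cn-hangzhou.aliyuncs.com/docker_learning_aliyun/caffe:v1
```

The non-zero return value makes the function usable in provisioning scripts that should stop when the GPU is not visible to containers.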

K8S GPU Scheduling Plugin

Install the nvidia-device-plugin plugin

kubectl apply -f nvidia-device-plugin.yaml

// Example nvidia-device-plugin.yaml
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # added by X.L.Xia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta4
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
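
Because this manifest adds a nodeAffinity that requires the accelerator label, the DaemonSet schedules no pods until at least one node carries that label. A small sketch for checking how many pods it wants and how many are ready; the function name plugin_pod_counts is hypothetical.

```shell
# Print "<desired> <ready>" pod counts for the device-plugin DaemonSet.
plugin_pod_counts() {
  kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}'
}

# Usage against the cluster (not run here):
#   plugin_pod_counts    # a desired count of 0 means no node has the accelerator label yet
```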

Label the GPU k8s worker nodes

kubectl label node xxx.xxx.xxx.xx accelerator=nvidia-tesla-t4
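
After labeling, it is worth confirming that the node is selected and that the device plugin has registered the nvidia.com/gpu resource. A sketch using standard kubectl output options; the function names gpu_nodes and allocatable_gpus are hypothetical.

```shell
# List nodes carrying the accelerator label required by the DaemonSet's nodeAffinity.
gpu_nodes() {
  kubectl get nodes -l accelerator -o name
}

# Print the number of GPUs a node advertises as allocatable.
# Empty output means nvidia.com/gpu has not been registered on that node yet.
allocatable_gpus() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Usage against the cluster (not run here):
#   gpu_nodes
#   allocatable_gpus xxx.xxx.xxx.xx
```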

Test a k8s GPU application

// Example transcode-dep.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcode-dep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: transcode
  template:
    metadata:
      labels:
        app: transcode
    spec:
      containers:
      - image: jstranscodeserver:1.8.5
        name: transcode-container
        ports:
        - containerPort: 80
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
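
Once the Deployment is applied, the GPU can be checked from inside the pod. A minimal sketch, assuming nvidia-smi is available in the container; the function name transcode_pod is hypothetical.

```shell
# Print the name of the first pod matching the app=transcode label.
transcode_pod() {
  kubectl get pods -l app=transcode -o jsonpath='{.items[0].metadata.name}'
}

# Usage against the cluster (not run here):
#   kubectl apply -f transcode-dep.yaml
#   kubectl exec "$(transcode_pod)" -- nvidia-smi
#   kubectl describe pod "$(transcode_pod)"
```

If the pod stays Pending and its events mention "Insufficient nvidia.com/gpu", no node currently has a free GPU to satisfy the limit.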

// Example transcode-svc
apiVersion: v1
kind: Service
metadata:
  labels:
    app: transcode
  name: transcode
spec:
  type: NodePort
  ports:
  - name: http
    port: 80
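    nodePort: 35001  # outside the default NodePort range (30000-32767); needs a widened --service-node-port-range on kube-apiserver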
    targetPort: http
  selector:
    app: transcode
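
A NodePort value is only accepted if it falls inside the range configured on the API server, which defaults to 30000-32767. The sketch below checks a port against that default range; the function name in_default_nodeport_range is hypothetical.

```shell
# Succeed when a port lies in the default NodePort range 30000-32767.
in_default_nodeport_range() {
  [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

# Usage (not run here):
#   in_default_nodeport_range 35001 || echo "35001 needs a custom --service-node-port-range"
```

With a port inside the configured range, the Service is reachable at http://<node-ip>:<nodePort> on any node of the cluster.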
