[Blog 502] How the Nvidia k8s GPU plugin works


lulu的云原生笔记 · published 2022-09-24 15:27:17


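The workflow for managing and scheduling Nvidia GPU devices in Kubernetes has two parts:

1. How to use a GPU inside a container
2. How Kubernetes schedules GPUs

1. How to use a GPU inside a container

For an application in a container to operate a GPU, two goals must be met:

1. The GPU devices are visible inside the container
2. The application running in the container can operate the GPU through the Nvidia driver

See the previous post: Nvidia docker原理 (How Nvidia docker works)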

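2. How Kubernetes schedules GPUs: the Nvidia plugin

To manage and schedule GPUs in Kubernetes, Nvidia provides a Device Plugin for Nvidia GPUs. Its main functions are:

1. It implements the ListAndWatch interface, which reports the number of GPUs on the node
2. It implements the Allocate interface, which handles GPU allocation

Source walkthrough of ListAndWatch and Allocate in the Nvidia k8s plugin: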

// ListAndWatch lists devices and update that list according to the health status
func (plugin *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})

    for {
        select {
        case <-plugin.stop:
            return nil
        case d := <-plugin.health:
            // A device reported a health problem: mark it unhealthy
            // FIXME: there is no way to recover from the Unhealthy state.
            d.Health = pluginapi.Unhealthy
            log.Printf("'%s' device marked unhealthy: %s", plugin.rm.Resource(), d.ID)
            // Resend the updated device list to the kubelet
            s.Send(&pluginapi.ListAndWatchResponse{Devices: plugin.apiDevices()})
        }
    }
}

// Allocate assigns GPUs: it tells the kubelet which NVIDIA_VISIBLE_DEVICES environment variable to attach to the container
func (plugin *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    responses := pluginapi.AllocateResponse{}
    // Handle each container request in turn
    for _, req := range reqs.ContainerRequests {
 
        if plugin.config.Sharing.TimeSlicing.FailRequestsGreaterThanOne && rm.AnnotatedIDs(req.DevicesIDs).AnyHasAnnotations() {
            if len(req.DevicesIDs) > 1 {
                return nil, fmt.Errorf("request for '%v: %v' too large: maximum request size for shared resources is 1", plugin.rm.Resource(), len(req.DevicesIDs))
            }
        }
        // Check that every requested device ID is one this plugin manages, i.e. one it advertised
        for _, id := range req.DevicesIDs {
            if !plugin.rm.Devices().Contains(id) {
                return nil, fmt.Errorf("invalid allocation request for '%s': unknown device: %s", plugin.rm.Resource(), id)
            }
        }

        response := pluginapi.ContainerAllocateResponse{}
        // Convert the advertised device IDs into concrete GPU IDs
        ids := req.DevicesIDs
        deviceIDs := plugin.deviceIDsFromAnnotatedDeviceIDs(ids)
        // Store the allocated devices in Envs; the container runtime (runC under docker) later injects them into the container as environment variables
        if *plugin.config.Flags.Plugin.DeviceListStrategy == spec.DeviceListStrategyEnvvar {
            response.Envs = plugin.apiEnvs(plugin.deviceListEnvvar, deviceIDs)
        }
        if *plugin.config.Flags.Plugin.DeviceListStrategy == spec.DeviceListStrategyVolumeMounts {
            response.Envs = plugin.apiEnvs(plugin.deviceListEnvvar, []string{deviceListAsVolumeMountsContainerPathRoot})
            response.Mounts = plugin.apiMounts(deviceIDs)
        }
        if *plugin.config.Flags.Plugin.PassDeviceSpecs {
            response.Devices = plugin.apiDeviceSpecs(*plugin.config.Flags.NvidiaDriverRoot, ids)
        }
        if *plugin.config.Flags.GDSEnabled {
            response.Envs["NVIDIA_GDS"] = "enabled"
        }
        if *plugin.config.Flags.MOFEDEnabled {
            response.Envs["NVIDIA_MOFED"] = "enabled"
        }

        responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }

    return &responses, nil
}
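
The heart of Allocate is turning the requested device IDs into an environment variable. That step can be sketched without the plugin's dependencies. This is only a minimal sketch: buildVisibleDevicesEnv and the sample IDs are hypothetical names made up for illustration, not the plugin's own helpers, and it assumes the envvar device-list strategy, where the variable holds a comma-separated list of IDs.

```go
package main

import (
	"fmt"
	"strings"
)

// buildVisibleDevicesEnv is a hypothetical helper, not part of the plugin.
// It rejects IDs the plugin never advertised, then returns the environment
// map for one container, with the IDs joined by commas.
func buildVisibleDevicesEnv(known map[string]bool, requested []string) (map[string]string, error) {
	for _, id := range requested {
		if !known[id] {
			return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
		}
	}
	return map[string]string{
		"NVIDIA_VISIBLE_DEVICES": strings.Join(requested, ","),
	}, nil
}

func main() {
	// Assumed sample IDs; real IDs are GPU UUIDs reported by the driver.
	known := map[string]bool{"GPU-aaa": true, "GPU-bbb": true}

	envs, err := buildVisibleDevicesEnv(known, []string{"GPU-aaa", "GPU-bbb"})
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	fmt.Println("NVIDIA_VISIBLE_DEVICES =", envs["NVIDIA_VISIBLE_DEVICES"])
}
```

Validating before building the map mirrors the order in Allocate above: an unknown ID fails the whole request and no partial response is returned.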

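The whole process by which Kubernetes schedules a GPU is as follows:

1. The GPU Device plugin is deployed on each GPU node and, through the ListAndWatch interface,
   reports the node's GPU information and the corresponding DeviceIDs.

2. When a Pod that requests nvidia.com/gpu is created, the scheduler takes the free GPU devices
   into account and places the Pod on a node with enough GPUs.

3. When the kubelet on that node starts the Pod, it sees from the request that a container
   declares GPUs. Using the Device information it received earlier through ListAndWatch,
   it picks suitable devices and calls the GPU Device plugin's Allocate interface
   with their DeviceIDs as arguments.

4. On receiving the call, the GPU Device plugin converts the DeviceIDs into the
   NVIDIA_VISIBLE_DEVICES environment variable and returns it to the kubelet.

5. The kubelet injects the returned environment variable into the container and starts the container.

6. At container start, gpu-container-runtime calls gpu-containers-runtime-hook.
   Nvidia's gpu-container-runtime uses the container's NVIDIA_VISIBLE_DEVICES environment variable
   to decide whether this is a GPU container and which GPU devices it may use.
   If the container does not carry NVIDIA_VISIBLE_DEVICES,
   it is started the ordinary docker way.

7. gpu-containers-runtime-hook converts the container's NVIDIA_VISIBLE_DEVICES environment variable
   into a --devices argument and calls nvidia-container-cli prestart.
   nvidia-container-cli then maps the GPU devices into the container according to --devices,
   along with the .so files of the host's Nvidia driver libraries.
   The container can then call the host's Nvidia driver through those .so files.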
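
The decision in steps 6 and 7 comes down to reading one environment variable. The sketch below illustrates that check under stated assumptions: visibleDevices is a hypothetical function, not the runtime hook's code, and it ignores special values such as all that the real variable also accepts.

```go
package main

import (
	"fmt"
	"strings"
)

// visibleDevices is a hypothetical helper. It scans a container's
// environment (KEY=VALUE strings) for NVIDIA_VISIBLE_DEVICES and returns
// the requested GPU IDs. ok is false when the variable is absent or empty,
// in which case the container would be started like an ordinary one.
func visibleDevices(env []string) (ids []string, ok bool) {
	const prefix = "NVIDIA_VISIBLE_DEVICES="
	for _, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			value := strings.TrimPrefix(kv, prefix)
			if value == "" {
				return nil, false
			}
			return strings.Split(value, ","), true
		}
	}
	return nil, false
}

func main() {
	// Assumed sample environment, as the kubelet would have injected it.
	env := []string{"PATH=/usr/bin", "NVIDIA_VISIBLE_DEVICES=GPU-aaa,GPU-bbb"}

	if ids, ok := visibleDevices(env); ok {
		// Step 7: the IDs are passed on as a single device argument.
		fmt.Println("GPU container, devices:", strings.Join(ids, ","))
	} else {
		fmt.Println("ordinary container")
	}
}
```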

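Summary of the mechanism

1. The device plugin starts its own service at (/var/lib/kubelet/device-plugins/sock.sock).

2. The device plugin sends a registration request to (/var/lib/kubelet/device-plugins/kubelet.sock), containing its resource name and its own service address /var/lib/kubelet/device-plugins/sock.sock.

3. On receiving the request, the device manager allocates a new endpoint, which connects to the device plugin and communicates with it through the plugin's ListAndWatch.

4. When the device plugin's ListAndWatch reports a change, the corresponding endpoint notices it and, through a callback, tells the device manager to update its resources and the matching device information (healthyDevices and unhealthyDevices).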
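
The bookkeeping in point 4 can be illustrated with a small simulation. It is a sketch only: device and deviceManager are hypothetical stand-ins for the kubelet's internal types, and a plain channel stands in for the ListAndWatch gRPC stream.

```go
package main

import "fmt"

// device is a hypothetical stand-in for one advertised device.
type device struct {
	ID      string
	Healthy bool
}

// deviceManager is a hypothetical stand-in for the kubelet's device
// manager. Both maps go from resource name to a set of device IDs.
type deviceManager struct {
	healthy   map[string]map[string]bool
	unhealthy map[string]map[string]bool
}

func newDeviceManager() *deviceManager {
	return &deviceManager{
		healthy:   map[string]map[string]bool{},
		unhealthy: map[string]map[string]bool{},
	}
}

// update plays the role of the callback: it replaces the recorded state
// of one resource with the latest list received from the plugin.
func (m *deviceManager) update(resource string, devs []device) {
	m.healthy[resource] = map[string]bool{}
	m.unhealthy[resource] = map[string]bool{}
	for _, d := range devs {
		if d.Healthy {
			m.healthy[resource][d.ID] = true
		} else {
			m.unhealthy[resource][d.ID] = true
		}
	}
}

// allocatable returns how many healthy devices a resource currently has.
func (m *deviceManager) allocatable(resource string) int {
	return len(m.healthy[resource])
}

func main() {
	m := newDeviceManager()

	// Two updates: the initial list, then one where GPU-bbb turns unhealthy.
	updates := make(chan []device, 2)
	updates <- []device{{"GPU-aaa", true}, {"GPU-bbb", true}}
	updates <- []device{{"GPU-aaa", true}, {"GPU-bbb", false}}
	close(updates)

	for devs := range updates {
		m.update("nvidia.com/gpu", devs)
		fmt.Println("allocatable nvidia.com/gpu:", m.allocatable("nvidia.com/gpu"))
	}
}
```

Replacing the whole set on every update matches the plugin side shown earlier, where each ListAndWatch message carries the full device list, not a delta.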

