Verifying that the GPU environment under Kubernetes is ready

1. Check that the host-level driver environment is ready

nvidia-smi
Mon Dec 12 11:59:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0 Off |                  N/A |
| 30%   32C    P0    N/A / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
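For scripted checks, nvidia-smi can report the same information in machine-readable form; a minimal sketch using its documented query fields:

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv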

2. Check that the GPU driver environment at the Docker layer is ready

sudo docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
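
If this step fails with an error such as the device driver not being selectable, the NVIDIA Container Toolkit is usually missing or Docker's nvidia runtime is not registered. A quick check, assuming the toolkit was installed from the nvidia-container-toolkit package:

# the nvidia runtime should appear among Docker's runtimes
docker info | grep -i runtimes

# the toolkit CLI should be able to enumerate the GPUs
nvidia-container-cli info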

3. Check that the GPU device is usable inside Docker

Build a test image; the Dockerfile is as follows:

FROM nvidia/cuda:11.1.1-devel AS builder

WORKDIR /build

COPY . /build/

RUN make

FROM nvidia/cuda:11.1.1-runtime

COPY --from=builder /build/gpu_burn /app/
COPY --from=builder /build/compare.cu /app/

WORKDIR /app

CMD ["./gpu_burn", "60"]


docker build . -t gpu-burn:cuda11.1 

Test GPU availability with Docker:

sudo docker run -it  --gpus=all  registry.cn-hangzhou.aliyuncs.com/mkmk/all:gpu-burn-cuda11.1  "/app/gpu_burn"  "10"
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-837c707b-1370-883f-d4d2-6da24bdaaba7)
Initialized device 0 with 24268 MB of memory (23744 MB available, using 21370 MB of it), using FLOATS
30.0%  proc'd: 1333 (9347 Gflop/s)   errors: 0   temps: 43 C
    Summary at:   Mon Dec 12 04:02:49 UTC 2022

60.0%  proc'd: 2666 (14759 Gflop/s)   errors: 0   temps: 43 C
    Summary at:   Mon Dec 12 04:02:52 UTC 2022

80.0%  proc'd: 5332 (14807 Gflop/s)   errors: 0   temps: 54 C
    Summary at:   Mon Dec 12 04:02:54 UTC 2022

100.0%  proc'd: 6665 (14763 Gflop/s)   errors: 0   temps: 54 C
    Summary at:   Mon Dec 12 04:02:57 UTC 2022


Killing processes.. Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
    GPU 0: OK

4. Check that the GPU is usable in the Kubernetes environment

// test pod
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
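
Apply the manifest and confirm that the node actually advertises the nvidia.com/gpu resource; the file name and node name below are placeholders, and this assumes the NVIDIA device plugin is already running on the cluster:

kubectl apply -f gpu-pod.yaml
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu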

// expected output
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
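
If the pod stays Pending instead, a common cause is that the device plugin is not running on the GPU node. A quick check, assuming the plugin was deployed into kube-system (the exact pod name varies by installation method):

kubectl get pods -n kube-system | grep -i nvidia-device-plugin
kubectl describe pod gpu-pod | grep -A 10 Events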