【Blog 556】Checking whether the GPU environment under k8s is ready
1. Check that the local driver layer is ready
nvidia-smi
Mon Dec 12 11:59:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:08:00.0 Off | N/A |
| 30% 32C P0 N/A / 350W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
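The driver check above can be scripted for automation. A minimal sketch: `nvidia-smi` exits non-zero when the driver or device is missing, so its exit code is enough. The `probe` parameter is an assumption added only to make the logic testable; in practice it defaults to `nvidia-smi`:

```shell
#!/bin/sh
# Print "ready" when the driver probe command succeeds, "not-ready" otherwise.
# probe defaults to nvidia-smi; it is parameterized only for testing.
check_driver() {
  probe="${1:-nvidia-smi}"
  if $probe >/dev/null 2>&1; then
    echo "ready"
  else
    echo "not-ready"
  fi
}

check_driver  # on a healthy host this should print "ready"
```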
2. Check that the GPU driver environment at the Docker layer is ready
sudo docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
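If the sample container fails to start, a common cause is that the NVIDIA container runtime was never registered with dockerd. A rough sketch of a quick check, looking for "nvidia" in the Runtimes line of `docker info` (the helper takes the text as an argument so the logic is testable without Docker):

```shell
#!/bin/sh
# Print "yes" if the given docker-info text mentions the nvidia runtime, else "no".
has_nvidia_runtime() {
  printf '%s\n' "$1" | grep -q 'nvidia' && echo "yes" || echo "no"
}

# On a real host, feed it the Runtimes line from docker info:
if command -v docker >/dev/null 2>&1; then
  has_nvidia_runtime "$(docker info 2>/dev/null | grep -i 'runtimes')"
fi
```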
3. Check that GPU devices are usable inside Docker
Build a test image; the Dockerfile is as follows:
# Build stage: compile gpu_burn from the sources in the build context
FROM nvidia/cuda:11.1.1-devel AS builder
WORKDIR /build
COPY . /build/
RUN make

# Runtime stage: ship only the binary and the CUDA kernel source it loads
FROM nvidia/cuda:11.1.1-runtime
COPY --from=builder /build/gpu_burn /app/
COPY --from=builder /build/compare.cu /app/
WORKDIR /app
CMD ["./gpu_burn", "60"]
docker build . -t gpu-burn:cuda11.1
Test GPU availability with Docker:
sudo docker run -it --gpus=all registry.cn-hangzhou.aliyuncs.com/mkmk/all:gpu-burn-cuda11.1 "/app/gpu_burn" "10"
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-837c707b-1370-883f-d4d2-6da24bdaaba7)
Initialized device 0 with 24268 MB of memory (23744 MB available, using 21370 MB of it), using FLOATS
30.0% proc'd: 1333 (9347 Gflop/s) errors: 0 temps: 43 C
Summary at: Mon Dec 12 04:02:49 UTC 2022
60.0% proc'd: 2666 (14759 Gflop/s) errors: 0 temps: 43 C
Summary at: Mon Dec 12 04:02:52 UTC 2022
80.0% proc'd: 5332 (14807 Gflop/s) errors: 0 temps: 54 C
Summary at: Mon Dec 12 04:02:54 UTC 2022
100.0% proc'd: 6665 (14763 Gflop/s) errors: 0 temps: 54 C
Summary at: Mon Dec 12 04:02:57 UTC 2022
Killing processes.. Freed memory for dev 0
Uninitted cublas
done
Tested 1 GPUs:
GPU 0: OK
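A pass/fail decision can be derived from the progress lines above: a run is healthy when every `errors:` count is 0. A small sketch of the parsing step:

```shell
#!/bin/sh
# Extract the error count from one gpu_burn progress line.
burn_errors() {
  printf '%s\n' "$1" | sed -n 's/.*errors: \([0-9][0-9]*\).*/\1/p'
}

burn_errors "100.0% proc'd: 6665 (14763 Gflop/s) errors: 0 temps: 54 C"  # prints 0
```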
4. Check that GPUs are usable under k8s
Test pod manifest:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
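The manifest above still needs to be applied before the logs can be read. A hedged sketch, assuming it is saved as `gpu-pod.yaml` (hypothetical filename); the helper maps the pod phase to a verdict and is kept pure so it can be tested without a cluster:

```shell
#!/bin/sh
# Map a pod phase to a verdict: the CUDA sample pod should end in Succeeded.
pod_verdict() {
  case "$1" in
    Succeeded) echo "OK" ;;
    Failed)    echo "FAILED" ;;
    *)         echo "NOT-FINISHED" ;;
  esac
}

if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f gpu-pod.yaml
  pod_verdict "$(kubectl get pod gpu-pod -o jsonpath='{.status.phase}')"
fi
```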
Expected output:
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
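Beyond the pod test, it is worth confirming that the device plugin has actually advertised GPUs to the scheduler, i.e. that `nvidia.com/gpu` appears under the node's Capacity/Allocatable. A sketch that filters `kubectl describe node` output (the helper takes text as an argument so it is testable offline):

```shell
#!/bin/sh
# Print the lines that report nvidia.com/gpu capacity or allocatable.
gpu_resource_lines() {
  printf '%s\n' "$1" | grep 'nvidia.com/gpu'
}

if command -v kubectl >/dev/null 2>&1; then
  gpu_resource_lines "$(kubectl describe nodes)"
fi
```

An empty result here usually means the nvidia device plugin DaemonSet is not running on the node.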