How to Use GPUs in Kubernetes (K8s)
I. Install the NVIDIA Driver on the GPU Nodes
1 Check the graphics card information
lspci | grep -i vga
02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
3d:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
47:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
lspci -v -s 3d:00.0
3d:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 12ae
Physical Slot: 0
Flags: bus master, fast devsel, latency 0, IRQ 637, NUMA node 0
Memory at b7000000 (32-bit, non-prefetchable) [size=16M]
Memory at 3affe0000000 (64-bit, prefetchable) [size=256M]
Memory at 3afff0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 8000 [size=128]
[virtual] Expansion ROM at b8000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Capabilities: [bb0] #15
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
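As a quick sanity check, the NVIDIA GPU count can be pulled straight out of the lspci listing. The helper below is a small sketch: the heredoc replays the sample output above, and on a real node you would pipe `lspci` into the function instead.

```shell
# Sketch: count NVIDIA VGA controllers in lspci output.
# On a real node, use:  lspci | count_nvidia_gpus
count_nvidia_gpus() {
  grep -ci 'VGA compatible controller: NVIDIA'
}

# Replay the sample listing above through the helper.
gpu_count=$(count_nvidia_gpus <<'EOF'
02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
3d:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
45:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
47:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
EOF
)
echo "$gpu_count"   # 4 for the sample above
```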
2 Disable the system's default nouveau driver
Check whether nouveau is currently loaded:
lsmod | grep nouveau
Edit /etc/modprobe.d/blacklist.conf, add the following line, and save:
vim /etc/modprobe.d/blacklist.conf
blacklist nouveau
If the file does not exist, create it with:
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
Rebuild the initramfs image:
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
Reboot:
reboot
Check again whether nouveau is loaded; empty output means it was disabled successfully:
lsmod | grep nouveau
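The blacklist step above can be wrapped in an idempotent helper so that re-running the setup does not duplicate entries. This is a sketch: the file path is passed as a parameter rather than hard-coded, and on a real node it must be run as root against /etc/modprobe.d/blacklist.conf.

```shell
# Sketch: append the nouveau blacklist entries only if they are not
# already present (idempotent, safe to re-run).
add_nouveau_blacklist() {
  # $1: blacklist file, normally /etc/modprobe.d/blacklist.conf
  if ! grep -qxF 'blacklist nouveau' "$1" 2>/dev/null; then
    printf 'blacklist nouveau\noptions nouveau modeset=0\n' >> "$1"
  fi
}

# Usage on a real node (requires root):
# add_nouveau_blacklist /etc/modprobe.d/blacklist.conf
```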
3 Install dependencies
Check the kernel version:
uname -r
3.10.0-957.el7.x86_64
Install the matching dependencies:
yum install kernel-devel-$(uname -r) gcc dkms
If no matching packages are available from the repositories, download and install them manually.
Download: http://vault.centos.org/7.6.1810/os/x86_64/Packages/
For example:
rpm -ivh kernel-headers-3.10.0-957.el7.x86_64.rpm
rpm -ivh kernel-devel-3.10.0-957.el7.x86_64.rpm
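The download URL for a matching kernel package can be assembled from the vault link above. The helper below is hypothetical (the function name and the 7.6.1810 release path are assumptions based on the link given in this guide); the kernel-devel version must match `uname -r` exactly or the driver build will fail.

```shell
# Hypothetical helper: build the vault.centos.org URL for a kernel
# package matching the running kernel. The 7.6.1810 release path comes
# from the download link above and would differ on other CentOS releases.
vault_pkg_url() {
  # $1: package name (kernel-devel, kernel-headers); $2: kernel release
  echo "http://vault.centos.org/7.6.1810/os/x86_64/Packages/${1}-${2}.rpm"
}

# Example; on a real node, pass "$(uname -r)" as the second argument.
url=$(vault_pkg_url kernel-devel 3.10.0-957.el7.x86_64)
echo "$url"
```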
4 Download and install a driver matching the GPU and kernel
Download: https://www.nvidia.cn/Download/index.aspx?lang=cn
Make the installer executable:
chmod u+x NVIDIA-Linux-x86_64-440.100.run
From the directory containing the installer, install the driver while skipping NVIDIA's default OpenGL files:
./NVIDIA-Linux-x86_64-440.100.run --no-opengl-files
If running nvidia-smi produces output like the following, the installation succeeded:
nvidia-smi
Mon Sep 28 10:46:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:3D:00.0 Off | N/A |
| 22% 27C P8 18W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:41:00.0 Off | N/A |
| 22% 26C P8 8W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:45:00.0 Off | N/A |
| 22% 27C P8 1W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:47:00.0 Off | N/A |
| 22% 27C P8 17W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
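For scripted checks, the driver and CUDA versions can be parsed out of the nvidia-smi banner line. This is a sketch (the function name is an assumption); the example replays the banner line from the output above, while on a real node you would pipe `nvidia-smi` into the helper.

```shell
# Sketch: extract the NVIDIA-SMI, driver, and CUDA versions from the
# nvidia-smi banner line. On a real node:  nvidia-smi | parse_versions
parse_versions() {
  sed -n 's/.*NVIDIA-SMI \([0-9.]*\).*Driver Version: \([0-9.]*\).*CUDA Version: \([0-9.]*\).*/\1 \2 \3/p'
}

# Replay the banner line from the sample output above.
versions=$(echo '| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2 |' | parse_versions)
echo "$versions"   # 440.100 440.100 10.2
```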
II. Install nvidia-docker2 on the GPU Nodes
1 Prerequisites:
GNU/Linux x86_64 with kernel version > 3.10
Docker >= 1.12
NVIDIA GPU with Architecture > Fermi (2.1)
NVIDIA drivers ~= 361.93 (untested on older versions)
2 Configure the repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
3 Install
sudo yum install nvidia-docker2
sudo pkill -SIGHUP dockerd
In an offline environment, download the packages manually first:
yum install --downloadonly --downloaddir=<your_dir> <package-name>
Here the downloaded packages were bundled as nvidia-docker2_rpms.zip.
Unpack:
unzip nvidia-docker2_rpms.zip
Install nvidia-docker2 offline:
yum install libnvidia-container1-1.0.0-0.1.beta.1.x86_64.rpm
yum install libnvidia-container-tools-1.0.0-0.1.beta.1.x86_64.rpm
yum install nvidia-container-runtime-hook-1.3.0-1.x86_64.rpm
yum install nvidia-container-runtime-2.0.0-1.docker18.03.0.x86_64.rpm
yum install nvidia-docker2-2.0.3-1.docker18.03.0.ce.noarch.rpm
4 After installation, register the new Docker runtime and set nvidia as the default runtime
Edit /etc/docker/daemon.json:
vim /etc/docker/daemon.json
Set the contents as follows and save:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
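A malformed daemon.json keeps the Docker daemon from starting after the restart in the next step, so it is worth validating the file first. The helper below is a sketch (it assumes python3 is available on the node):

```shell
# Sketch: validate daemon.json syntax before restarting Docker.
# Assumes python3 is on the PATH.
validate_daemon_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "valid"
  else
    echo "invalid"
  fi
}

# Usage on a real node:
# validate_daemon_json /etc/docker/daemon.json
```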
5 Restart Docker
systemctl restart docker
6 Check the Docker daemon info
docker info
Containers: 52
Running: 26
Paused: 0
Stopped: 26
Images: 35
Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: nvidia runc
Default Runtime: nvidia
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871-dirty (expected: 4fc53a81fb7c994640722ac585fa9ca548971871)
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 251.4GiB
Name: localhost.localdomain
ID: TIHU:Q6IU:7POJ:WGJV:4RDS:RB7A:MDNH:5OTN:XIVW:BUZQ:Y7OI:FOGX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
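The two lines that matter in the output above are "Runtimes: nvidia runc" and "Default Runtime: nvidia". A scripted check can grep for the latter; this is a sketch, with a sample line standing in for real `docker info` output.

```shell
# Sketch: verify that nvidia is the default Docker runtime.
# On a real node:  docker info | check_default_runtime
check_default_runtime() {
  if grep -q 'Default Runtime: nvidia'; then
    echo "ok"
  else
    echo "default runtime is not nvidia"
  fi
}

# Replay the relevant lines from the sample output above.
result=$(printf 'Runtimes: nvidia runc\nDefault Runtime: nvidia\n' | check_default_runtime)
echo "$result"   # ok
```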
7 Test with a CUDA image (in an offline environment, pull the image manually first)
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Output like the following indicates success:
Mon Sep 28 09:05:46 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:3D:00.0 Off | N/A |
| 22% 27C P8 19W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:41:00.0 Off | N/A |
| 22% 26C P8 7W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:45:00.0 Off | N/A |
| 22% 27C P8 1W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:47:00.0 Off | N/A |
| 22% 27C P8 16W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
III. Install the nvidia-device-plugin
1 Prerequisites:
NVIDIA drivers ~= 361.93
nvidia-docker version > 2.0 (see how to install it and its prerequisites)
docker configured with nvidia as the default runtime.
Kubernetes version = 1.11
2 Install from the master node (in an offline environment, download the manifest and image manually first)
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
Because my Kubernetes cluster does not support apiVersion: extensions/v1beta1, and only one node has GPUs, I modified nvidia-device-plugin.yml as follows:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      nodeName: 192.168.10.28
      tolerations:
        # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
        # This, along with the annotation above marks this pod as a critical add-on.
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvidia/k8s-device-plugin:1.11
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
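Pinning with nodeName works here because exactly one node has GPUs. With several GPU nodes, a nodeSelector on a node label is more flexible; the fragment below is a hypothetical alternative (the gpu=true label is an assumption, not something this cluster already has):

```yaml
    spec:
      # Hypothetical alternative to nodeName: label the GPU nodes with
      #   kubectl label node <node-name> gpu=true
      # and let the scheduler pick among them.
      nodeSelector:
        gpu: "true"
```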
Apply the modified nvidia-device-plugin.yml:
kubectl create -f nvidia-device-plugin.yml
3 Check the plugin's pod
kubectl get pod -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-xgd2d 1/1 Running 0 26h
4 Check the GPU node's information
kubectl describe node 192.168.10.28
Name: 192.168.10.28
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/zone=default
kubernetes.io/arch=amd64
kubernetes.io/hostname=192.168.10.28
kubernetes.io/os=linux
os.architecture=amd64
os.name=CentOS_Linux_7_Core
os.version=3.10.0-957.el7.x86_64
Annotations: node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.10.28/24
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.106.86
CreationTimestamp: Fri, 04 Sep 2020 11:27:08 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: 192.168.10.28
AcquireTime: <unset>
RenewTime: Mon, 28 Sep 2020 17:44:33 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sun, 27 Sep 2020 20:03:08 +0800 Sun, 27 Sep 2020 20:03:08 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Mon, 28 Sep 2020 17:42:17 +0800 Fri, 25 Sep 2020 16:45:11 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 28 Sep 2020 17:42:17 +0800 Fri, 25 Sep 2020 16:45:11 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 28 Sep 2020 17:42:17 +0800 Fri, 25 Sep 2020 16:45:11 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 28 Sep 2020 17:42:17 +0800 Fri, 25 Sep 2020 16:45:21 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.10.28
Hostname: 192.168.10.28
Capacity:
cce/eni: 10
cpu: 40
ephemeral-storage: 51175Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263573464Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cce/eni: 10
cpu: 39830m
ephemeral-storage: 48294789041
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 251373528Ki
nvidia.com/gpu: 4
pods: 110
System Info:
Machine ID: b0cc0937e42441da9c493765ae349d5c
System UUID: 0E5D05B4-4D2F-03E4-B211-D21D80451C1B
Boot ID: 599a1158-3afb-4362-ab5e-6dcd126f40b6
Kernel Version: 3.10.0-957.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.3.0
Kubelet Version: v1.17.5
Kube-Proxy Version: v1.17.5
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
cce-monitor cop-addon-log-agent-fluent-bit-mbwdd 200m (0%) 2 (5%) 200Mi (0%) 200Mi (0%) 24d
cce-monitor copaddon-prometheus-node-exporter-qcbn7 200m (0%) 400m (1%) 256Mi (0%) 400Mi (0%) 24d
cce statefulset-jyr4rdwy-0 100m (0%) 1 (2%) 300Mi (0%) 2Gi (0%) 3d
istio-system istio-policy-748f9ff8dc-b8f7c 200m (0%) 1250m (3%) 256Mi (0%) 1Gi (0%) 3d
kube-system calico-node-zbtcq 250m (0%) 1 (2%) 300Mi (0%) 2Gi (0%) 21h
kube-system nvidia-device-plugin-daemonset-xgd2d 100m (0%) 1 (2%) 300Mi (0%) 2Gi (0%) 26h
projectsn1yesim test02-0 100m (0%) 300m (0%) 100Mi (0%) 300Mi (0%) 3d
projectvnajfc8x test006-0 100m (0%) 100m (0%) 100Mi (0%) 100Mi (0%) 3d
projectvnajfc8x test0911-0 100m (0%) 300m (0%) 100Mi (0%) 300Mi (0%) 3d
test01 deployment-laj6ry6y-54696bff5f-zs2m5 200m (0%) 1500m (3%) 428Mi (0%) 2560Mi (1%) 3d
test01 deployment-xuqz01-5b8858cddf-gm6cp 200m (0%) 1500m (3%) 428Mi (0%) 2560Mi (1%) 3d
test01 deployment-xuqz02-v2-v3-6b86846699-g5lch 200m (0%) 1500m (3%) 428Mi (0%) 2560Mi (1%) 3d
test01 deployment-xuqz02-v2-v3-v4-7dbdb5b56-kqdx8 200m (0%) 1500m (3%) 428Mi (0%) 2560Mi (1%) 3d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2150m (5%) 13350m (33%)
memory 3624Mi (1%) 18708Mi (7%)
ephemeral-storage 0 (0%) 0 (0%)
cce/eni 0 0
nvidia.com/gpu 0 0
Events: <none>
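The key lines in the node description are the nvidia.com/gpu entries under Capacity and Allocatable, which show the plugin has registered all 4 GPUs. A scripted check can extract that count; this is a sketch, with a trimmed sample of the describe output standing in for the real command.

```shell
# Sketch: extract the nvidia.com/gpu count from `kubectl describe node`
# output. On a real cluster:
#   kubectl describe node 192.168.10.28 | gpu_capacity
gpu_capacity() {
  awk '/^Capacity:/{c=1} c && /nvidia.com\/gpu:/{print $2; exit}'
}

# Replay a trimmed sample of the output above.
gpus=$(gpu_capacity <<'EOF'
Capacity:
 cpu:                40
 nvidia.com/gpu:     4
Allocatable:
 nvidia.com/gpu:     4
EOF
)
echo "$gpus"   # 4
```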
5 Test
Create a file named cuda-vector-add0.1.yaml with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  nodeName: 192.168.10.28
  containers:
    - name: cuda-vector-add
      image: "tingweiwu/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
Create the cuda-vector-add pod:
kubectl create -f cuda-vector-add0.1.yaml
pod/cuda-vector-add created
Check the pod:
kubectl get pod
NAME READY STATUS RESTARTS AGE
cuda-vector-add 0/1 Completed 0 12s
Check the pod logs:
kubectl logs cuda-vector-add
Expected output:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
To keep the pod's STATUS at Running, modify cuda-vector-add0.1.yaml as follows:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  nodeName: 192.168.10.28
  containers:
    - name: cuda-vector-add
      image: "tingweiwu/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
      command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]
If you log into the container (kubectl exec -it cuda-vector-add bash) and running nvidia-smi fails with "Failed to initialize NVML: Unknown Error",
modify cuda-vector-add0.1.yaml as follows:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  nodeName: 192.168.10.28
  containers:
    - name: cuda-vector-add
      image: "tingweiwu/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
      securityContext:
        privileged: true
      command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]