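During a routine operations task, we found that a customer's jobs hung while loading a dataset. The physical machine has four GPUs in total, and training jobs run on k8s with exclusive use of a GPU. Three of the GPUs hung while their training jobs were loading the dataset, and dmesg showed errors at the same time (Xid 31).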

[Tue Apr 13 09:45:31 2021] NVRM: Xid (PCI:0000:3b:00): 31, pid=3659, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fe1_d6f54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

[Wed Mar 31 13:21:35 2021] NVRM: Xid (PCI:0000:af:00): 31, pid=30947, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fb3_5ef54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

[Sat Mar 20 02:14:25 2021] NVRM: Xid (PCI:0000:86:00): 31, pid=8432, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fdc_16f54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

The fourth GPU, 00000000:18:00.0, was healthy.
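
To see at a glance which GPUs and processes are affected, the Xid lines can be reduced to bus ID, Xid code, and pid. This is a minimal sketch that assumes the log format shown in the dmesg output above; the function name `xid_summary` is hypothetical, and other driver versions may print the pid field differently.

```shell
# xid_summary: hypothetical helper. Reads kernel log lines on stdin and
# prints "<pci-bus-id> <xid-code> <pid>" for every NVRM Xid line it finds.
# Lines that do not match the assumed format are silently skipped.
xid_summary() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\), pid=\([0-9]*\),.*/\1 \2 \3/p'
}

# Usage on a live host (needs permission to read the kernel log):
#   dmesg -T | xid_summary
```

Matching the printed bus IDs against the `nvidia-smi` device list shows which physical cards reported the fault.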


According to NVIDIA's official Xid documentation, a driver problem is one possible cause of this error:
https://docs.nvidia.com/deploy/xid-errors/index.html

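So we upgraded the GPU driver.

Original driver: NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1

Delete the GPU jobs.

Reboot the physical machine.

Download the latest driver (460.67 at the time): wget https://us.download.nvidia.com/XFree86/Linux-x86_64/460.67/NVIDIA-Linux-x86_64-460.67.run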


service lightdm stop
systemctl stop gpu-share
Uninstall the old driver: nvidia-uninstall
chmod a+x NVIDIA-Linux-x86_64-460.67.run
./NVIDIA-Linux-x86_64-460.67.run --no-x-check --no-nouveau-check --no-opengl-files

Some prompts that appear during installation:


The distribution-provided pre-install script failed! Are you sure you want to continue?
Choose Yes to continue.

A prompt asking, roughly, whether to install NVIDIA's 32-bit compatibility libraries.
Choose No to continue.

Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.
Choose Yes to continue.
Reboot.
Verify:
nvidia-smi
Kernel version: 4.15.0-45-generic
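
The verification step can also be scripted by comparing the reported driver version with the one just installed. This is a rough sketch only: the helper name `driver_at_least` is hypothetical, and it assumes GNU `sort` with the `-V` (version sort) option is available.

```shell
# driver_at_least: hypothetical helper. Succeeds (exit 0) when version $1
# is greater than or equal to version $2, using GNU sort's version ordering.
driver_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a live host, one line per GPU:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader |
#   while read -r v; do
#       if driver_at_least "$v" 460.67; then echo "ok $v"; else echo "old $v"; fi
#   done
```

After the reboot it is also worth re-checking dmesg to confirm that no new Xid 31 lines appear once training jobs are running again.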
