Python hangs while loading a dataset; dmesg reports NVIDIA Xid 31
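During an operations call, a customer reported that dataset loading would hang. The physical machine has 4 GPUs, and training jobs run under k8s with each job holding a GPU exclusively. On three of the GPUs the training job hung while loading the dataset, and dmesg showed Xid 31 errors for those cards: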
[Tue Apr 13 09:45:31 2021] NVRM: Xid (PCI:0000:3b:00): 31, pid=3659, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fe1_d6f54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[Wed Mar 31 13:21:35 2021] NVRM: Xid (PCI:0000:af:00): 31, pid=30947, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fb3_5ef54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[Sat Mar 20 02:14:25 2021] NVRM: Xid (PCI:0000:86:00): 31, pid=8432, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fdc_16f54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
The remaining GPU, 00000000:18:00.0, was working normally.
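To see at a glance which cards are affected, the Xid lines can be parsed and grouped by PCI address. The following is a minimal sketch that assumes the line layout shown in the three log entries above; the helper names `parse_xid_line` and `affected_gpus` are made up for this illustration, and other Xid codes or driver versions may format their lines differently.

```python
import re
from collections import Counter

# Matches the common prefix of an NVRM Xid line as it appears in the logs above.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)
# Matches the MMU fault details that follow in an Xid 31 line.
FAULT_RE = re.compile(
    r"faulted @ (?P<addr>0x[0-9a-fA-F_]+)\. "
    r"Fault is of type (?P<kind>\S+) (?P<access>\S+)"
)

def parse_xid_line(line):
    """Return a dict describing one Xid line, or None if the line is not an Xid line."""
    m = XID_RE.search(line)
    if m is None:
        return None
    event = {"pci": m["pci"], "xid": int(m["xid"]), "pid": int(m["pid"])}
    f = FAULT_RE.search(line)
    if f is not None:
        event.update(fault_addr=f["addr"], fault_type=f["kind"], access_type=f["access"])
    return event

def affected_gpus(lines):
    """Count Xid events per PCI address."""
    events = (parse_xid_line(line) for line in lines)
    return Counter(e["pci"] for e in events if e is not None)

# The three lines from the dmesg output above.
SAMPLE = [
    "[Tue Apr 13 09:45:31 2021] NVRM: Xid (PCI:0000:3b:00): 31, pid=3659, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fe1_d6f54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ",
    "[Wed Mar 31 13:21:35 2021] NVRM: Xid (PCI:0000:af:00): 31, pid=30947, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fb3_5ef54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ",
    "[Sat Mar 20 02:14:25 2021] NVRM: Xid (PCI:0000:86:00): 31, pid=8432, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fdc_16f54000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ",
]

for pci, count in sorted(affected_gpus(SAMPLE).items()):
    print(pci, count)
```

Feeding it the full `dmesg` output instead of `SAMPLE` would show whether the same three cards keep faulting, and the `pid` field can be used to look up the process that triggered the fault.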
According to NVIDIA's official Xid documentation, a driver problem is one of the possible causes of this error:
https://docs.nvidia.com/deploy/xid-errors/index.html
So I updated the GPU driver.
Original driver: NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1
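Before upgrading, it can help to confirm programmatically that the installed driver really is older than the target. This is a small sketch that assumes the `nvidia-smi` header format quoted above; `driver_version` is a hypothetical helper name, and versions are compared as integer tuples so that 455.23.04 sorts before 460.67.

```python
import re

def driver_version(header):
    """Extract the driver version from an nvidia-smi header line as a tuple of ints."""
    m = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", header)
    if m is None:
        raise ValueError("no driver version found in header")
    return tuple(int(part) for part in m.group(1).split("."))

# Header line of the original driver, as quoted above.
HEADER = "NVIDIA-SMI 455.23.04 Driver Version: 455.23.04 CUDA Version: 11.1"
TARGET = (460, 67)  # the version downloaded in the steps that follow

current = driver_version(HEADER)
print(current, "needs upgrade:", current < TARGET)
```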
Delete the GPU jobs
Reboot the physical machine
Download the latest driver:
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/460.67/NVIDIA-Linux-x86_64-460.67.run
service lightdm stop
systemctl stop gpu-share
Uninstall the original driver: nvidia-uninstall
chmod a+x NVIDIA-Linux-x86_64-460.67.run
./NVIDIA-Linux-x86_64-460.67.run --no-x-check --no-nouveau-check --no-opengl-files
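When the same upgrade has to be repeated on several hosts, the commands above can be generated from the version number. The sketch below only builds and prints the command list and does not execute anything. `driver_url` and `install_plan` are hypothetical helpers, the URL pattern is inferred from the single download link above, and the `lightdm` and `gpu-share` services are specific to this machine.

```python
import shlex

# URL pattern inferred from the download link used above (an assumption for other versions).
URL_TEMPLATE = (
    "https://us.download.nvidia.com/XFree86/Linux-x86_64/{v}/NVIDIA-Linux-x86_64-{v}.run"
)

def driver_url(version):
    """Build the installer download URL for a driver version string."""
    return URL_TEMPLATE.format(v=version)

def install_plan(version):
    """Return the upgrade steps from this post as a list of argv lists, in order."""
    installer = f"NVIDIA-Linux-x86_64-{version}.run"
    return [
        ["wget", driver_url(version)],
        ["service", "lightdm", "stop"],      # stop the display manager
        ["systemctl", "stop", "gpu-share"],  # site-specific service on this host
        ["nvidia-uninstall"],                # remove the old driver
        ["chmod", "a+x", installer],
        [f"./{installer}", "--no-x-check", "--no-nouveau-check", "--no-opengl-files"],
    ]

# Dry run: print the commands for review instead of executing them.
for cmd in install_plan("460.67"):
    print(shlex.join(cmd))
```

Printing the plan first keeps the destructive steps (stopping services, uninstalling the driver) under manual control, which seems the safer default on a production GPU node.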
Some of the prompts during installation:
The distribution-provided pre-install script failed! Are you sure you want to continue?
Choose Yes to continue.
The next prompt asks, roughly, whether to install NVIDIA's 32-bit compatibility libraries.
Choose No to continue.
Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up.
Choose Yes to continue.
Reboot
Verify:
nvidia-smi
Kernel version: 4.15.0-45-generic
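The verification step can also be scripted so that every GPU is checked against the expected driver version. This is a minimal sketch built on `nvidia-smi --query-gpu=pci.bus_id,driver_version --format=csv,noheader`; the function names are hypothetical, and the parser assumes one comma-separated row per GPU.

```python
import subprocess

def query_gpus():
    """Run nvidia-smi and return its raw CSV output (requires the driver to be loaded)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def parse_query_output(text):
    """Turn 'bus_id, version' rows into a list of (bus_id, version) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        bus_id, version = (field.strip() for field in line.split(",", 1))
        rows.append((bus_id, version))
    return rows

def all_on_version(rows, expected):
    """True when at least one GPU is listed and all report the expected driver version."""
    return bool(rows) and all(version == expected for _, version in rows)
```

On the host, `all_on_version(parse_query_output(query_gpus()), "460.67")` should return `True` once all four cards report the new driver. Checking `dmesg` again after rerunning the training job is still needed to confirm that the Xid 31 errors are gone.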