Nvidia Docker: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH (a nasty pitfall)

1. Background

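I built a Docker image on my own little laptop (don't ask why I didn't set up the environment directly on the server; my phone's hotspot has already cried itself out). The tests passed on the laptop, so I packaged the image and shipped it to the server. Running the code there failed with the following error: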

2020-04-26 09:23:02.639483: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-04-26 09:23:02.640155: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
2020-04-26 09:23:02.640194: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 2d6362a3620d
2020-04-26 09:23:02.640211: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 2d6362a3620d
2020-04-26 09:23:02.640297: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.33.1
2020-04-26 09:23:02.640327: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 430.14.0
2020-04-26 09:23:02.640341: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 430.14.0 does not match DSO version 440.33.1 -- cannot find working devices in this configuration
2020-04-26 09:23:02.640601: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-04-26 09:23:02.645703: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2808000000 Hz
2020-04-26 09:23:02.645989: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562918084c60 executing computations on platform Host. Devices:
2020-04-26 09:23:02.646009: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
WARNING:tensorflow:From /root/anaconda3/envs/tensorflow1.14_gpu/lib/python3.7/site-packages/tensorflow/python/util/tf_should_use.py:193: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be remo

Don't let the wall of log output make you dizzy. Here are the key lines pulled out.

The first problem is that CUDA failed to initialize, and the reason given is a system driver mismatch:

failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination

So what exactly doesn't match?

 libcuda reported version is: 440.33.1
 kernel reported version is: 430.14.0
 kernel version 430.14.0 does not match DSO version 440.33.1 -- cannot find working devices in this configuration
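To spot this mismatch without reading the whole log by eye, here is a minimal Python sketch. It assumes the two diagnostic lines keep exactly the wording shown in the log above; the function names are my own and not part of TensorFlow.

```python
import re

# Patterns for the two diagnostic lines, copied from the log wording above.
_LIBCUDA_RE = re.compile(r"libcuda reported version is:\s*(\d+(?:\.\d+)*)")
_KERNEL_RE = re.compile(r"kernel reported version is:\s*(\d+(?:\.\d+)*)")


def parse_version(text):
    """Turn '440.33.1' into (440, 33, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_driver_versions(log_text):
    """Return (libcuda_version, kernel_version); each is None if not found."""
    lib = _LIBCUDA_RE.search(log_text)
    ker = _KERNEL_RE.search(log_text)
    return (
        parse_version(lib.group(1)) if lib else None,
        parse_version(ker.group(1)) if ker else None,
    )


def has_driver_mismatch(log_text):
    """True only when both versions were found and they differ."""
    lib, ker = find_driver_versions(log_text)
    return lib is not None and ker is not None and lib != ker


if __name__ == "__main__":
    sample = (
        "libcuda reported version is: 440.33.1\n"
        "kernel reported version is: 430.14.0\n"
    )
    print(find_driver_versions(sample))
    print(has_driver_mismatch(sample))
```

Feeding it the captured stderr of a training run is enough to tell whether the user-space libcuda and the kernel module disagree.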

As a result, TensorFlow could not use the server's GPU and ran the program on the CPU instead. So what was the point of using the server at all…

2. Solution

Once the problem is narrowed down to the driver, the rest is easy. Find the CUDA driver library files, which live in /usr/lib/x86_64-linux-gnu:

ls -l libcuda.so*

To my surprise, there were two versions of libcuda.so, one for 440 and one for 430. Running nvidia-smi showed that the installed driver is version 430.
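The same inspection can be scripted. This is a rough sketch that lists every libcuda.so* entry in a directory together with its symlink target; the helper name list_libcuda is made up for illustration, and the default directory is the one mentioned above.

```python
import os


def list_libcuda(lib_dir="/usr/lib/x86_64-linux-gnu"):
    """Return sorted (name, link_target_or_None) pairs for libcuda.so* in lib_dir."""
    entries = []
    for name in sorted(os.listdir(lib_dir)):
        if not name.startswith("libcuda.so"):
            continue
        path = os.path.join(lib_dir, name)
        # Symlinks report where they point; regular files report None.
        target = os.readlink(path) if os.path.islink(path) else None
        entries.append((name, target))
    return entries


if __name__ == "__main__":
    default_dir = "/usr/lib/x86_64-linux-gnu"
    if os.path.isdir(default_dir):
        for name, target in list_libcuda(default_dir):
            print(name, "->", target if target else "(regular file)")
```

Any link whose target carries a different version than the one nvidia-smi reports is a candidate for the fix.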

After that, just re-point the symlinks that reference 440 to the 430 library, and the problem is solved!
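Re-pointing a link can be sketched in Python as below. The file names in the demo (libcuda.so.1, libcuda.so.430.14, libcuda.so.440.33.01) are assumptions based on the versions in the log, since the original listing was only a screenshot; check the real names on your system first. The demo works in a temporary directory so nothing on the system is touched.

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Make link_path a symlink to new_target, replacing any existing link."""
    if os.path.exists(link_path) and not os.path.islink(link_path):
        raise ValueError(f"{link_path} is not a symlink, refusing to replace it")
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(new_target, tmp_path)
    # Renaming the new link over the old one swaps it in a single step.
    os.replace(tmp_path, link_path)


if __name__ == "__main__":
    # Demo with hypothetical file names in a scratch directory.
    demo_dir = tempfile.mkdtemp()
    for name in ("libcuda.so.430.14", "libcuda.so.440.33.01"):
        open(os.path.join(demo_dir, name), "w").close()
    link = os.path.join(demo_dir, "libcuda.so.1")
    os.symlink("libcuda.so.440.33.01", link)
    repoint_symlink(link, "libcuda.so.430.14")
    print(os.readlink(link))
```

On a real system the call would target the link inside /usr/lib/x86_64-linux-gnu and needs root permissions.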

3. Cause

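The root cause is that my laptop has the 440 driver installed, so the image built on it has symlinks that all point to the 440 version, whereas the server's driver is version 430. That is what produced the error above. Fortunately the problem is solved, the program is running fast, and I'm quietly going back to writing code.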
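A pre-flight check could catch this before any training job starts. The sketch below compares the kernel module version against the versioned libcuda file names. It assumes the text of /proc/driver/nvidia/version contains a line shaped like the one in the comment; the function names are hypothetical.

```python
import re

# Assumed shape of the first line of /proc/driver/nvidia/version, e.g.:
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  430.14  <build date>
_NVRM_RE = re.compile(r"NVRM version:.*?\s(\d+(?:\.\d+)+)\s")
# Versioned library files such as libcuda.so.440.33.01 (not libcuda.so.1).
_LIB_RE = re.compile(r"^libcuda\.so\.(\d+(?:\.\d+)+)$")


def as_tuple(version):
    """'440.33.01' -> (440, 33, 1)."""
    return tuple(int(part) for part in version.split("."))


def kernel_module_version(proc_text):
    """Extract the kernel module version, or None if the line is missing."""
    match = _NVRM_RE.search(proc_text)
    return as_tuple(match.group(1)) if match else None


def stale_libcuda(proc_text, file_names):
    """Return libcuda file names whose version differs from the kernel module."""
    kernel = kernel_module_version(proc_text)
    if kernel is None:
        return []
    stale = []
    for name in file_names:
        match = _LIB_RE.match(name)
        if match and as_tuple(match.group(1)) != kernel:
            stale.append(name)
    return stale


if __name__ == "__main__":
    proc = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  430.14  (build date)\n"
    names = ["libcuda.so", "libcuda.so.1", "libcuda.so.430.14", "libcuda.so.440.33.01"]
    print(stale_libcuda(proc, names))
```

A non-empty result means the image carries a libcuda built for a different driver than the one the host kernel is running, which is exactly the situation described above.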

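A bit of a rant: keeping the NVIDIA driver, CUDA, and cuDNN versions in line is a real pain. Some servers run very old drivers that newer CUDA versions don't support, so many newer frameworks can't be used either. It's time to give the servers an overhaul…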
