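The host runs Ubuntu 20.04 with a GTX 1060, and the NVIDIA driver version is 510.85.02.

The environment being installed is TensorRT 8.4.

After installation, any call into the CUDA environment fails with: Error 804: forward compatibility was attempted on non supported HW.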

Investigating the cause

Before using Docker (version >= 19.03) on a Linux host, make sure that nvidia-container-runtime and nvidia-container-toolkit are installed:

sudo apt-get install nvidia-container-runtime nvidia-container-toolkit

Also make sure that nvidia-container-runtime-hook can be found in one of the directories listed in the PATH environment variable:

:~$ which nvidia-container-runtime-hook
/usr/bin/nvidia-container-runtime-hook

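A first look at CUDA

Since the error appears as soon as CUDA is initialized, Gemfield sets PyTorch aside for now and writes a minimal C program in the current Docker environment that initializes the CUDA device directly, to see whether it fails as well.

Code: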

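#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int device = 0;          /* query the first GPU */
    int gpuDeviceCount = 0;
    struct cudaDeviceProp properties;

    /* First CUDA runtime call; it returns the initialization error, if any. */
    cudaError_t cudaResultCode = cudaGetDeviceCount(&gpuDeviceCount);

    if (cudaResultCode == cudaSuccess) {
        cudaGetDeviceProperties(&properties, device);
        printf("%d CUDA device(s) (compute capability major: %d)\n", gpuDeviceCount, properties.major);
        printf("\t Product Name: %s\n", properties.name);
        /* totalGlobalMem is a size_t in bytes; note that 1024^2 would be XOR, not a power. */
        printf("\t TotalGlobalMem: %zu MB\n", properties.totalGlobalMem / (1024 * 1024));
        printf("\t Multiprocessor count: %d\n", properties.multiProcessorCount);
        /* concurrentKernels is a flag: 1 if the device can run kernels concurrently. */
        printf("\t Concurrent kernels supported: %d\n", properties.concurrentKernels);
        return 0;
    }

    /* Initialization failed: print the numeric CUDA error code. */
    printf("\t gemfield error: %d\n", cudaResultCode);
    return 1;
}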

Compile and run:

g++ -I/usr/local/cuda-11.2/targets/x86_64-linux/include/ gemfield.cpp -o gemfield -L/usr/local/cuda-11.2/targets/x86_64-linux/lib/ -lcudart
~# ./gemfield
         gemfield error: 804

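The message "Error 804: forward compatibility was attempted on non supported HW" means that your hardware does not support forward compatibility. CUDA forward compatibility lets a newer CUDA runtime run on top of an older driver through a compatibility package, and NVIDIA supports it mainly on data center GPUs, not on consumer GeForce cards such as the GTX 1060.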

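Solution

The fix is simple: update the NVIDIA driver on the host to the same version as the one in the image, then reinstall nvidia-container-runtime and nvidia-container-toolkit.

For driver installation, see the CSDN blog post "Environment Setup 01: How to Check GPU Information and Install the NVIDIA Driver on Ubuntu" by 命名无能.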

References

PyTorch CUDA error: Error 804: forward compatibility was attempted on non supported HW (Zhihu)

If this post infringes any rights, please contact me to have it removed.
