1. Problem description:

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

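Today I ran into a strange problem: within the same virtual environment, a PyTorch program runs normally when launched from PyCharm, but fails with the error above when started from the command line.

What makes it stranger is that on another server, with the same environment and the same PyTorch version, the program runs normally.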
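Since the only difference described here is how the program is launched, one way to narrow things down is to compare the process environment in both launchers. The following is a minimal sketch, not something done in the original investigation; the list of variables in `ENV_KEYS` is an assumption about what might influence which CUDA libraries get picked up.

```python
import json
import os
import sys

# Variables that commonly influence which CUDA libraries a process loads.
# This list is an assumption for illustration, not taken from the original post.
ENV_KEYS = (
    "LD_LIBRARY_PATH",
    "LD_PRELOAD",
    "PATH",
    "CUDA_HOME",
    "CUDA_VISIBLE_DEVICES",
    "CONDA_PREFIX",
)


def env_snapshot():
    """Collect the interpreter path and the selected environment variables."""
    snapshot = {"sys.executable": sys.executable}
    for key in ENV_KEYS:
        snapshot[key] = os.environ.get(key, "<unset>")
    return snapshot


if __name__ == "__main__":
    # Run once from PyCharm and once from the terminal, then diff the outputs.
    print(json.dumps(env_snapshot(), indent=2, sort_keys=True))
```

If the two outputs differ, for example in `LD_LIBRARY_PATH`, that difference is a candidate explanation worth testing, though it does not by itself prove the cause.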


2. Full error details

Traceback (most recent call last):
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/aigc/workspace/CFM/code_mp_pipeline.py", line 292, in code_former_core
    num_det_faces = face_helper.get_face_landmarks_5(
  File "/data/aigc/workspace/CFM/facelib/utils/face_restoration_helper.py", line 155, in get_face_landmarks_5
    bboxes = self.face_det.detect_faces(input_img)
  File "/data/aigc/workspace/CFM/facelib/detection/retinaface/retinaface.py", line 211, in detect_faces
    loc, conf, landmarks, priors = self.__detect_faces(image)
  File "/data/aigc/workspace/CFM/facelib/detection/retinaface/retinaface.py", line 158, in __detect_faces
    loc, conf, landmarks = self(inputs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/aigc/workspace/CFM/facelib/detection/retinaface/retinaface.py", line 123, in forward
    out = self.body(inputs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torchvision/models/_utils.py", line 69, in forward
    x = module(x)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/dell/anaconda3/envs/CFM/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

3. Cause analysis:

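My first step was to ask ChatGPT, but its suggestions did not solve the problem.

I then asked a colleague, who said that since this is a cuBLAS error, I could try uninstalling the cuBLAS-related packages from the virtual environment. That worked.

Concretely, I uninstalled nvidia-cublas-cu11 (version 11.10.3.66). I still have not figured out the underlying cause.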
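Because the underlying cause remains unknown, it may help to check which cuBLAS shared library a process actually has loaded, before and after the uninstall. The sketch below is an illustration under stated assumptions: it works only on Linux (it reads `/proc/self/maps`), and the helper names `find_loaded_libs` and `loaded_cublas` are hypothetical, not part of PyTorch or CUDA.

```python
import os


def find_loaded_libs(maps_text, needle="cublas"):
    """Return sorted, de-duplicated file paths from a /proc/<pid>/maps dump
    whose file name contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if needle in os.path.basename(path):
            paths.add(path)
    return sorted(paths)


def loaded_cublas():
    """Linux only: list cuBLAS libraries mapped into the current process."""
    with open("/proc/self/maps") as f:
        return find_loaded_libs(f.read())


if __name__ == "__main__":
    # Import torch (and ideally run a small CUDA operation) before calling
    # this, otherwise no cuBLAS library may be mapped yet.
    for path in loaded_cublas():
        print(path)
```

Running this in the failing and the working setup and comparing the printed paths would show whether the two processes use different cuBLAS builds, for instance one from the virtual environment's site-packages and one from a system CUDA installation.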


 

4. Solution:

pip uninstall nvidia-cublas-cu11
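To confirm that the package is really gone from the environment the program runs in, a small check like the following can be run with that environment's interpreter. This is a minimal sketch; `installed_version` is a hypothetical helper name, and it relies only on the standard-library `importlib.metadata` module (available since Python 3.8, which matches the interpreter shown in the traceback).

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    name = "nvidia-cublas-cu11"
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```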
