Ubuntu16.04 + NVIDIA RTX3090 + Pytorch + Tensorflow
Overview
This post records how to install the display driver and the CUDA compute toolkit for an RTX 3090 graphics card. All installers used here are .run files.
Useful Links
Graphics Driver Installation
File Downloads
- CUDA: download the installer cuda_11.1.0_455.23.05_linux.run from here
- Driver: download the installer NVIDIA-Linux-x86_64-455.45.01.run or NVIDIA-Linux-x86_64-455.23.04.run from here
Installing the display driver and CUDA toolkit in one go
Follow the steps in Ubuntu 16.04 LTS + CUDA8.0 + cudnn6.0 (an earlier post).
After installation, reboot, change into the /usr/local/cuda-11.1/samples/1_Utilities/deviceQuery directory, run sudo make, and then run ./deviceQuery to inspect the devices and driver:
/usr/local/cuda-11.1/samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "GeForce RTX 3090"
CUDA Driver Version / Runtime Version 11.1 / 11.1
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24265 MBytes (25443893248 bytes)
(82) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 11.1 / 11.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11178 MBytes (11721506816 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 98304 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce RTX 3090 (GPU0) -> GeForce GTX 1080 Ti (GPU1) : No
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce RTX 3090 (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 11.1, NumDevs = 2
Result = PASS
Note: if there is no driver installed beforehand, or the existing one does not support the 3090, one option is to install the driver before physically installing the 3090; the other is to install with the 3090 already in the machine, but the latter produces the error shown below.
In that case, press Ctrl+Alt+F2 to switch to a text console. Do not use Ctrl+Alt+F1; otherwise sudo service lightdm stop may fail to shut the X server down completely, and the display-driver installation will then report the following error:
[INFO]: Initializing menu
[INFO]: Silent install option: skipping toolkit
[INFO]: Silent install option: skipping samples
[INFO]: Silent install option: skipping toolkit
[INFO]: Silent install option: skipping toolkit
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 455.23.05
[INFO]: Executing NVIDIA-Linux-x86_64-455.23.05.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 455.23.05 failed, quitting
Installing only the display driver
The driver can be installed from the downloaded CUDA toolkit package (e.g. cuda_11.1.0_455.23.05_linux.run), or from a separately downloaded driver installer (e.g. NVIDIA-Linux-x86_64-455.45.01.run).
ERROR: You appear to be running an X server; please exit X before installing. For
further details, please see the section INSTALLING THE NVIDIA DRIVER in the
README available on the Linux driver download page at www.nvidia.com.
This message means the X server was not fully shut down. Checking with ps aux | grep X does indeed show the related processes. If sudo service lightdm stop, sudo /etc/init.d/lightdm stop, or sudo /etc/init.d/gdm stop (for a GDM desktop) cannot shut the X server down completely, then right after boot, when the driver-error dialog pops up as the desktop is about to load, press Ctrl+Alt+F2 (not Ctrl+Alt+F1) to switch to a text console and rerun the installation.
Installing only the CUDA toolkit
Follow the steps in Ubuntu 16.04 LTS + CUDA8.0 + cudnn6.0, but select only the CUDA library components during installation.
Installing PyTorch
Installing PyTorch 1.7
Note: at the time of writing (2021.1.6) the conda channels do not yet provide a PyTorch build for cuda11.1, so passing cudatoolkit=11.1 to the command below would pull in the CPU-only PyTorch; use cudatoolkit=11.0 instead.
conda create -n rtx3090  # create a new environment named rtx3090
conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
The command prints something like the following:
The following packages will be downloaded:
package | build
---------------------------|-----------------
_libgcc_mutex-0.1 | conda_forge 3 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates-2020.12.5 | ha878542_0 137 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi-2020.12.5 | py36h5fab9bb_0 143 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
cudatoolkit-11.0.3 | h15472ef_6 952.1 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
dataclasses-0.7 | pyhe4b4509_6 21 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
freetype-2.8.1 | hfa320df_1 789 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ld_impl_linux-64-2.35.1 | hea4e1c9_1 617 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi-3.3 | h58526e2_2 51 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng-9.3.0 | h5dbcf3e_17 7.8 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libpng-1.6.37 | h21135ba_2 306 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libstdcxx-ng-9.3.0 | h2ae2ef3_17 4.0 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libtiff-4.0.9 | he6b73bb_1 521 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuv-1.40.0 | h7f98852_0 1.0 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
llvm-openmp-11.0.0 | hfc4b9b4_1 2.8 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
mkl-2020.4 | h726a3e6_304 215.6 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
mkl-service-2.3.0 | py36h8c4c3a4_2 54 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
mkl_fft-1.2.0 | py36h68bb277_1 164 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
mkl_random-1.2.0 | py36h7c3b610_1 314 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ninja-1.10.2 | h4bd325d_0 2.4 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy-1.19.2 | py36h54aff64_0 21 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
numpy-base-1.19.2 | py36hfa32c7d_0 5.2 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
olefile-0.46 | pyh9f0ad1d_1 32 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
openssl-1.1.1i | h7f98852_0 2.1 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pillow-5.2.0 | py36_0 1007 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pip-20.3.3 | pyhd8ed1ab_0 1.1 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-3.6.12 |hffdb5ce_0_cpython 38.4 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pytorch-1.7.1 |py3.6_cuda11.0.221_cudnn8.0.5_0 770.6 MB pytorch
setuptools-49.6.0 | py36h9880bd3_2 947 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
sqlite-3.34.0 | h74cdb3f_0 1.4 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tk-8.6.10 | h21135ba_1 3.2 MB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
torchaudio-0.7.2 | py36 9.8 MB pytorch
torchvision-0.8.2 | py36_cu110 17.9 MB pytorch
typing_extensions-3.7.4.3 | py_0 25 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
wheel-0.36.2 | pyhd3deb0d_0 31 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xz-5.2.5 | h516909a_1 343 KB http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
After the installation finishes, start a Python interpreter and run the following commands to verify that everything works:
import torch
torch.__version__
torch.cuda.is_available()
torch.cuda.get_device_name(0)
torch.cuda.get_device_name(1)
My native (system) Python environment is configured with pytorch1.6+cuda10.1, and the rtx3090 environment created under Anaconda with pytorch1.7.1+cuda11.0. Inspecting both environments with the commands above gives:
$ cuda10
$ python
Python 3.6.11 (default, Jun 29 2020, 05:15:03)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.6.0+cu101'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3090'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1080 Ti'
>>>
$ cuda11
switch to cuda 11!
$ inconda rtx3090
Switch to rtx3090
(rtx3090) -----$ python
Python 3.6.12 | packaged by conda-forge | (default, Dec 9 2020, 00:36:02)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.7.1'
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3090'
>>> torch.cuda.get_device_name(1)
'GeForce GTX 1080 Ti'
>>>
Installing PyTorch 1.8 from source
Following the official instructions, first clone the pytorch and torchvision sources:
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive
Create a new conda environment named pytorch18:
conda create -n pytorch18 python=3.7.9  # create a new environment named pytorch18
Install the common dependencies:
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda111 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
Build and install with:
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install
Then comes a long wait while everything compiles.
If the build stops with Parse error. Expected a command name, got unquoted argument with text, the encoding of a CMakeLists.txt in the pytorch tree is probably wrong; this can happen when the sources are copied from Windows to Ubuntu. Re-encoding the file as UTF-8 fixes it.
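As a rough illustration (not part of the original post), the sketch below rewrites such a file as plain UTF-8. The helper name and the list of candidate source encodings are assumptions and may need adjusting for your file:
import pathlib

def reencode_to_utf8(path, candidates=("utf-8-sig", "gbk", "latin-1")):
    # Try a few likely encodings (assumed) and rewrite the file as plain UTF-8.
    raw = pathlib.Path(path).read_bytes()
    for enc in candidates:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("could not decode %s with the assumed encodings" % path)
    pathlib.Path(path).write_text(text, encoding="utf-8")

# Example (hypothetical path): reencode_to_utf8("pytorch/CMakeLists.txt")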
If the cuDNN version cannot be detected (Found cuDNN: v?), as in the log below, something went wrong with the cuDNN installation; recheck it against the official instructions:
-- Caffe2: CUDA detected: 11.1
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.1
-- Found cuDNN: v? (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
CMake Error at
..... /public/cuda.cmake:... (message):
PyTorch requires cuDNN 7 and above.
Installing torchvision from source
git clone --recursive https://github.com/pytorch/vision.git
cd vision
python setup.py install
RTX 3090 performance issues
Deep learning benchmarks
- RTX 3090 Benchmarks for Deep Learning – NVIDIA RTX 3090 vs 2080 Ti vs TITAN RTX vs RTX 6000/8000
- Titan RTX vs RTX 3090 Transformer Benchmarks, Pytorch
- Convolution operations are extremely slow on RTX 30 series GPU
Test results with PyTorch
Different convolution types
This test measures 1-D convolution, 2-D convolution, and 1-D-style convolution implemented with Conv2d, under different settings of benchmark and deterministic. Only the forward pass is timed (no backward pass). The test code is as follows:
import torch
import torch.nn as nn
import time
device = 'cuda:0'
device = 'cuda:1'
niters = 1000
print("Torch version: ", torch.__version__)
print("Torch CUDA version: ", torch.version.cuda)
print("CUDNN Version: ", torch.backends.cudnn.version())
print(torch.cuda.get_device_name(int(device[-1])))
def profile(model, x, benchmark, deterministic, nb_iters):
    torch.backends.cudnn.benchmark = benchmark
    torch.backends.cudnn.deterministic = deterministic
    # warmup
    for _ in range(10):
        out = model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(nb_iters):
        out = model(x)
    torch.cuda.synchronize()
    t1 = time.time()
    return (t1 - t0) / nb_iters
model1 = nn.Sequential(
nn.Conv1d(24, 256, kernel_size=(12,), stride=(6,), groups=4),
nn.ReLU(),
nn.Conv1d(256, 256, kernel_size=(6,), stride=(3,), padding=(2,), groups=4),
nn.ReLU(),
nn.Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,), groups=4),
nn.ReLU(),
)
model1.to(device=device)
x = torch.randn(64, 24, 224, device=device)
time0 = profile(model1, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv1d model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model1, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv1d model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model1, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv1d model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model1, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv1d model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))
model2 = nn.Sequential(
nn.Conv2d(8, 32, kernel_size=(8, 8), stride=(4, 4)),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1)),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
nn.ReLU()
)
model2.to(device=device)
x = torch.randn(64, 8, 224, 224, device=device)
time0 = profile(model2, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv2d model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model2, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv2d model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model2, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv2d model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model2, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv2d model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))
model3 = nn.Sequential(
nn.Conv2d(8, 32, kernel_size=(8, 1), stride=(4, 1)),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=(4, 1), stride=(2, 1), padding=(1, 1)),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=(3, 1), stride=(1, 1), padding=(1, 1)),
nn.ReLU()
)
model3.to(device=device)
x = torch.randn(64, 8, 224, 224, device=device)
time0 = profile(model3, x, benchmark=False, deterministic=False, nb_iters=niters)
print('Conv2d1 model, benchmark=False, deterministic=False, {:.3f}ms/iter'.format(time0*1000))
time1 = profile(model3, x, benchmark=True, deterministic=False, nb_iters=niters)
print('Conv2d1 model, benchmark=True, deterministic=False, {:.3f}ms/iter'.format(time1*1000))
time2 = profile(model3, x, benchmark=False, deterministic=True, nb_iters=niters)
print('Conv2d1 model, benchmark=False, deterministic=True, {:.3f}ms/iter'.format(time2*1000))
time3 = profile(model3, x, benchmark=True, deterministic=True, nb_iters=niters)
print('Conv2d1 model, benchmark=True, deterministic=True, {:.3f}ms/iter'.format(time3*1000))
The results are shown below. For 2-D convolution the 3090 is nearly twice as fast as the 1080 Ti, while for 1-D convolution the improvement is small; the Torch version also has some effect on performance:
Torch version: 1.8.0.dev20210106+cu110
Torch CUDA version: 11.0
CUDNN Version: 8005
GeForce RTX 3090
Conv1d model, benchmark=False, deterministic=False, 0.687ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.511ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.540ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.484ms/iter
Conv2d model, benchmark=False, deterministic=False, 1.327ms/iter
Conv2d model, benchmark=True, deterministic=False, 1.335ms/iter
Conv2d model, benchmark=False, deterministic=True, 1.474ms/iter
Conv2d model, benchmark=True, deterministic=True, 1.480ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 3.278ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 3.280ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 3.286ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 3.286ms/iter
Torch version: 1.8.0.dev20210106+cu110
Torch CUDA version: 11.0
CUDNN Version: 8005
GeForce GTX 1080 Ti
Conv1d model, benchmark=False, deterministic=False, 0.709ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.711ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.844ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.711ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.684ms/iter
Conv2d model, benchmark=True, deterministic=False, 2.883ms/iter
Conv2d model, benchmark=False, deterministic=True, 2.212ms/iter
Conv2d model, benchmark=True, deterministic=True, 2.195ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 5.583ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 6.077ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 6.097ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 6.120ms/iter
Torch version: 1.6.0+cu101
Torch CUDA version: 10.1
CUDNN Version: 7603
GeForce GTX 1080 Ti
Conv1d model, benchmark=False, deterministic=False, 0.544ms/iter
Conv1d model, benchmark=True, deterministic=False, 0.542ms/iter
Conv1d model, benchmark=False, deterministic=True, 0.544ms/iter
Conv1d model, benchmark=True, deterministic=True, 0.542ms/iter
Conv2d model, benchmark=False, deterministic=False, 2.149ms/iter
Conv2d model, benchmark=True, deterministic=False, 2.332ms/iter
Conv2d model, benchmark=False, deterministic=True, 2.469ms/iter
Conv2d model, benchmark=True, deterministic=True, 2.483ms/iter
Conv2d1 model, benchmark=False, deterministic=False, 6.097ms/iter
Conv2d1 model, benchmark=True, deterministic=False, 6.637ms/iter
Conv2d1 model, benchmark=False, deterministic=True, 6.658ms/iter
Conv2d1 model, benchmark=True, deterministic=True, 6.679ms/iter
MNIST classification
Train a convolutional neural network, evaluate its accuracy on the test set, and record the training and testing time. The test code is as follows:
from __future__ import print_function
import argparse
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
device = 'cuda:0'
#device = 'cuda:1'
num_workers = 1
num_workers = 4
batch_size = 64
epochs = 10
benchmark = True
benchmark = False
deterministic = True
#deterministic = False
cudaTF32 = True
#cudaTF32 = False
cudnnTF32 = True
#cudnnTF32 = False
print("Torch Version: ", torch.__version__)
print("Torch CUDA Version: ", torch.version.cuda)
print("CUDNN Version: ", torch.backends.cudnn.version())
print("GPU Device: ", torch.cuda.get_device_name(int(device[-1])))
print("CUDNN Benchmark: ", benchmark)
print("CUDNN Deterministic: ", deterministic)
print("CUDA TF32: ", cudaTF32)
print("CUDNN TF32: ", cudnnTF32)
print("Workers: ", num_workers)
print("Batch Size: ", batch_size)
print("Epochs: ", epochs)
torch.backends.cudnn.benchmark = benchmark
torch.backends.cudnn.deterministic = deterministic
#torch.backends.cuda.matmul.allow_tf32 = cudaTF32
#torch.backends.cudnn.allow_tf32 = cudnnTF32
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    train_loss = 0.
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader.dataset)
    return train_loss
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    return test_loss
def main():
    global device, num_workers, batch_size, epochs
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=3, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=2020, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    torch.manual_seed(args.seed)
    device = torch.device(device if use_cuda else "cpu")
    args.batch_size = batch_size
    args.epochs = epochs
    kwargs = {'batch_size': args.batch_size}
    if use_cuda:
        kwargs.update({'num_workers': num_workers,
                       'pin_memory': True,
                       'shuffle': True},
                      )
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset1 = datasets.MNIST('../data', train=True, download=True,
                              transform=transform)
    dataset2 = datasets.MNIST('../data', train=False,
                              transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **kwargs)
    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    tstart = time.time()
    train_loss, test_loss = 0., 0.
    for epoch in range(1, args.epochs + 1):
        train_loss += train(args, model, device, train_loader, optimizer, epoch)
        test_loss += test(model, device, test_loader)
        scheduler.step()
    tend = time.time()
    train_loss /= args.epochs
    test_loss /= args.epochs
    print("Training Loss: ", train_loss)
    print("Testing Loss: ", test_loss)
    print("Time: %.4f" % (tend - tstart))
    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")

if __name__ == '__main__':
    main()
The results are shown below. On this test the 3090 is roughly on par with the 1080 Ti, and even slightly slower, which is far from the speedup NVIDIA advertises. On a more complex network of my own the 3090 performed even worse.
Torch Version: 1.8.0.dev20210106+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDNN Benchmark: False
CUDNN Deterministic: True
CUDA TF32: True
CUDNN TF32: True
Workers: 4
Batch Size: 64
Epochs: 10
Training Loss: 0.0009187350056997093
Testing Loss: 0.03026894662413977
Time: 63.0998
Torch Version: 1.8.0.dev20210106+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDNN Benchmark: False
CUDNN Deterministic: True
CUDA TF32: True
CUDNN TF32: True
Workers: 4
Batch Size: 64
Epochs: 10
Training Loss: 0.0008879667648342487
Testing Loss: 0.030606915746741615
Time: 56.9057
Torch Version: 1.6.0+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDNN Benchmark: False
CUDNN Deterministic: True
Workers: 4
Batch Size: 64
Epochs: 10
Training Loss: 0.0009102054347947707
Testing Loss: 0.029809928882313725
Time: 52.8144
Since the installed PyTorch binary targets CUDA 11.0 while this article installed CUDA 11.1, I worried the mismatch might be holding the 3090 back, so I reinstalled CUDA 11.0 and reran the test. The results below show the CUDA version makes no difference.
RTX 3090, PyTorch 1.8, MNIST, 3 runs
GTX 1080 Ti, PyTorch 1.8, MNIST, 3 runs
Test results with TensorFlow
CIFAR image classification
The TensorFlow version is 2.4.0, CUDA is 11.0, cuDNN is 8005, and the dataset is CIFAR-10. The test code is as follows:
Main file
import os
import time
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays,
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.summary()
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
tstart = time.time()
history = model.fit(train_images, train_labels, epochs=10,
validation_data=(test_images, test_labels))
tend = time.time()
print("Training time: ", tend - tstart)
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')
tstart = time.time()
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
tend = time.time()
print("Testing time: ", tend - tstart)
print(test_acc)
The results are below. Under TensorFlow, too, the 3090's performance is unremarkable and roughly matches the 1080 Ti:
2021-01-09 22:49:35.037113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22113 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
2021-01-09 22:49:36.122102: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-09 22:49:36.122543: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2298505000 Hz
Epoch 1/10
2021-01-09 22:49:36.553517: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 22:49:37.349985: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 22:49:37.353723: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-09 22:49:39.345531: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
1563/1563 [==============================] - 13s 6ms/step - loss: 1.7480 - accuracy: 0.3552 - val_loss: 1.2966 - val_accuracy: 0.5368
Epoch 2/10
1563/1563 [==============================] - 16s 11ms/step - loss: 1.1864 - accuracy: 0.5776 - val_loss: 1.1261 - val_accuracy: 0.6031
Epoch 3/10
1563/1563 [==============================] - 9s 6ms/step - loss: 1.0160 - accuracy: 0.6462 - val_loss: 0.9643 - val_accuracy: 0.6648
Epoch 4/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8975 - accuracy: 0.6860 - val_loss: 0.9399 - val_accuracy: 0.6661
Epoch 5/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8137 - accuracy: 0.7145 - val_loss: 0.9458 - val_accuracy: 0.6683
Epoch 6/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.7547 - accuracy: 0.7366 - val_loss: 0.8510 - val_accuracy: 0.7013
Epoch 7/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6963 - accuracy: 0.7557 - val_loss: 0.8670 - val_accuracy: 0.7034
Epoch 8/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6321 - accuracy: 0.7779 - val_loss: 0.8671 - val_accuracy: 0.7068
Epoch 9/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6121 - accuracy: 0.7854 - val_loss: 0.8556 - val_accuracy: 0.7122
Epoch 10/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.5720 - accuracy: 0.7980 - val_loss: 0.8800 - val_accuracy: 0.7110
Training time: 100.08890771865845
313/313 - 1s - loss: 0.8800 - accuracy: 0.7110
Testing time: 0.9198315143585205
0.7110000252723694
-----------------------------------------------------------------------
2021-01-09 22:53:03.004101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10269 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-01-09 22:53:03.788911: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-09 22:53:03.789360: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2298505000 Hz
Epoch 1/10
2021-01-09 22:53:04.196189: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-09 22:53:04.408018: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-09 22:53:04.410362: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
1563/1563 [==============================] - 12s 6ms/step - loss: 1.7363 - accuracy: 0.3623 - val_loss: 1.3144 - val_accuracy: 0.5389
Epoch 2/10
1563/1563 [==============================] - 10s 6ms/step - loss: 1.1831 - accuracy: 0.5816 - val_loss: 1.0454 - val_accuracy: 0.6328
Epoch 3/10
1563/1563 [==============================] - 9s 6ms/step - loss: 1.0309 - accuracy: 0.6423 - val_loss: 0.9749 - val_accuracy: 0.6596
Epoch 4/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.9134 - accuracy: 0.6766 - val_loss: 0.9642 - val_accuracy: 0.6651
Epoch 5/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.8405 - accuracy: 0.7080 - val_loss: 0.9484 - val_accuracy: 0.6706
Epoch 6/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.7702 - accuracy: 0.7287 - val_loss: 0.8654 - val_accuracy: 0.7013
Epoch 7/10
1563/1563 [==============================] - 10s 6ms/step - loss: 0.7279 - accuracy: 0.7445 - val_loss: 0.8597 - val_accuracy: 0.7013
Epoch 8/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6909 - accuracy: 0.7604 - val_loss: 0.9126 - val_accuracy: 0.6914
Epoch 9/10
1563/1563 [==============================] - 9s 6ms/step - loss: 0.6469 - accuracy: 0.7717 - val_loss: 0.9200 - val_accuracy: 0.6951
Epoch 10/10
1563/1563 [==============================] - 10s 6ms/step - loss: 0.6022 - accuracy: 0.7892 - val_loss: 0.8853 - val_accuracy: 0.7042
Training time: 97.05110216140747
313/313 - 1s - loss: 0.8853 - accuracy: 0.7042
Testing time: 1.0514824390411377
0.704200029373169
Things to note
Tensor Float32
According to the PyTorch documentation (TF32 on Ampere), 30-series GPUs support TensorFloat-32 computation. When enabled, the corresponding tensor cores are used, which is faster than ordinary floating-point arithmetic but less precise. It is enabled by default in PyTorch.
# The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
torch.backends.cuda.matmul.allow_tf32 = True
# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True
A strange phenomenon
The base environment:
- OS: Ubuntu16.04
- CUDA: 8.0, 9.0, 10.1, 11.1 installed side by side
First PyTorch configuration:
- Native Python environment: 1.6.0+cu101
- Anaconda environment: 1.8.0.dev20210106+cu110
Second PyTorch configuration:
- Native Python environment: 1.9.0.dev20210208+cu110 + Python3.6.11
- Anaconda environment: 1.8.0.dev20210208+cu110 + Python3.7.3
Third PyTorch configuration:
- Native Python environment: 1.8.0.dev20210208+cu110 + Python3.6.11
- Anaconda environment: 1.8.0.dev20210208+cu110 + Python3.7.3
In the end it turned out that whenever the PyTorch version in the native Python environment differs from the one in Anaconda, any program hangs as soon as it runs on the GPU; only when the two versions match does the hang disappear.
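A quick way to compare the two environments (a minimal check, not part of the original post) is to print the build information in each interpreter:
import torch
print(torch.__version__)                # e.g. '1.8.0.dev20210208+cu110'
print(torch.version.cuda)               # CUDA version the binary was built against
print(torch.backends.cudnn.version())   # bundled cuDNN version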
Performance of different PyTorch versions on different devices
- torch.manual_seed(seed) sets the random seed so that every run starts from the same initial network weights.
- torch.backends.cudnn.benchmark controls whether cuDNN may search for the fastest convolution implementation. Different devices and different convolutions lead to different speed/precision trade-offs among the implementations. If the network structure is static and the input sizes do not change, it can be set to True; otherwise it should be False, or the repeated search for fast kernels will itself consume a lot of time.
- torch.backends.cudnn.deterministic controls whether cuDNN is forbidden from using non-deterministic algorithms; if True, only deterministic algorithms are used. This setting affects both speed and precision.
- torch.backends.cuda.matmul.allow_tf32 controls whether TensorFloat-32 (TF32) tensor cores may be used for matrix multiplications. Setting it to True improves speed at some cost in precision. Only Ampere-architecture GPUs support it; on unsupported GPUs the flag has no effect.
- torch.backends.cudnn.allow_tf32 controls whether cuDNN may use TensorFloat-32 (TF32) tensor cores, with the same speed/precision trade-off and the same Ampere-only restriction.
- See also: pytorch, nvidia ampere tensor cores: speed vs precision; the precision loss is considerable.
A small sketch of how these flags are typically set is given right after this list.
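As a rough illustration (not copied verbatim from the test scripts above; the seed value simply follows the MNIST script), the flags can be set at the top of a script like this:
import torch

torch.manual_seed(2020)                     # reproducible initial weights
torch.backends.cudnn.benchmark = True       # let cuDNN search for the fastest kernels (static shapes only)
torch.backends.cudnn.deterministic = False  # allow non-deterministic algorithms
# TF32 flags only take effect on Ampere GPUs such as the RTX 3090; elsewhere they are ignored
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True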
For easy comparison, the timing results are summarized here first:
PyTorch 1.8.0.dev20210208+cu110 + RTX 3090+CUDA11
1. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
2. CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True : train: 55s, valid: 14s
3. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
4. CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
5. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False : train: 54s, valid: 14s
6. CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True : train: 54s, valid: 14s
7. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False : train: 10s, valid: 8s
8. CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True : train: 11s, valid: 8s
PyTorch 1.8.0.dev20210208+cu110 + GTX 1080TI+CUDA11
1. Benchmark=False, Deterministic=False: train: 21s, valid: 20s
2. Benchmark=False, Deterministic=True: train: 10s, valid: 10s
3. Benchmark=True, Deterministic=False: train: 18s, valid: 18s
4. Benchmark=True, Deterministic=True: train: 25s, valid: 18s
PyTorch 1.8.0.dev20210210+cu101 + GTX 1080TI+CUDA10
1. Benchmark=False, Deterministic=False: train: 29s, valid: 20s
2. Benchmark=False, Deterministic=True: train: 24s, valid: 20s
3. Benchmark=True, Deterministic=False: train: 17s, valid: 19s
4. Benchmark=True, Deterministic=True: train: 10s, valid: 10s
RTX 3090 + CUDA 11 results
The test configurations (each run twice) are:
- CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False
- CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True
- CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False
- CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True
- CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False
- CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True
- CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False
- CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True
From the results we can draw the following conclusions:
- Regardless of the Benchmark and Deterministic settings, TF32 gives essentially no speedup here, while its precision loss slightly worsens the metrics.
- Benchmark gives a large speedup (whatever the values of CUDATF32, CUDNNTF32, and Deterministic).
- Deterministic changes the speed only slightly (whatever the values of CUDATF32, CUDNNTF32, and Benchmark).
The timing results are:
- CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False: train: 54s, valid: 14s
- CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True: train: 55s, valid: 14s
- CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False: train: 10s, valid: 8s
- CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True: train: 11s, valid: 8s
- CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False: train: 54s, valid: 14s
- CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True: train: 54s, valid: 14s
- CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False: train: 10s, valid: 8s
- CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True: train: 11s, valid: 8s
The detailed results are:
CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7454, time: 51.87094736099243s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6298, time: 14.616422176361084s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7351, time: 53.827654123306274s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6338, time: 14.419052362442017s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0174, contrast: -3.7504, time: 54.31160569190979s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6331, time: 14.683520078659058s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7414, time: 54.10149121284485s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6291, time: 14.825896978378296s
CUDATF32=False, CUDNNTF32=False, Benchmark=False, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7438, time: 55.177640199661255s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6328, time: 14.768057823181152s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7462, time: 55.64263844490051s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6270, time: 14.934310674667358s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7438, time: 54.337199211120605s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6328, time: 14.743611574172974s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7462, time: 55.72549605369568s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6270, time: 14.936981439590454s
CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0174, contrast: -3.7344, time: 9.696520328521729s
--->Valid epoch: 1, loss: 10.0903, entropy: 10.0903, l1norm: 10.0172, contrast: -3.6324, time: 8.027890682220459s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7427, time: 9.307999849319458s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6288, time: 7.912508487701416s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0175, contrast: -3.7390, time: 10.154499053955078s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6319, time: 7.96985650062561s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0185, contrast: -3.7424, time: 10.21190619468689s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6280, time: 8.24509072303772s
CUDATF32=False, CUDNNTF32=False, Benchmark=True, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7376, time: 10.749010562896729s
--->Valid epoch: 1, loss: 10.0892, entropy: 10.0892, l1norm: 10.0169, contrast: -3.6317, time: 8.022538900375366s
--->Train epoch: 2, loss: 10.0845, entropy: 10.0845, l1norm: 10.0184, contrast: -3.7403, time: 11.13167119026184s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6277, time: 8.124176025390625s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7376, time: 11.482399225234985s
--->Valid epoch: 1, loss: 10.0892, entropy: 10.0892, l1norm: 10.0169, contrast: -3.6317, time: 8.082707166671753s
--->Train epoch: 2, loss: 10.0845, entropy: 10.0845, l1norm: 10.0184, contrast: -3.7403, time: 11.133780717849731s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6277, time: 7.9467527866363525s
CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0174, contrast: -3.7520, time: 53.30528664588928s
--->Valid epoch: 1, loss: 10.0915, entropy: 10.0915, l1norm: 10.0175, contrast: -3.6306, time: 14.644577264785767s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7400, time: 53.856117486953735s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6295, time: 14.902146816253662s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7454, time: 52.86871576309204s
--->Valid epoch: 1, loss: 10.0921, entropy: 10.0921, l1norm: 10.0176, contrast: -3.6283, time: 14.82244610786438s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7385, time: 54.21399688720703s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6301, time: 14.981997728347778s
CUDATF32=True, CUDNNTF32=True, Benchmark=False, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7479, time: 53.832154750823975s
--->Valid epoch: 1, loss: 10.0914, entropy: 10.0914, l1norm: 10.0175, contrast: -3.6294, time: 14.578207015991211s
--->Train epoch: 2, loss: 10.0850, entropy: 10.0850, l1norm: 10.0185, contrast: -3.7328, time: 55.1286780834198s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6291, time: 14.752575635910034s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7479, time: 54.217995166778564s
--->Valid epoch: 1, loss: 10.0914, entropy: 10.0914, l1norm: 10.0175, contrast: -3.6294, time: 14.726275444030762s
--->Train epoch: 2, loss: 10.0850, entropy: 10.0850, l1norm: 10.0185, contrast: -3.7328, time: 58.631773948669434s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6291, time: 15.65324854850769s
CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7399, time: 9.78132176399231s
--->Valid epoch: 1, loss: 10.0895, entropy: 10.0895, l1norm: 10.0170, contrast: -3.6324, time: 7.887274980545044s
--->Train epoch: 2, loss: 10.0848, entropy: 10.0848, l1norm: 10.0185, contrast: -3.7401, time: 9.359825134277344s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6280, time: 7.946653127670288s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0838, entropy: 10.0838, l1norm: 10.0174, contrast: -3.7392, time: 9.73360824584961s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6307, time: 7.910605192184448s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7455, time: 9.480495691299438s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6302, time: 8.091880798339844s
CUDATF32=True, CUDNNTF32=True, Benchmark=True, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7435, time: 10.862586498260498s
--->Valid epoch: 1, loss: 10.0924, entropy: 10.0924, l1norm: 10.0177, contrast: -3.6264, time: 7.919918537139893s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7405, time: 10.470875024795532s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6289, time: 7.927625894546509s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce RTX 3090
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0175, contrast: -3.7435, time: 10.894693613052368s
--->Valid epoch: 1, loss: 10.0924, entropy: 10.0924, l1norm: 10.0177, contrast: -3.6264, time: 7.909468173980713s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0184, contrast: -3.7405, time: 10.432015419006348s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6289, time: 7.901139974594116s
GTX 1080 Ti + CUDA 11
First, check whether CUDA TF32 and CUDNN TF32 have any effect on the 1080 Ti; the results below show that they do not.
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.39651656150818s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.37838625907898s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 60.98785209655762s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.49517011642456s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: True
CUDNN TF32: True
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.536439895629883s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.666481971740723s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.99992823600769s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.800203800201416s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.48216986656189s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.80778193473816s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.893065929412842s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.811289072036743s
Next, test the effect of Benchmark and Deterministic with the following configurations (each run twice):
- Benchmark=False, Deterministic=False
- Benchmark=False, Deterministic=True
- Benchmark=True, Deterministic=False
- Benchmark=True, Deterministic=True
The timing results are:
- Benchmark=False, Deterministic=False: train: 21s, valid: 20s
- Benchmark=False, Deterministic=True: train: 10s, valid: 10s
- Benchmark=True, Deterministic=False: train: 18s, valid: 18s
- Benchmark=True, Deterministic=True: train: 25s, valid: 18s
The detailed results are:
Benchmark=False, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7399, time: 20.35549545288086s
--->Valid epoch: 1, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6324, time: 18.916125535964966s
--->Train epoch: 2, loss: 10.0855, entropy: 10.0855, l1norm: 10.0186, contrast: -3.7225, time: 22.069878578186035s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6277, time: 20.496481895446777s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7452, time: 20.829981803894043s
--->Valid epoch: 1, loss: 10.0902, entropy: 10.0902, l1norm: 10.0172, contrast: -3.6316, time: 20.57017183303833s
--->Train epoch: 2, loss: 10.0841, entropy: 10.0841, l1norm: 10.0183, contrast: -3.7417, time: 22.458086252212524s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6297, time: 20.48271942138672s
Benchmark=False, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.37249255180359s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.48044991493225s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7361, time: 61.695250272750854s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6316, time: 20.484682321548462s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0833, entropy: 10.0833, l1norm: 10.0174, contrast: -3.7472, time: 61.24735903739929s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6336, time: 20.471797704696655s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7361, time: 61.43708038330078s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6316, time: 20.503613471984863s
Benchmark=True, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0841, entropy: 10.0841, l1norm: 10.0176, contrast: -3.7429, time: 18.17175269126892s
--->Valid epoch: 1, loss: 10.0923, entropy: 10.0923, l1norm: 10.0177, contrast: -3.6254, time: 18.15757131576538s
--->Train epoch: 2, loss: 10.0849, entropy: 10.0849, l1norm: 10.0185, contrast: -3.7369, time: 18.792244911193848s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6295, time: 18.823336124420166s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7487, time: 18.5258526802063s
--->Valid epoch: 1, loss: 10.0930, entropy: 10.0930, l1norm: 10.0178, contrast: -3.6256, time: 18.816319704055786s
--->Train epoch: 2, loss: 10.0852, entropy: 10.0852, l1norm: 10.0185, contrast: -3.7307, time: 18.378268718719482s
--->Valid epoch: 2, loss: 10.0885, entropy: 10.0885, l1norm: 10.0168, contrast: -3.6320, time: 18.800978660583496s
Benchmark=True, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.72299075126648s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.807397603988647s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.79756498336792s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.80388379096985s
Torch Version: 1.8.0.dev20210208+cu110
Torch CUDA Version: 11.0
CUDNN Version: 8005
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7472, time: 25.729687929153442s
--->Valid epoch: 1, loss: 10.0894, entropy: 10.0894, l1norm: 10.0169, contrast: -3.6280, time: 18.8100643157959s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7435, time: 25.809940576553345s
--->Valid epoch: 2, loss: 10.0890, entropy: 10.0890, l1norm: 10.0169, contrast: -3.6303, time: 18.82235860824585s
GTX 1080 Ti + CUDA 10
The test configurations (each run twice) are:
- Benchmark=False, Deterministic=False
- Benchmark=False, Deterministic=True
- Benchmark=True, Deterministic=False
- Benchmark=True, Deterministic=True
The timing results are:
- Benchmark=False, Deterministic=False: train: 29s, valid: 20s
- Benchmark=False, Deterministic=True: train: 24s, valid: 20s
- Benchmark=True, Deterministic=False: train: 17s, valid: 19s
- Benchmark=True, Deterministic=True: train: 10s, valid: 10s
The detailed results are:
Benchmark=False, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0836, entropy: 10.0836, l1norm: 10.0174, contrast: -3.7381, time: 28.030856609344482s
--->Valid epoch: 1, loss: 10.0925, entropy: 10.0925, l1norm: 10.0177, contrast: -3.6266, time: 19.793615102767944s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7522, time: 29.9300274848938s
--->Valid epoch: 2, loss: 10.0899, entropy: 10.0899, l1norm: 10.0172, contrast: -3.6343, time: 20.407748699188232s
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0174, contrast: -3.7364, time: 28.959442138671875s
--->Valid epoch: 1, loss: 10.0906, entropy: 10.0906, l1norm: 10.0173, contrast: -3.6312, time: 20.42161989212036s
--->Train epoch: 2, loss: 10.0843, entropy: 10.0843, l1norm: 10.0184, contrast: -3.7435, time: 29.956018924713135s
--->Valid epoch: 2, loss: 10.0887, entropy: 10.0887, l1norm: 10.0168, contrast: -3.6298, time: 20.411144495010376s
Benchmark=False, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0173, contrast: -3.7488, time: 23.54142165184021s
--->Valid epoch: 1, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6269, time: 20.411819219589233s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0187, contrast: -3.7421, time: 24.92523694038391s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6279, time: 20.398447513580322s
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: False
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0832, entropy: 10.0832, l1norm: 10.0173, contrast: -3.7488, time: 23.943784952163696s
--->Valid epoch: 1, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6269, time: 20.42457938194275s
--->Train epoch: 2, loss: 10.0846, entropy: 10.0846, l1norm: 10.0187, contrast: -3.7421, time: 25.094601154327393s
--->Valid epoch: 2, loss: 10.0888, entropy: 10.0888, l1norm: 10.0168, contrast: -3.6279, time: 20.471126079559326s
Benchmark=True, Deterministic=False (2 runs)
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0834, entropy: 10.0834, l1norm: 10.0175, contrast: -3.7485, time: 16.708702325820923s
--->Valid epoch: 1, loss: 10.0905, entropy: 10.0905, l1norm: 10.0173, contrast: -3.6340, time: 19.010523319244385s
--->Train epoch: 2, loss: 10.0838, entropy: 10.0838, l1norm: 10.0183, contrast: -3.7436, time: 17.553938627243042s
--->Valid epoch: 2, loss: 10.0883, entropy: 10.0883, l1norm: 10.0167, contrast: -3.6318, time: 19.111060619354248s
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: False
--->Train epoch: 1, loss: 10.0840, entropy: 10.0840, l1norm: 10.0175, contrast: -3.7400, time: 16.517553091049194s
--->Valid epoch: 1, loss: 10.0916, entropy: 10.0916, l1norm: 10.0175, contrast: -3.6292, time: 18.997257232666016s
--->Train epoch: 2, loss: 10.0847, entropy: 10.0847, l1norm: 10.0185, contrast: -3.7379, time: 17.554461240768433s
--->Valid epoch: 2, loss: 10.0886, entropy: 10.0886, l1norm: 10.0168, contrast: -3.6306, time: 19.144949674606323s
Benchmark=True, Deterministic=True (2 runs)
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0839, entropy: 10.0839, l1norm: 10.0175, contrast: -3.7339, time: 20.768550872802734s
--->Valid epoch: 1, loss: 10.0900, entropy: 10.0900, l1norm: 10.0172, contrast: -3.6320, time: 19.122512578964233s
--->Train epoch: 2, loss: 10.0835, entropy: 10.0835, l1norm: 10.0183, contrast: -3.7527, time: 21.37337899208069s
--->Valid epoch: 2, loss: 10.0892, entropy: 10.0892, l1norm: 10.0170, contrast: -3.6347, time: 19.099023818969727s
Torch Version: 1.8.0.dev20210210+cu101
Torch CUDA Version: 10.1
CUDNN Version: 7603
GPU Device: GeForce GTX 1080 Ti
CUDA TF32: False
CUDNN TF32: False
CUDNN Benchmark: True
CUDNN Deterministic: True
--->Train epoch: 1, loss: 10.0837, entropy: 10.0837, l1norm: 10.0175, contrast: -3.7441, time: 21.45243787765503s
--->Valid epoch: 1, loss: 10.0929, entropy: 10.0929, l1norm: 10.0178, contrast: -3.6257, time: 19.00617289543152s
--->Train epoch: 2, loss: 10.0854, entropy: 10.0854, l1norm: 10.0185, contrast: -3.7285, time: 21.43665385246277s
--->Valid epoch: 2, loss: 10.0889, entropy: 10.0889, l1norm: 10.0168, contrast: -3.6309, time: 19.1219220161438s