How to Rent an AutoDL GPU for Model Training (K8s Deployment Not Supported)
Usage Steps
1. Rent a new instance. See the AutoDL GPU rental platform tutorial and AutoDL Quick Start.
2. Install the personal edition of XShell 7 + Xftp 7. Install XShell first and Xftp second, otherwise the Xftp installer fails with a fatal -1603 error. See the XShell installation guide.
3. Connect to the server with XShell and upload your code to /root/autodl-tmp via Xftp, because the root directory is the system disk (20 GB) while autodl-tmp is the mounted data disk (100 GB).
4. Create and activate a virtual environment (installing packages directly into the base/root environment is not recommended):

conda create -n fire_environment python=3.7  # create a virtual environment named fire_environment
conda init bash && source /root/.bashrc      # reload the environment variables from bashrc
conda activate fire_environment              # switch to the new fire_environment environment
conda info -e                                # list existing environments
5. Use the Conda virtual environment in a JupyterLab notebook:

# register the new Conda environment as a JupyterLab kernel
conda activate fire_environment                         # switch to fire_environment
conda install ipykernel
ipython kernel install --user --name=fire_environment   # --user = current user; fire_environment = kernel name
6. Money-saving tips
- While setting up the project environment (bandwidth is limited and downloads are slow), boot in no-GPU mode, which costs 0.1 CNY/hour (fortunately I had a voucher at the start, otherwise I would have been kicking myself).
- Shut the instance down when you are not running anything, or you keep being billed by the hour.
- Don't leave small jobs running overnight: an instance booted in GPU mode is billed as usual, so avoid topping up too much money at once (it's like forgetting to pull your utility card and watching the charges tick on).
7. TensorBoard: save the event files from your project's log folder into /root/tf-logs, or change the default log path. See "Using TensorBoard on AutoDL".
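The event-file copy in step 7 can be scripted. A minimal sketch, assuming your project writes standard `events.out.tfevents.*` files (the function name and source path here are illustrative; on AutoDL the destination would be /root/tf-logs):

```python
import shutil
from pathlib import Path

def collect_event_files(src_dir, dst_dir):
    """Copy all TensorBoard event files under src_dir into dst_dir.

    Returns the number of files copied. Paths are illustrative; on
    AutoDL, dst_dir would typically be /root/tf-logs so the built-in
    TensorBoard tab can find the logs.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for event_file in Path(src_dir).rglob("events.out.tfevents.*"):
        shutil.copy2(event_file, dst / event_file.name)
        copied += 1
    return copied
```

Pointing TensorBoard at a fixed directory like this avoids editing the training code itself.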
Pitfalls
1. The RTX 3090 is not compatible with cu101 builds of torch

Note that the RTX 3090 does not work with a cu101 build of torch; it fails with:
/root/miniconda3/envs/fire_environment/lib/python3.7/site-packages/torch/cuda/__init__.py:143: UserWarning: NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Fix: install a CUDA 11.0 build of PyTorch:

# uninstall the old build first (I had installed pytorch via conda)
conda uninstall pytorch
conda uninstall libtorch
pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
See: "PyTorch/GPU version compatibility" and "Installing CUDA on Ubuntu: RTX 3090 + PyTorch 1.7 + CUDA 11.0 + Anaconda".
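The warning above boils down to a compute-capability mismatch: the cu101 wheel ships binaries up to sm_75, while the 3090 is sm_86. A stdlib sketch of that check (the arch list is taken from the warning message; the function name is illustrative, and real PyTorch also considers JIT-compilable compute_XY PTX targets, which this sketch ignores):

```python
def wheel_supports_gpu(arch_list, capability):
    """Return True if a PyTorch wheel built for the given sm_XY archs
    ships a binary kernel for a GPU with (major, minor) compute
    capability. Simplified: ignores forward-compatible PTX targets.
    """
    major, minor = capability
    return any(arch == f"sm_{major}{minor}" for arch in arch_list)

# arch list reported by the cu101 wheel in the warning above
cu101_archs = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75"]
print(wheel_supports_gpu(cu101_archs, (8, 6)))  # RTX 3090 (sm_86) -> False
print(wheel_supports_gpu(cu101_archs, (7, 5)))  # RTX 2080 Ti (sm_75) -> True
```

This is why the same code runs on a 2080 Ti but not on a 3090 until you switch to a cu110 wheel.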
2. opencv-python must be an older version for the blend_truth_mosaic function to work
1) Running the yolov4 project on CPU fails on both the RTX 3090 and the RTX 2080 Ti instances, while the same project runs fine on CPU in Colab's environment.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_21071/3778713862.py in <module>
626 config=cfg,
627 epochs=cfg.TRAIN_EPOCHS,
--> 628 device=device, )
629 except KeyboardInterrupt:
630 torch.save(model.state_dict(), 'INTERRUPTED.pth')
/tmp/ipykernel_21071/3778713862.py in train(model, device, config, epochs, batch_size, save_cp, log_step, img_scale)
370
371 with tqdm(total=n_train, desc=f'Epoch {epoch + 1}/{epochs}', unit='img', ncols=50) as pbar:
--> 372 for i, batch in enumerate(train_loader):
373 global_step += 1
374 epoch_step += 1
~/miniconda3/envs/fire_environment/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
519 if self._sampler_iter is None:
520 self._reset()
--> 521 data = self._next_data()
522 self._num_yielded += 1
523 if self._dataset_kind == _DatasetKind.Iterable and \
~/miniconda3/envs/fire_environment/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
559 def _next_data(self):
560 index = self._next_index() # may raise StopIteration
--> 561 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
562 if self._pin_memory:
563 data = _utils.pin_memory.pin_memory(data)
~/miniconda3/envs/fire_environment/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
47 def fetch(self, possibly_batched_index):
48 if self.auto_collation:
---> 49 data = [self.dataset[idx] for idx in possibly_batched_index]
50 else:
51 data = self.dataset[possibly_batched_index]
~/miniconda3/envs/fire_environment/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
47 def fetch(self, possibly_batched_index):
48 if self.auto_collation:
---> 49 data = [self.dataset[idx] for idx in possibly_batched_index]
50 else:
51 data = self.dataset[possibly_batched_index]
~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master1/dataset.py in __getitem__(self, index)
379
380 out_img, out_bbox = blend_truth_mosaic(out_img, ai, truth.copy(), self.cfg.w, self.cfg.h, cut_x,
--> 381 cut_y, i, left_shift, right_shift, top_shift, bot_shift)
382 out_bboxes.append(out_bbox)
383 # print(img_path)
~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master1/dataset.py in blend_truth_mosaic(out_img, img, bboxes, w, h, cut_x, cut_y, i_mixup, left_shift, right_shift, top_shift, bot_shift)
224 if i_mixup == 1:
225 bboxes = filter_truth(bboxes, cut_x - right_shift, top_shift, w - cut_x, cut_y, cut_x, 0)
--> 226 out_img[:cut_y, cut_x:] = img[top_shift:top_shift + cut_y, cut_x - right_shift:w - right_shift]
227 if i_mixup == 2:
228 bboxes = filter_truth(bboxes, left_shift, cut_y - bot_shift, cut_x, h - cut_y, 0, cut_y)
ValueError: could not broadcast input array from shape (320,121,3) into shape (320,204,3)
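The ValueError at the end of the traceback is a plain NumPy shape mismatch: the destination slice of the mosaic canvas is 204 pixels wide, but the augmented tile came back only 121 pixels wide. A minimal reproduction (shapes taken from the error message):

```python
import numpy as np

# mosaic canvas quadrant: 320 rows, 204 columns, 3 channels
out_region = np.zeros((320, 204, 3), dtype=np.uint8)
# tile returned by the (failed) augmentation: only 121 columns wide
tile = np.zeros((320, 121, 3), dtype=np.uint8)

try:
    # same kind of assignment as out_img[:cut_y, cut_x:] = img[...]
    out_region[:, :] = tile
except ValueError as e:
    print(e)  # could not broadcast input array from shape (320,121,3) into shape (320,204,3)
```

The root cause is therefore upstream: the OpenCV augmentation failed and handed back a tile with the wrong width, and the mosaic assignment merely surfaces it.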
On AutoDL, blend_truth_mosaic first reports an "OpenCV can't augment image: xxx" error. Checking versions: Colab ships opencv-python 4.1.2, while the opencv-python I installed manually in the AutoDL environment was 4.5.5. After downgrading cv2 on AutoDL to 4.1.2, the yolov4 project ran fine. Project code: https://github.com/Tianxiaomo/pytorch-YOLOv4.
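The downgrade itself is a one-line pin. The exact PyPI patch number below is an assumption (4.1.2.30 is the 4.1.2-series wheel I would expect; check PyPI if it is unavailable):

```shell
# remove the too-new build, then install a 4.1.2-series wheel
pip uninstall -y opencv-python
pip install opencv-python==4.1.2.30   # assumed patch release of the 4.1.2 series
python -c "import cv2; print(cv2.__version__)"   # should report 4.1.2
```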
3. The trained weight file was not saved
The model I trained overnight (which cost me 10 CNY) was gone by morning: the weight file (Yolo-v4.pth) was nowhere to be found in the checkpoint folder. My guess was that the save logic overwrites the same file on every epoch:

save_path = os.path.join(config.checkpoints, f'{save_prefix}.pth')

instead of creating a new file each time (prefix plus the epoch number):

save_path = os.path.join(config.checkpoints, f'{save_prefix + str(epoch)}.pth')

On Colab, however, this problem never occurred; the weight files were saved automatically to the mounted drive.
save_path = os.path.join(config.checkpoints, f'{save_prefix}.pth')
if save_cp:
    try:
        # os.mkdir(config.checkpoints)
        os.makedirs(config.checkpoints, exist_ok=True)
        logging.info('Created checkpoint directory')
    except OSError:
        pass
    # save_path = os.path.join(config.checkpoints, f'{save_prefix}.pth')
    torch.save(model.state_dict(), save_path)
    logging.info(f'Checkpoint {epoch + 1} saved !')
    saved_models.append(save_path)
    # keep at most keep_checkpoint_max checkpoints on disk
    if len(saved_models) > config.keep_checkpoint_max > 0:
        model_to_remove = saved_models.popleft()
        try:
            os.remove(model_to_remove)
        except OSError:
            logging.info(f'failed to remove {model_to_remove}')
The console output also confirms that the weight file was in fact saved successfully:
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.079
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.173
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.037
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.093
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.086
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.164
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.251
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.256
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.283
2022-03-06 01:37:28,656 2316355449.py[line:449] INFO: Created checkpoint directory
2022-03-06 01:37:29,101 2316355449.py[line:454] INFO: Checkpoint 151 saved !
Testing my hypotheses:

1) Write each epoch's weights to a new .pth file (at 300 epochs and 244 MB per file, that is about 300 × 244 / 1024 ≈ 71.48 GB in total), then cd into checkpoints from the console: the weight files are written successfully. (Note that JupyterLab cannot browse the checkpoints folder to inspect the files; per AutoDL support, you have to cd there from the console.)
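A quick sanity check of that disk-usage estimate (per-file size in MB, result in GB):

```python
epochs = 300
checkpoint_mb = 244                      # size of one Yolov4 .pth state dict
total_gb = epochs * checkpoint_mb / 1024 # MB -> GB
print(round(total_gb, 2))                # 71.48
```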
(base) root@container-a698118c3c-de1a3f0f:~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master# cd checkpoints/
(base) root@container-a698118c3c-de1a3f0f:~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master/checkpoints# ls
Yolov4_epoch.pth Yolov4_epoch0.pth Yolov4_epoch1.pth
(base) root@container-a698118c3c-de1a3f0f:~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master/checkpoints#
2) Write each epoch's weights into the original single .pth file (Yolov4.pth), then cd into checkpoints from the console: the weight file is written successfully and nothing is lost. Even after shutting the instance down, the file is still there.
(base) root@container-a698118c3c-de1a3f0f:~# cd /root/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master/
(base) root@container-a698118c3c-de1a3f0f:~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master# cd checkpoints/
(base) root@container-a698118c3c-de1a3f0f:~/autodl-tmp/Application/pytorch-YOLOv4-Darknet53-master/checkpoints# ls
Yolov4_epoch0.pth Yolov4_epoch1.pth Yolov4_epoch2.pth Yolov4_epoch3.pth Yolov4_epoch4.pth Yolov4_epoch5.pth Yolov4_epoch.pth
That leaves me with only two guesses:
- my account was in arrears (by 0.31 CNY), which made persisting the weight file fail; or
- AutoDL's distributed storage / virtualization layer has a bug and cannot guarantee durability.

I won't bother verifying either one; all in all, the experience was just so-so.
4. AutoDL instances are images, not virtual machines, and do not support K8s deployment
On the rented AutoDL server I tried to deploy the Zhejiang Lab TianShu AI platform, but setting up k8s failed with: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? Most answers online say Docker simply hasn't been started and suggest systemctl start docker.service, but that didn't help either.
I asked AutoDL support: the PyTorch instance AutoDL creates is an nvidia-docker image, not a virtual machine, so you cannot install Docker inside Docker (no nesting) and cannot deploy K8s on the server.
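You can confirm for yourself that the instance is a container rather than a VM. This heuristic (Docker's marker file, or container runtimes named in PID 1's cgroup) is a common sketch, not anything AutoDL-specific:

```python
import os

def probably_in_container():
    """Heuristic: detect whether this process runs inside a container.

    Checks for Docker's /.dockerenv marker file and for container
    runtimes named in PID 1's cgroup. Not exhaustive, but enough to
    explain why no Docker daemon is reachable inside an AutoDL instance.
    """
    if os.path.exists("/.dockerenv"):
        return True
    try:
        with open("/proc/1/cgroup") as f:
            content = f.read()
    except OSError:
        return False
    return any(m in content for m in ("docker", "kubepods", "containerd", "lxc"))

print(probably_in_container())
```

If this prints True, there is no host Docker daemon to connect to from inside, which is exactly the error above.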