Error details: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
This error occurred while I was using torch.nn.parallel.DistributedDataParallel to train models in parallel. I launched program A with python -m torch.distributed.launch --nproc_per_node=2 trainA.py and it worked fine. Then, while A was still running, I tried to launch program B on the same machine with python -m torch.distributed.launch --nproc_per_node=2 trainB.py and got the error above.
It turns out that the issue is a port conflict. As the error reports, port 29500 is already in use: it is the launcher's default master port, and program A was holding it. Hence, giving program B a different master port should work. So I used the command python -m torch.distributed.launch --nproc_per_node=2 --master_port='29501' trainB.py.
Problem solved!!!
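
If you would rather not guess whether 29501 is free, you can ask the operating system for an unused port before launching. The following is only a minimal sketch using Python's standard socket module; the function names find_free_port and is_port_free are my own and are not part of PyTorch.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused at this moment."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the OS to pick any free port
        return s.getsockname()[1]


def is_port_free(port: int) -> bool:
    """Return True if we can bind to the port, False if it is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
        except OSError:  # e.g. errno 98, "Address already in use"
            return False
        return True


if __name__ == "__main__":
    # Print a candidate port to pass to --master_port.
    print(find_free_port())
```

Saved under a hypothetical name such as find_port.py, the printed number can be passed to --master_port when launching program B. Note that the port is only guaranteed to be free at the moment of the check; another process could still grab it before the launcher binds to it.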
