Linux 9daadec1cf6e 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
My Issue:
When I run the command sh tool/train.sh ade20k spnet50x, a RuntimeError is raised:
Traceback (most recent call last):
File "tool/train.py", line 404, in <module>
main()
File "tool/train.py", line 81, in main
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/workspace/SPNet/tool/train.py", line 219, in main_worker
loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
File "/workspace/SPNet/tool/train.py", line 259, in train
output, main_loss, aux_loss = model(input, target)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/workspace/SPNet/models/spnet.py", line 27, in forward
_, _, c3, c4 = self.base_forward(x)
File "/workspace/SPNet/models/base.py", line 60, in base_forward
x = self.pretrained.conv1(x)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
File "/opt/conda/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 39, in forward
torch.distributed.all_gather(mean_l, mean, process_group)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: All tensor operands to scatter/gather must have the same size
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
len(cache))
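For context on the failure: torch.distributed.all_gather requires the input tensor and every entry of the output list to have the same shape, and apex's SyncBatchNorm gathers per-channel mean/variance statistics through exactly this collective. The standalone sketch below (my own minimal reproduction, not SPNet code; two CPU processes over the gloo backend so no GPU is needed) triggers the same class of size-mismatch error — the exact message varies by backend, but the constraint is the same:

```python
# Minimal sketch, not SPNet code: demonstrate the all_gather size
# constraint with two spawned CPU processes on the gloo backend.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    # all_gather expects one common shape across the input tensor and
    # all output-list entries. Rank 1 deliberately contributes a larger
    # tensor, so the collective is rejected with a size-mismatch
    # RuntimeError instead of running.
    tensor = torch.ones(4 if rank == 0 else 8)
    gathered = [torch.zeros(4) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2, args=(2,))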
My Environment:
Using the Docker Hub image pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 On | N/A |
| 34% 47C P8 15W / 250W | 269MiB / 11177MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 99% 90C P2 222W / 250W | 6295MiB / 11178MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 35% 48C P8 10W / 250W | 6305MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 39% 53C P8 10W / 250W | 12MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Package Version
apex 0.1
backcall 0.2.0
beautifulsoup4 4.9.1
certifi 2020.6.20
cffi 1.14.0
chardet 3.0.4
conda 4.8.4
conda-build 3.18.11
conda-package-handling 1.7.0
cryptography 2.9.2
decorator 4.4.2
filelock 3.0.12
glob2 0.7
idna 2.9
ipython 7.16.1
ipython-genutils 0.2.0
jedi 0.17.1
Jinja2 2.11.2
libarchive-c 2.9
MarkupSafe 1.1.1
mkl-fft 1.1.0
mkl-random 1.1.1
mkl-service 2.3.0
numpy 1.18.5
olefile 0.46
parso 0.7.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.0.2
pkginfo 1.5.0.1
prompt-toolkit 3.0.5
protobuf 3.12.4
psutil 5.7.0
ptyprocess 0.6.0
pycosat 0.6.3
pycparser 2.20
Pygments 2.6.1
pyOpenSSL 19.1.0
PySocks 1.7.1
pytz 2020.1
PyYAML 5.3.1
requests 2.23.0
ruamel-yaml 0.15.87
setuptools 46.4.0.post20200518
six 1.14.0
soupsieve 2.0.1
tensorboardX 2.1
torch 1.6.0
torchvision 0.7.0
tqdm 4.46.0
traitlets 4.3.3
urllib3 1.25.8
wcwidth 0.2.5
wheel 0.34.2
Googling turned up few useful answers; the full error log (err.log) is attached above. Could anyone give me some advice?
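One check that might narrow this down (my own sketch; log_rank_shape is a hypothetical helper, not part of the SPNet code): print each rank's input shape right before the model(input, target) call in tool/train.py, since the all_gather inside SyncBatchNorm only succeeds when every GPU feeds the network an identically shaped batch.

```python
# Hypothetical helper (log_rank_shape is my own name, not in SPNet):
# call it right before "output, main_loss, aux_loss = model(input, target)"
# in tool/train.py to see whether every rank receives the same batch shape.
import torch.distributed as dist

def log_rank_shape(tag, tensor):
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    print(f"[rank {rank}] {tag}: {tuple(tensor.shape)}", flush=True)
```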