Linux 9daadec1cf6e 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
My Issue:
When I run the command sh tool/train.sh ade20k spnet50x, a RuntimeError is raised:
Traceback (most recent call last):
File "tool/train.py", line 404, in <module>
main()
File "tool/train.py", line 81, in main
mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/workspace/SPNet/tool/train.py", line 219, in main_worker
loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
File "/workspace/SPNet/tool/train.py", line 259, in train
output, main_loss, aux_loss = model(input, target)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/workspace/SPNet/models/spnet.py", line 27, in forward
_, _, c3, c4 = self.base_forward(x)
File "/workspace/SPNet/models/base.py", line 60, in base_forward
x = self.pretrained.conv1(x)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
File "/opt/conda/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 39, in forward
torch.distributed.all_gather(mean_l, mean, process_group)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: All tensor operands to scatter/gather must have the same size
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
len(cache))
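For context on the failure: torch.distributed.all_gather requires the input tensor and every entry of the output list to have the same shape, and apex's SyncBatchNorm gathers per-channel mean/variance statistics through exactly this collective. The standalone sketch below (my own minimal reproduction, not SPNet code; two CPU processes over the gloo backend so no GPU is needed) triggers the same class of size-mismatch error — the exact message varies by backend, but the constraint is the same:

```python
# Minimal sketch, not SPNet code: demonstrate the all_gather size
# constraint with two spawned CPU processes on the gloo backend.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    # all_gather expects one common shape across the input tensor and
    # all output-list entries. Rank 1 deliberately contributes a larger
    # tensor, so the collective is rejected with a size-mismatch
    # RuntimeError instead of running.
    tensor = torch.ones(4 if rank == 0 else 8)
    gathered = [torch.zeros(4) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)

if __name__ == "__main__":
    mp.spawn(worker, nprocs=2, args=(2,))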
My Environment:
Using the Docker Hub image pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 On | N/A |
| 34% 47C P8 15W / 250W | 269MiB / 11177MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 99% 90C P2 222W / 250W | 6295MiB / 11178MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 35% 48C P8 10W / 250W | 6305MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 39% 53C P8 10W / 250W | 12MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Package Version
apex 0.1
backcall 0.2.0
beautifulsoup4 4.9.1
certifi 2020.6.20
cffi 1.14.0
chardet 3.0.4
conda 4.8.4
conda-build 3.18.11
conda-package-handling 1.7.0
cryptography 2.9.2
decorator 4.4.2
filelock 3.0.12
glob2 0.7
idna 2.9
ipython 7.16.1
ipython-genutils 0.2.0
jedi 0.17.1
Jinja2 2.11.2
libarchive-c 2.9
MarkupSafe 1.1.1
mkl-fft 1.1.0
mkl-random 1.1.1
mkl-service 2.3.0
numpy 1.18.5
olefile 0.46
parso 0.7.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.0.2
pkginfo 1.5.0.1
prompt-toolkit 3.0.5
protobuf 3.12.4
psutil 5.7.0
ptyprocess 0.6.0
pycosat 0.6.3
pycparser 2.20
Pygments 2.6.1
pyOpenSSL 19.1.0
PySocks 1.7.1
pytz 2020.1
PyYAML 5.3.1
requests 2.23.0
ruamel-yaml 0.15.87
setuptools 46.4.0.post20200518
six 1.14.0
soupsieve 2.0.1
tensorboardX 2.1
torch 1.6.0
torchvision 0.7.0
tqdm 4.46.0
traitlets 4.3.3
urllib3 1.25.8
wcwidth 0.2.5
wheel 0.34.2
Googling turned up few useful answers; the full error log (err.log) is attached above. Could anyone give me some advice?
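One check that might narrow this down (my own sketch; log_rank_shape is a hypothetical helper, not part of the SPNet code): print each rank's input shape right before the model(input, target) call in tool/train.py, since the all_gather inside SyncBatchNorm only succeeds when every GPU feeds the network an identically shaped batch.

```python
# Hypothetical helper (log_rank_shape is my own name, not in SPNet):
# call it right before "output, main_loss, aux_loss = model(input, target)"
# in tool/train.py to see whether every rank receives the same batch shape.
import torch.distributed as dist

def log_rank_shape(tag, tensor):
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    print(f"[rank {rank}] {tag}: {tuple(tensor.shape)}", flush=True)
```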