RuntimeError: All tensor operands to scatter/gather must have the same size #58

Alraemon opened this issue on Aug 14, 2020
My Environment:

Using the pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel Docker image from Docker Hub.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0  On |                  N/A |
| 34%   47C    P8    15W / 250W |    269MiB / 11177MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 99%   90C    P2   222W / 250W |   6295MiB / 11178MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 35%   48C    P8    10W / 250W |   6305MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 39%   53C    P8    10W / 250W |     12MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Package Version
apex 0.1
backcall 0.2.0
beautifulsoup4 4.9.1
certifi 2020.6.20
cffi 1.14.0
chardet 3.0.4
conda 4.8.4
conda-build 3.18.11
conda-package-handling 1.7.0
cryptography 2.9.2
decorator 4.4.2
filelock 3.0.12
glob2 0.7
idna 2.9
ipython 7.16.1
ipython-genutils 0.2.0
jedi 0.17.1
Jinja2 2.11.2
libarchive-c 2.9
MarkupSafe 1.1.1
mkl-fft 1.1.0
mkl-random 1.1.1
mkl-service 2.3.0
numpy 1.18.5
olefile 0.46
parso 0.7.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.0.2
pkginfo 1.5.0.1
prompt-toolkit 3.0.5
protobuf 3.12.4
psutil 5.7.0
ptyprocess 0.6.0
pycosat 0.6.3
pycparser 2.20
Pygments 2.6.1
pyOpenSSL 19.1.0
PySocks 1.7.1
pytz 2020.1
PyYAML 5.3.1
requests 2.23.0
ruamel-yaml 0.15.87
setuptools 46.4.0.post20200518
six 1.14.0
soupsieve 2.0.1
tensorboardX 2.1
torch 1.6.0
torchvision 0.7.0
tqdm 4.46.0
traitlets 4.3.3
urllib3 1.25.8
wcwidth 0.2.5
wheel 0.34.2

Linux 9daadec1cf6e 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

My Issue:

When I run the command sh tool/train.sh ade20k spnet50x, a RuntimeError is raised:

Traceback (most recent call last):
  File "tool/train.py", line 404, in <module>
    main()
  File "tool/train.py", line 81, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/workspace/SPNet/tool/train.py", line 219, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "/workspace/SPNet/tool/train.py", line 259, in train
    output, main_loss, aux_loss = model(input, target)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/workspace/SPNet/models/spnet.py", line 27, in forward
    _, _, c3, c4 = self.base_forward(x)
  File "/workspace/SPNet/models/base.py", line 60, in base_forward
    x = self.pretrained.conv1(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
    return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
  File "/opt/conda/lib/python3.7/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 39, in forward
    torch.distributed.all_gather(mean_l, mean, process_group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1185, in all_gather
    work = _default_pg.allgather([tensor_list], [tensor])
RuntimeError: All tensor operands to scatter/gather must have the same size

/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
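For context on where this fails: apex's SyncBatchnormFunction gathers the per-GPU channel means with torch.distributed.all_gather, which requires every rank to pass a tensor of identical shape. Below is a minimal, standalone sketch of that contract (the shapes and the gloo/CPU setup are made up for illustration; this is not code from SPNet or apex):

# Minimal sketch of the all_gather contract that the apex sync-BN kernel relies on.
# Shapes and the gloo/CPU setup are illustrative only; if any rank passed a `mean`
# tensor of a different size, all_gather would raise
# "All tensor operands to scatter/gather must have the same size".
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    channels = 64                                 # must be identical on every rank
    mean = torch.zeros(channels)                  # stand-in for per-GPU channel means
    mean_l = [torch.empty_like(mean) for _ in range(world_size)]
    dist.all_gather(mean_l, mean)                 # succeeds only when all shapes match
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, nprocs=2, args=(2,))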

I found few useful answers after googling; the full error log is attached as err.log. Could anyone give me some advice?
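For reference, here is a minimal per-rank shape check (a sketch only; the names input, target, and model come from the traceback above, and the exact insertion point in tool/train.py's train() is an assumption) that would show whether the per-GPU input shapes diverge before the sync-BN all_gather:

# Hypothetical diagnostic helper: print the shape of each tensor on this rank.
import torch.distributed as dist


def log_rank_shapes(tensors, names):
    rank = dist.get_rank() if dist.is_initialized() else 0
    for name, t in zip(names, tensors):
        print(f"[rank {rank}] {name}: {tuple(t.shape)}", flush=True)

# Intended usage inside train() in tool/train.py, just before the model call
# shown in the traceback (variable names taken from that traceback):
#     log_rank_shapes([input, target], ["input", "target"])
#     output, main_loss, aux_loss = model(input, target)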
