When training on my own dataset, an error occurs after changing num_classes to match my dataset's categories. Leaving it at the default also reports an error.
#42 · Open · hx358031364 opened this issue on Sep 13, 2021 · 0 comments
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Scheduled epochs: 310
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 948, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 782, in train_one_epoch
    output = model(input)
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 610, in forward
    self._sync_params()
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1048, in _sync_params
    authoritative_rank,
  File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 979, in _distributed_broadcast_coalesced
    self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
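
For anyone hitting the same assert: the ScatterGatherKernel "index out of bounds" failure almost always means a target label falls outside [0, num_classes), so the first thing to check is the labels themselves. Below is a minimal sanity-check sketch, assuming a standard PyTorch classification dataset that yields (input, target) pairs; `dataset` and `num_classes` here are hypothetical stand-ins, not identifiers from this repo's main.py.

```python
import torch
from torch.utils.data import DataLoader

def check_label_range(dataset, num_classes, batch_size=256):
    # dataset / num_classes are placeholders for whatever main.py builds.
    loader = DataLoader(dataset, batch_size=batch_size)
    for _, targets in loader:
        targets = torch.as_tensor(targets)
        # Any label < 0 or >= num_classes will trip the device-side assert
        # inside scatter/gather (e.g. one-hot encoding or the loss).
        bad = targets[(targets < 0) | (targets >= num_classes)]
        if bad.numel() > 0:
            raise ValueError(
                f"found out-of-range labels {sorted(bad.unique().tolist())}; "
                f"expected 0 <= label < {num_classes}"
            )
    print(f"all labels lie in [0, {num_classes})")
```

Note that device-side asserts are raised asynchronously, which is why the traceback points at DDP's _sync_params rather than the real culprit. Running one iteration on CPU, or launching with CUDA_LAUNCH_BLOCKING=1, makes the error surface at the actual call site (typically the loss or a one-hot/scatter op).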