NCCL Timeout Error during train_val_test_data_provider
I encountered an NCCL timeout error during the `train_val_test_data_provider` step while using 8 GPUs (2, 2, 2).
Why am I experiencing this issue?
Perhaps the NCCL timeout is related to a communication problem?
It seems that both too-large and too-small datasets can cause NCCL timeouts.
The code was stuck at:

```python
train_data_iterator, valid_data_iterator, test_data_iterator \
    = build_train_valid_test_data_iterators(
        train_valid_test_dataset_provider)
```
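My rough mental model (which may be wrong) is that most ranks sit waiting in a collective while one rank builds the dataset indices; if that takes longer than the collective timeout, the watchdog on the waiting ranks fires. Below is a standalone sketch of that failure mode, not Megatron code, with the timeout lowered to 10 seconds so it reproduces quickly:

```python
# Standalone sketch (not Megatron code, and the mechanism is my guess):
# one rank stalls before an allreduce, so the NCCL watchdog on the other
# rank fires once the collective timeout elapses. Timeout lowered to 10 s
# so the failure shows up quickly instead of after 30 minutes.
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    # On older PyTorch you may need NCCL_ASYNC_ERROR_HANDLING=1 for the
    # watchdog to actually abort the process on timeout.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
    torch.cuda.set_device(local_rank)

    if rank == 0:
        time.sleep(30)  # stand-in for slow dataset/index building on rank 0

    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # the waiting rank dies here with a watchdog timeout
    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=2 this_script.py
```

Launched on 2 GPUs, the non-sleeping rank aborts with a `Watchdog caught collective operation timeout` message like the one below.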
Errors:

```text
[E ProcessGroupNCCL.cpp:737] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:737] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801313 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
Fatal Python error: Aborted
```
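For what it's worth, the `Timeout(ms)=1800000` in the log is the default 30-minute collective timeout, so one workaround I'm considering is raising it where the process group is initialized. A sketch using the standard `torch.distributed` API (I have not checked how Megatron-LM wires this through):

```python
# Sketch of a possible workaround (standard torch.distributed API; I have
# not verified how Megatron-LM exposes this): raise the collective timeout
# above the 30-minute default so slow dataset building no longer trips the
# watchdog.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is 30 min, the 1800000 ms in the log
)
```

That only helps if dataset building is legitimately slow; if this is a real communication problem, rerunning with `NCCL_DEBUG=INFO` set should show where the ranks stall.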
Thanks!

This discussion was converted from issue #1051 on September 04, 2024 18:17.