NCCL Timeout Error during train_val_test_data_provider
I encountered an NCCL timeout error during the `train_val_test_data_provider` step while using 8 GPUs (2, 2, 2).
Why am I experiencing this issue?
Perhaps the NCCL timeout is related to a communication problem?
It seems that both too-large and too-small datasets can cause NCCL timeouts.
The code was stuck at:

```python
train_data_iterator, valid_data_iterator, test_data_iterator \
    = build_train_valid_test_data_iterators(
        train_valid_test_dataset_provider)
```
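My rough mental model (which may be wrong) is that most ranks sit waiting in a collective while one rank builds the dataset indices; if that takes longer than the collective timeout, the watchdog on the waiting ranks fires. Below is a standalone sketch of that failure mode, not Megatron code, with the timeout lowered to 10 seconds so it reproduces quickly:

```python
# Standalone sketch (not Megatron code, and the mechanism is my guess):
# one rank stalls before an allreduce, so the NCCL watchdog on the other
# rank fires once the collective timeout elapses. Timeout lowered to 10 s
# so the failure shows up quickly instead of after 30 minutes.
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    # On older PyTorch you may need NCCL_ASYNC_ERROR_HANDLING=1 for the
    # watchdog to actually abort the process on timeout.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
    torch.cuda.set_device(local_rank)

    if rank == 0:
        time.sleep(30)  # stand-in for slow dataset/index building on rank 0

    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # the waiting rank dies here with a watchdog timeout
    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=2 this_script.py
```

Launched on 2 GPUs, the non-sleeping rank aborts with a `Watchdog caught collective operation timeout` message like the one below.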
Errors:

```text
[E ProcessGroupNCCL.cpp:737] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:737] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801313 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=749, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800258 milliseconds before timing out.
Fatal Python error: Aborted
```
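For what it's worth, the `Timeout(ms)=1800000` in the log is the default 30-minute collective timeout, so one workaround I'm considering is raising it where the process group is initialized. A sketch using the standard `torch.distributed` API (I have not checked how Megatron-LM wires this through):

```python
# Sketch of a possible workaround (standard torch.distributed API; I have
# not verified how Megatron-LM exposes this): raise the collective timeout
# above the 30-minute default so slow dataset building no longer trips the
# watchdog.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is 30 min, the 1800000 ms in the log
)
```

That only helps if dataset building is legitimately slow; if this is a real communication problem, rerunning with `NCCL_DEBUG=INFO` set should show where the ranks stall.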
Thanks!

This discussion was converted from issue #1051 on September 04, 2024 18:17.