Hi,

We are trying to reproduce BERT training on two GPUs based on NVIDIA/benchmarks/bert/implementations/pytorch. We have tried two different methods, but neither has worked. In short, we have two issues; please see our implementation and the error messages below.
Slurm Implementation (mainly for two-GPU training):

- `model = torch.nn.DataParallel(model)` with `world_size = 1` hangs forever.
- `torch.nn.parallel.DistributedDataParallel` cannot be initialised properly with the `NCCL`, `env://` method.

The log shows:

```
NCCL INFO Launch mode Parallel
RuntimeError: Caught RuntimeError in replica 1 on device 1
```
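For context, here is a minimal sketch of the kind of NCCL + `env://` initialisation we are attempting (our own simplified test, not the repo's code; the placeholder model and the `torchrun` launch are assumptions):

```python
# minimal_ddp_check.py -- simplified sketch, not the benchmark code.
# Launch (assumed): torchrun --nproc_per_node=2 minimal_ddp_check.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT, which init_method="env://" reads.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl", init_method="env://")

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(8, 1024, device=local_rank)
    model(x).sum().backward()  # triggers the NCCL all-reduce across ranks

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```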
Docker containers Implementation:

training_results_v2.0/NVIDIA/benchmarks/bert/implementations/pytorch/run_and_time.sh
Lines 123 to 129 in dae524b

With `GPU0, GPU1` we get:

```
numactl_args: 0-11,24-35 is not valid
```

(machine: 24 cores and 48 threads)
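As a quick sanity check we wrote the small diagnostic below (our own script, not part of the repo) to list the CPU IDs actually visible on the node, for comparison with the `0-11,24-35` range that ends up in `numactl_args`:

```python
# cpu_affinity_check.py -- our own diagnostic, not part of the benchmark repo.
# Prints the CPU IDs this process may run on, in numactl-style ranges.
import os


def compress(cpus):
    """Turn a sorted list of CPU IDs into ranges such as "0-11,24-35"."""
    ranges, start, prev = [], cpus[0], cpus[0]
    for c in cpus[1:]:
        if c == prev + 1:
            prev = c
            continue
        ranges.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = c
    ranges.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(ranges)


cpus = sorted(os.sched_getaffinity(0))  # Linux-only call
print(f"{len(cpus)} CPUs visible: {compress(cpus)}")
```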
Does anyone have any suggestions for this?
Many thanks!