
Reproduce bert on 1 node with 2 GPUs #9

Open · 3 of 10 tasks

xihajun opened this issue Nov 23, 2022 · 0 comments

xihajun commented Nov 23, 2022

Hi,

This issue documents our attempt to reproduce BERT training on two GPUs based on NVIDIA/benchmarks/bert/implementations/pytorch.

We have tried two different methods, but neither has worked.

  • Submit the job via Slurm (one node: the controller and compute node share the same machine)
  • Run in two containers (gpu0: master node, gpu1: compute node); a minimal env:// sketch for this setup follows this list
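
For context, here is a minimal sketch of what the env:// rendezvous expects in the two-container case; the hostname, port, and rank values are placeholders (assumptions), not our real settings:

    # Minimal sketch, assuming the gpu0 container is reachable from gpu1 at MASTER_ADDR
    # and each container is given exactly one GPU.
    import os
    import torch
    import torch.distributed as dist

    os.environ["MASTER_ADDR"] = "gpu0"    # hostname/IP of the master container (placeholder)
    os.environ["MASTER_PORT"] = "29500"   # any free TCP port reachable from both containers
    os.environ["WORLD_SIZE"] = "2"        # one process per GPU
    os.environ["RANK"] = "0"              # 0 in the gpu0 container, 1 in the gpu1 container

    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(0)              # each container sees its single GPU as device 0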

In short, we have two issues:

  • the two GPUs do not work together on one node under Slurm
  • the model doesn't seem to learn (maybe a data issue?)

Please see our implementations below, along with the error messages.

Slurm Implementation (mainly for two-GPU training):

  • ✅ One GPU works fine, but the loss is not dropping (see chart below)
  • ❌ Tried adding model = torch.nn.DataParallel(model) with world_size = 1, but it hangs forever
  • ❌ torch.nn.parallel.DistributedDataParallel cannot be initialised properly with the NCCL backend and the env:// init method (see the sketch after this list)
  • Error after NCCL INFO Launch mode Parallel
  • Error messages for training on two GPUs: RuntimeError: Caught RuntimeError in replica 1 on device 1
  • ❌ Tried to srun run_and_time.sh directly:

    if [[ -n "${SLURM_LOCALID-}" ]] && [[ "${SLURM_NTASKS}" -gt "${SLURM_JOB_NUM_NODES}" ]]; then
        # Mode 1: Slurm launched a task for each GPU and set some envvars
        CMD=( './bind.sh' '--cpu=exclusive' '--ib=single' '--cluster=${cluster}' '--' ${NSYSCMD} 'python' '-u' )
    else
        # docker or single gpu, no need to bind
        CMD=( ${NSYSCMD} 'python' '-u' )
    fi
    • Fixed the IB device by replacing the cluster name with our own and the device names with GPU0, GPU1
    • ./bind.sh doesn't work well; it fails with numactl_args: 0-11,24-35 is not valid (our machine has 24 cores and 48 threads)
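
For the Slurm path, here is a minimal sketch (not the NVIDIA launcher) of how we would expect the NCCL/env:// initialisation to be wired from the Slurm environment, using DistributedDataParallel rather than DataParallel; build_model() and the port are placeholders:

    # Sketch only: map Slurm variables onto what init_method="env://" reads,
    # then wrap the model in DistributedDataParallel.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    rank = int(os.environ["SLURM_PROCID"])         # global rank: one srun task per GPU
    local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index on this node (0 or 1)
    world_size = int(os.environ["SLURM_NTASKS"])   # 2 for one node with two GPUs

    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # single node, so localhost is enough
    os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = build_model()  # placeholder for the actual BERT model construction
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

For SLURM_LOCALID to be 0 and 1, srun would need to launch two tasks on the node (e.g. --ntasks-per-node=2), which is also what makes the SLURM_NTASKS > SLURM_JOB_NUM_NODES branch of run_and_time.sh fire.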

Docker containers Implementation:

  • ✅ Solved the communication problem
  • ❌ However, the loss doesn't drop (see the sanity-check sketch below)
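
One way to separate a data problem from a setup problem would be to check whether the model can overfit a single fixed batch on one GPU; if the loss does not fall even there, the data/masking or optimiser wiring is the more likely culprit. This is only a sketch: the .loss attribute and the batch format are assumptions and would need adapting to the actual MLPerf BERT model:

    # Sanity-check sketch: can the model drive the loss down on ONE fixed batch?
    # Assumes `model(**batch)` returns an object with a .loss attribute and
    # `batch` is a dict of tensors; adapt to the MLPerf BERT interfaces.
    import torch

    def overfit_one_batch(model, batch, steps=200, lr=1e-4):
        model.cuda().train()
        batch = {k: v.cuda() for k, v in batch.items()}
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for step in range(steps):
            opt.zero_grad()
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            if step % 20 == 0:
                print(f"step {step:4d}  loss {loss.item():.4f}")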

Does anyone have any suggestions for this?

Many thanks!
