
Reproduce bert on 1 node with 2 GPUs #9

Open · 3 of 10 tasks

xihajun opened this issue Nov 23, 2022 · 0 comments

xihajun commented Nov 23, 2022

Hi,

This issue documents our attempt to reproduce BERT training on two GPUs based on NVIDIA/benchmarks/bert/implementations/pytorch.

We have tried two different methods, but neither has worked.

  • Submit the job via Slurm (one node: the controller and compute node share the same machine)
  • Run in two containers (gpu0: master node, gpu1: compute node); a minimal env:// sketch for this setup follows this list
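
For context, here is a minimal sketch of what the env:// rendezvous expects in the two-container case; the hostname, port, and rank values are placeholders (assumptions), not our real settings:

    # Minimal sketch, assuming the gpu0 container is reachable from gpu1 at MASTER_ADDR
    # and each container is given exactly one GPU.
    import os
    import torch
    import torch.distributed as dist

    os.environ["MASTER_ADDR"] = "gpu0"    # hostname/IP of the master container (placeholder)
    os.environ["MASTER_PORT"] = "29500"   # any free TCP port reachable from both containers
    os.environ["WORLD_SIZE"] = "2"        # one process per GPU
    os.environ["RANK"] = "0"              # 0 in the gpu0 container, 1 in the gpu1 container

    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(0)              # each container sees its single GPU as device 0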

In short, we have two issues:

  • the two GPUs do not work together on one node under Slurm
  • the model doesn't seem to learn (maybe a data issue?)

Please see our implementations below, along with the error messages.

Slurm Implementation (mainly for two-GPU training):

  • ✅ One GPU works fine, but the loss is not dropping (see chart below)
  • ❌ Tried adding model = torch.nn.DataParallel(model) with world_size = 1, but it hangs forever
  • ❌ torch.nn.parallel.DistributedDataParallel cannot be initialised properly with the NCCL backend and the env:// init method (see the sketch after this list)
  • Error after NCCL INFO Launch mode Parallel
  • Error messages for training on two GPUs: RuntimeError: Caught RuntimeError in replica 1 on device 1
  • ❌ Tried to srun run_and_time.sh directly:

    if [[ -n "${SLURM_LOCALID-}" ]] && [[ "${SLURM_NTASKS}" -gt "${SLURM_JOB_NUM_NODES}" ]]; then
        # Mode 1: Slurm launched a task for each GPU and set some envvars
        CMD=( './bind.sh' '--cpu=exclusive' '--ib=single' '--cluster=${cluster}' '--' ${NSYSCMD} 'python' '-u' )
    else
        # docker or single gpu, no need to bind
        CMD=( ${NSYSCMD} 'python' '-u' )
    fi
    • Fixed the IB device by replacing the cluster name with our own and the device names with GPU0, GPU1
    • ./bind.sh doesn't work well; it fails with numactl_args: 0-11,24-35 is not valid (our machine has 24 cores and 48 threads)
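
For the Slurm path, here is a minimal sketch (not the NVIDIA launcher) of how we would expect the NCCL/env:// initialisation to be wired from the Slurm environment, using DistributedDataParallel rather than DataParallel; build_model() and the port are placeholders:

    # Sketch only: map Slurm variables onto what init_method="env://" reads,
    # then wrap the model in DistributedDataParallel.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    rank = int(os.environ["SLURM_PROCID"])         # global rank: one srun task per GPU
    local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index on this node (0 or 1)
    world_size = int(os.environ["SLURM_NTASKS"])   # 2 for one node with two GPUs

    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # single node, so localhost is enough
    os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = build_model()  # placeholder for the actual BERT model construction
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

For SLURM_LOCALID to be 0 and 1, srun would need to launch two tasks on the node (e.g. --ntasks-per-node=2), which is also what makes the SLURM_NTASKS > SLURM_JOB_NUM_NODES branch of run_and_time.sh fire.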

Docker containers Implementation:

  • ✅ Solved the communication problem
  • ❌ However, the loss doesn't drop (see the sanity-check sketch below)
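
One way to separate a data problem from a setup problem would be to check whether the model can overfit a single fixed batch on one GPU; if the loss does not fall even there, the data/masking or optimiser wiring is the more likely culprit. This is only a sketch: the .loss attribute and the batch format are assumptions and would need adapting to the actual MLPerf BERT model:

    # Sanity-check sketch: can the model drive the loss down on ONE fixed batch?
    # Assumes `model(**batch)` returns an object with a .loss attribute and
    # `batch` is a dict of tensors; adapt to the MLPerf BERT interfaces.
    import torch

    def overfit_one_batch(model, batch, steps=200, lr=1e-4):
        model.cuda().train()
        batch = {k: v.cuda() for k, v in batch.items()}
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for step in range(steps):
            opt.zero_grad()
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            if step % 20 == 0:
                print(f"step {step:4d}  loss {loss.item():.4f}")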

Does anyone have any suggestions for this?

Many thanks!
