-
Hey @mikesol! Using the Sharding APIs + …
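(For concreteness, a minimal sketch of what the jax.sharding approach might look like on a single host; the axis name "data" and the toy shapes are illustrative, not from the original reply.)

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D mesh over every visible device (GPUs or local TPU cores).
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a batch along its leading dimension across the "data" axis.
batch = jnp.ones((32, 128))
sharded_batch = jax.device_put(batch, NamedSharding(mesh, P("data")))

# A jit-compiled computation keeps the output sharded the same way.
@jax.jit
def forward(x):
    return x * 2.0

print(forward(sharded_batch).sharding)
```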
-
Thanks! Would it make sense to add a brief note to that effect on the Parallel training page? Something that briefly describes the difference and shows a grid of which hardware supports which convention? For example, earlier today I deployed a model using a mesh on 8 GPUs and it worked just fine, but when I tried the same on a v2-32 TPU it didn't work. So I'm guessing that the …
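(If it helps, the usual explanation of that difference: an 8-GPU box is a single host, so one process sees all eight devices, while a v2-32 slice spans four hosts with eight cores each, so the same script has to run on every host and each process only sees its local cores. A rough sketch of how that shows up, assuming a Cloud TPU pod-slice environment:)

```python
import jax

# On a multi-host slice (e.g. v2-32) this script must be launched on
# every host; on a single 8-GPU machine it runs as one process.
jax.distributed.initialize()  # coordinates the processes; largely automatic on Cloud TPU

print("process", jax.process_index(), "of", jax.process_count())
print("local devices:", jax.local_device_count())   # 8 per host
print("global devices:", jax.device_count())        # 32 on a v2-32

# A mesh built from jax.devices() then spans all hosts, even though
# each process only physically holds its own 8 cores.
```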
-
Hi all!
I've been using TPUs for Flax training and it's been working quite nicely with pmap. I'm now switching to a multi-GPU rig and I'm wondering if the setup will work the same. I've had to manually specify: … which is fine, but I'm guessing that will only work on one core and then I'd need to launch more processes to get more parallelism?
Also, I've read about sharding and I'm not quite sure what the current recommendation is for using ensembling vs sharding. It seems like there's some conceptual overlap there as one can shard over the batch dimension?
Thanks in advance for any tips on how folks usually set these things up in a multi-GPU environment!
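(For the single-host multi-GPU case specifically, a minimal sketch of the usual pattern; the step function and shapes below are made up for illustration. One process sees all local GPUs, so pmap alone gives device-level data parallelism without launching extra processes; multiple processes only become necessary once you go multi-host.)

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()  # number of GPUs visible to this one process
print("local devices:", n)

# pmap replicates the function across all local devices; the leading
# axis of the input must equal the local device count.
@jax.pmap
def step(x):
    return jnp.mean(x ** 2)

batch = jnp.ones((n, 1024))
print(step(batch))  # one result per device
```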