Recovering DDP scalability #617

RandomDefaultUser · 2024-11-29T12:29:27Z

I recently realized that using the current develop branch of MALA does not reproduce the DDP scaling results achieved upon implementation (#466). Looking into the code, the issue is with the change in how the validation loss is calculated, which occured in #560. #560 unifies the error calculation throughout MALA, which is in general very helpful, but there is one small caveat that I didn't realize at that time: it uses _forward_entire_snapshot to compute predictions on the validation snapshots. That function is (apparently) not as parallelizable through DDP as the direct LDOS validation loss calculation implemented previously. Of course it has the advantage that it gives predictions per snapshot and one therefore can access band energy, total energy, etc. as metrics during training.

These should not be used (at least for now) during DDP training anyway though. So this PR recovers the original DDP scaling behavior by defaulting back to the old validation loss routine if only the LDOS is tracked and using the new scheme everywhere else. I will further open an issue to eventually look into _forward_entire_snapshot to figure out why it does not work as well in DDP.

I will attach scaling results to confirm this is working now. As a side note, for a single GPU, the new route actually seems to be faster.

… something

RandomDefaultUser added 3 commits November 29, 2024 11:29

Reintroduced old validation loss calculation, let's see if this fixes…

8358e03

… something

Forgot a renaming

d3043e6

Refactored code internally

20a06f2

RandomDefaultUser merged commit 03f6b96 into mala-project:develop Nov 29, 2024
5 checks passed

RandomDefaultUser deleted the fix_ddp_validation branch November 29, 2024 12:50

RandomDefaultUser mentioned this pull request Nov 29, 2024

Look into scalability issue in _forward_entire_snapshot #618

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recovering DDP scalability #617

Recovering DDP scalability #617

RandomDefaultUser commented Nov 29, 2024

Recovering DDP scalability #617

Recovering DDP scalability #617

Conversation

RandomDefaultUser commented Nov 29, 2024