Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovering DDP scalability #617

Merged

Conversation

RandomDefaultUser
Copy link
Member

I recently realized that using the current develop branch of MALA does not reproduce the DDP scaling results achieved upon implementation (#466). Looking into the code, the issue is with the change in how the validation loss is calculated, which occured in #560. #560 unifies the error calculation throughout MALA, which is in general very helpful, but there is one small caveat that I didn't realize at that time: it uses _forward_entire_snapshot to compute predictions on the validation snapshots. That function is (apparently) not as parallelizable through DDP as the direct LDOS validation loss calculation implemented previously. Of course it has the advantage that it gives predictions per snapshot and one therefore can access band energy, total energy, etc. as metrics during training.

These should not be used (at least for now) during DDP training anyway though. So this PR recovers the original DDP scaling behavior by defaulting back to the old validation loss routine if only the LDOS is tracked and using the new scheme everywhere else. I will further open an issue to eventually look into _forward_entire_snapshot to figure out why it does not work as well in DDP.

I will attach scaling results to confirm this is working now. As a side note, for a single GPU, the new route actually seems to be faster.

problem01
problem02

@RandomDefaultUser RandomDefaultUser merged commit 03f6b96 into mala-project:develop Nov 29, 2024
5 checks passed
@RandomDefaultUser RandomDefaultUser deleted the fix_ddp_validation branch November 29, 2024 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant