I recently realized that the current `develop` branch of MALA does not reproduce the DDP scaling results achieved upon implementation (#466). Looking into the code, the cause is the change in how the validation loss is calculated, which occurred in #560. #560 unifies the error calculation throughout MALA, which is in general very helpful, but there is one small caveat that I didn't realize at the time: it uses `_forward_entire_snapshot` to compute predictions on the validation snapshots. That function is (apparently) not as parallelizable through DDP as the direct LDOS validation loss calculation implemented previously. Of course, it has the advantage that it gives predictions per snapshot, so one can access band energy, total energy, etc. as metrics during training.

These metrics should not be used (at least for now) during DDP training anyway, though. So this PR recovers the original DDP scaling behavior by falling back to the old validation loss routine if only the LDOS is tracked, and using the new scheme everywhere else. I will also open an issue to eventually look into `_forward_entire_snapshot` and figure out why it does not work as well in DDP.

I will attach scaling results to confirm this is working now. As a side note, for a single GPU, the new route actually seems to be faster.
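
For reviewers, the fallback logic described above can be sketched roughly as follows. This is only an illustration of the dispatch, not the actual MALA code: the function name `validate`, the metric key `"ldos"`, and the two callables standing in for the old and new validation routes are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def validate(
    tracked_metrics: Iterable[str],
    ldos_loss_route: Callable[[], float],
    per_snapshot_route: Callable[[List[str]], Dict[str, float]],
) -> Dict[str, float]:
    """Pick a validation route based on which metrics are tracked.

    All names here are hypothetical placeholders:
    - ldos_loss_route stands in for the old, direct LDOS validation loss.
    - per_snapshot_route stands in for the new scheme that predicts whole
      snapshots and can therefore report band energy, total energy, etc.
    """
    metrics = list(tracked_metrics)
    if set(metrics) == {"ldos"}:
        # Only the LDOS is tracked: use the old routine, which scales
        # well under DDP.
        return {"ldos": ldos_loss_route()}
    # Anything beyond the LDOS needs per-snapshot predictions.
    return per_snapshot_route(metrics)


if __name__ == "__main__":
    # Stand-in routes that return dummy values for demonstration only.
    old_route = lambda: 0.5
    new_route = lambda names: {name: 1.0 for name in names}

    print(validate(["ldos"], old_route, new_route))
    print(validate(["ldos", "band_energy"], old_route, new_route))
```

The point of the sketch is that the cheaper route is taken only when nothing but the LDOS is requested; any additional metric switches to the per-snapshot scheme.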