Add checkpointing #997

Open
feldergast opened this issue Sep 28, 2023 · 5 comments
Labels
Enhancement in progress Major Feature A new feature that has broad impact on codebase and requires a minimum two week discussion period

Comments

@feldergast
Contributor

Add the ability to checkpoint and restart SST runs.

@bhpayne

bhpayne commented Sep 29, 2023

Use cases:

An HPC provider (a National Lab) limits HPC jobs to 24 hours. Therefore checkpointing would be useful to run longer jobs.

A separate motivation: if a job takes 24.5 hours (longer than expected), it is killed at the limit.
As a result, the researcher loses days of productivity, because the job had already spent that time waiting in the queue.

@gvoskuilen gvoskuilen added in progress Major Feature A new feature that has broad impact on codebase and requires a minimum two week discussion period labels Apr 4, 2024
@gvoskuilen
Contributor

Work is underway to make critical structures in SST-Core serializable in preparation for checkpoint generation. We've identified that many structures can be serialized without visible API changes. Handlers, however (e.g., clock/event handlers), will need a visible change in definition, as follows:

    handler_old = new Handler<Class>(this, &callback_function); // Current way
    handler_new = new Handler2<Class, &callback_function>(this); // New way

For backwards compatibility, the old definition would still work but components using them would not be checkpoint-able.
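To illustrate the difference between the two definitions, here is a minimal sketch. The thread does not say why the new form is checkpoint-able; one plausible reading, assumed here, is that moving the callback into the template arguments removes a raw code address from the handler's runtime state. All names below (HandlerBase, OldStyleHandler, NewStyleHandler, Counter, runHandlerDemo) are hypothetical stand-ins, not the SST-Core classes.

```cpp
#include <cassert>

// Hypothetical stand-in for a handler base class; not the SST-Core API.
struct HandlerBase {
    virtual ~HandlerBase() = default;
    virtual void operator()(int event) = 0;
};

// Old style: the member-function pointer is runtime data stored in the handler.
template <typename ClassT>
class OldStyleHandler : public HandlerBase {
public:
    using Callback = void (ClassT::*)(int);
    OldStyleHandler(ClassT* object, Callback callback)
        : object_(object), callback_(callback) {}
    void operator()(int event) override { (object_->*callback_)(event); }

private:
    ClassT* object_;
    Callback callback_;  // a code address, not meaningful in another process
};

// New style: the callback is part of the type, so the only state is the object.
template <typename ClassT, void (ClassT::*Callback)(int)>
class NewStyleHandler : public HandlerBase {
public:
    explicit NewStyleHandler(ClassT* object) : object_(object) {}
    void operator()(int event) override { (object_->*Callback)(event); }

private:
    ClassT* object_;
};

// Hypothetical component with one callback.
struct Counter {
    int total = 0;
    void onEvent(int value) { total += value; }
};

// Both styles dispatch to the same member function.
int runHandlerDemo() {
    Counter counter;
    OldStyleHandler<Counter> oldHandler(&counter, &Counter::onEvent);
    NewStyleHandler<Counter, &Counter::onEvent> newHandler(&counter);
    oldHandler(2);
    newHandler(3);
    assert(counter.total == 5);
    return counter.total;
}
```

In this sketch, a restart would only need to restore the object pointer of the new-style handler, since the callback is recovered from the type itself.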

Other implementation notes:

  • First pass will not support re-partitioning at restart, but this is a longer-term goal. Additional information needs to be kept or reconstructed during a checkpoint to enable it.
  • BaseComponent needs an API to allow elements to checkpoint themselves. This API will evolve as we extend checkpoint support to element libraries. Some libraries may not be checkpoint-able without significant changes.
  • Statistics infrastructure has not yet been evaluated.
  • Profile points are not easily checkpointed given their shared state, and cannot be checkpointed once we support re-partitioning. Instead of checkpointing, they would be regenerated on restart.
  • Checkpoint would be supported in SST's run loop. It would not be supported during construction/init/setup.

@feldergast
Contributor Author

A few more details on serialization changes needed for checkpointing:

Previously, "base" objects were serialized by implementing a template for the given type; the template was called "serialize". As a convenience, operator& was also overloaded as a template to allow simpler syntax in the serialize_order functions. To support pointer tracking, this structure has been changed somewhat. operator& still calls the serialize template, which now does all the pointer tracking, and the original serialize templates have been renamed to serialize_impl. There is a new function call on the serializer to turn pointer tracking on; once on, it keeps track of all pointers. The data is serialized with the first instance of a pointer, and every subsequent instance stores only a tag that refers to the first instance. On deserialization, the object is recreated at the first instance, and all other instances are simply given the pointer to the new object.
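The pointer-tracking scheme described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SST-Core serializer: it handles only int pointees, uses the old address as the tag, and the names Record, TrackingPacker, and TrackingUnpacker are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <vector>

// One entry in the toy checkpoint stream (hypothetical layout).
struct Record {
    bool firstInstance;  // true: value holds the data; false: back-reference only
    std::uintptr_t tag;  // old address, used purely as an identifier
    int value;           // meaningful only when firstInstance is true
};

class TrackingPacker {
public:
    // Writes the pointee the first time a pointer is seen, and only a tag afterwards.
    void pack(const int* ptr) {
        assert(ptr != nullptr);  // this sketch does not handle null pointers
        const auto tag = reinterpret_cast<std::uintptr_t>(ptr);
        const bool first = seen_.insert(tag).second;
        stream_.push_back(Record{first, tag, first ? *ptr : 0});
    }
    const std::vector<Record>& stream() const { return stream_; }

private:
    std::set<std::uintptr_t> seen_;
    std::vector<Record> stream_;
};

class TrackingUnpacker {
public:
    // Recreates the object at its first instance; later instances share it.
    int* unpack(const Record& record) {
        if (record.firstInstance) {
            owned_.push_back(std::make_unique<int>(record.value));
            restored_[record.tag] = owned_.back().get();
        }
        return restored_.at(record.tag);
    }

private:
    std::map<std::uintptr_t, int*> restored_;
    std::vector<std::unique_ptr<int>> owned_;  // keeps recreated objects alive
};
```

Packing the same pointer twice yields one data record and one tag-only record; unpacking both returns the same new address, so sharing is preserved across the round trip.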

Added a new template, operator| (the bitwise-or operator). It is used only for the very specific case of treating a non-pointer as a pointer, where the data is stored directly in an object (for example, in a map or set) but other objects hold pointers to that data. This is needed for the ComponentInfo objects of SubComponents: the ComponentInfo object is stored in the parent in a std::map<ComponentId_t, ComponentInfo>, and the SubComponent has a pointer to the data in its parent. A limitation of this function is that the non-pointer data must be serialized first.

Made a serialize_impl template instantiation that will handle non-polymorphic classes. This allows a non-polymorphic class to serialize with only a serialize_order function and no need to inherit from serializable.

We are considering adding an implementation of serialize_impl that will handle classes for which std::is_trivially_copyable is true. In this case, there would be no need for a serialize_order function; it would just use memcpy to serialize the object. We still need to evaluate whether is_trivially_copyable will return true for a class with a pointer as one of its data members. If it does, we won't be able to make this work, as the pointer would not be pointing to the correct data when deserialized.
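The open question can be checked directly with static_assert. In standard C++, a raw pointer member does not prevent a class from being trivially copyable, so the trait alone would not filter out such classes. The sketch below shows this, along with a memcpy round trip; PlainData, HoldsPointer, roundTrip, and pointerSurvivesBitwiseCopy are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

struct PlainData {
    int id;
    double weight;
};

struct HoldsPointer {
    int id;
    int* target;  // raw pointer member
};

// Both compile: a raw pointer member does not change the trait.
static_assert(std::is_trivially_copyable<PlainData>::value,
              "PlainData is trivially copyable");
static_assert(std::is_trivially_copyable<HoldsPointer>::value,
              "HoldsPointer is trivially copyable despite the pointer");

// Bitwise round trip through a byte buffer, as a memcpy-based serializer would do.
template <typename T>
T roundTrip(const T& source) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    unsigned char buffer[sizeof(T)];
    std::memcpy(buffer, &source, sizeof(T));
    T restored;
    std::memcpy(&restored, buffer, sizeof(T));
    return restored;
}

// Within one process the copied pointer still holds the old address.
bool pointerSurvivesBitwiseCopy() {
    int value = 42;
    HoldsPointer original{1, &value};
    HoldsPointer copy = roundTrip(original);
    assert(copy.id == 1);
    return copy.target == &value;
}
```

The copied pointer holds the original address verbatim. That is harmless within a single process, but after a restart the address would refer to memory from the previous run, which is the failure mode described above. A memcpy path would therefore likely need an extra opt-in or a check beyond is_trivially_copyable.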

@feldergast
Contributor Author

Update on TimeVortex checkpointing:

Ultimately, we plan to have events serialized with the Components they are targeting. This is to enable different pre- and post- checkpoint/restart partitioning. For the initial implementation, the TimeVortex will be serialized in-place and the Event::Handlers will be "fixed up" after restarting to point to the correct post-restart handler. This is done by having the Links report their handlers (tag and new pointer) to the Simulation_impl object so that it can exchange the old pointer (used as the tag in the checkpoint) with the new pointer.
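The fix-up step can be pictured as a lookup table keyed by the pre-checkpoint handler address. The sketch below is a simplified model under stated assumptions, not the Simulation_impl code: Handler, QueuedEvent, and HandlerFixupTable are hypothetical names, and events are kept in a plain vector rather than a TimeVortex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an event handler owned by a link.
struct Handler {
    int linkId;
};

// An event restored from the checkpoint, before and after fix-up.
struct QueuedEvent {
    std::uintptr_t handlerTag;  // pre-checkpoint handler address, used as a tag
    Handler* handler;           // null until fixed up after restart
};

class HandlerFixupTable {
public:
    // Each link reports its old tag together with its new handler pointer.
    void report(std::uintptr_t oldTag, Handler* newHandler) {
        assert(newHandler != nullptr);
        table_[oldTag] = newHandler;
    }

    // Swaps tags for live pointers; returns how many events stayed unresolved.
    std::size_t fixUp(std::vector<QueuedEvent>& events) const {
        std::size_t unresolved = 0;
        for (auto& event : events) {
            const auto it = table_.find(event.handlerTag);
            if (it == table_.end()) {
                ++unresolved;
                continue;
            }
            event.handler = it->second;
        }
        return unresolved;
    }

private:
    std::unordered_map<std::uintptr_t, Handler*> table_;
};
```

Returning the unresolved count is a design choice of this sketch: an event whose tag no link reported would otherwise be left with a null handler and no signal to the caller.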

@gvoskuilen
Contributor

Update on statistics checkpoint: Support for checkpointing statistics was merged in PR #1098
