Is it possible to store checkpoints in an external storage such as S3? #359

Open
hr0nix opened this issue Jun 18, 2023 · 4 comments

Comments

@hr0nix

hr0nix commented Jun 18, 2023

I wasn't able to find the answers to my questions in the docs, so I'll just ask here:

  • What storage types other than local filesystem are supported with orbax? For instance, can I use S3?
  • Is it possible to add my own storage type somehow?

Thanks!

@cpgaffney1
Collaborator

We support a Google-internal distributed file system as well as Google Cloud Storage. I don't know whether you would run into any issues with S3, but you could give it a try.

Depending on what issues you encounter, if any, implementing your own TypeHandlers and AggregateHandler would probably be the best approach to customize serialization / deserialization logic if you need to. See here: https://orbax.readthedocs.io/en/latest/api_reference/checkpoint.html. Once implemented, you just register the handlers to start using them.

@hr0nix
Author

hr0nix commented Jun 28, 2023

One way to support a large number of various filesystems would be to use fsspec for reading/writing weight files. Is that something the orbax/jax team might consider?
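
To illustrate the idea, here is a minimal sketch of what fsspec-based reading and writing looks like in general. It is not Orbax code: the `write_blob` / `read_blob` helpers and the JSON payload are made up for the example, and it uses fsspec's built-in in-memory backend so it runs without credentials. An `s3://` URL would go through the same calls, assuming the `s3fs` package is installed.

```python
import json

import fsspec


def write_blob(url: str, payload: dict) -> None:
    """Write a JSON payload to any URL that fsspec can resolve."""
    # fsspec picks the backend from the URL scheme (file://, memory://, s3://, ...).
    with fsspec.open(url, "w") as f:
        json.dump(payload, f)


def read_blob(url: str) -> dict:
    """Read the JSON payload back through the same abstraction."""
    with fsspec.open(url, "r") as f:
        return json.load(f)


# Hypothetical checkpoint location on the in-memory filesystem.
url = "memory://checkpoints/step_100/weights.json"
write_blob(url, {"w": [0.1, 0.2], "b": [0.0]})
restored = read_blob(url)
```

The calling code never names the storage backend, which is what would let one code path cover many filesystems.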

@andylolu2

A temporary workaround is to save to a temp directory and then copy the saved content to the remote file system, though this wouldn't work so easily with the checkpoint manager (e.g., keeping only the last n checkpoints).
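
A rough sketch of that workaround, using only the standard library. Everything here is an assumption made for illustration: `save_then_copy` is a hypothetical helper, `save_fn` stands in for whatever call writes the checkpoint locally, and a plain directory plays the role of the remote store (a real S3 upload would need its own client). Retention is done by hand because the checkpoint manager never sees the destination.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def save_then_copy(
    save_fn: Callable[[Path], None],
    dest_root: Path,
    step: int,
    keep_last_n: int = 3,
) -> Path:
    """Save locally, copy to dest_root/<step>, then prune older steps."""
    assert keep_last_n >= 1
    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp) / str(step)
        local_dir.mkdir()
        save_fn(local_dir)  # e.g. a wrapper around your checkpointer's save call
        target = dest_root / str(step)
        # Stand-in for the upload step; copytree creates missing parent directories.
        shutil.copytree(local_dir, target)

    # Manual "keep the last n" logic on the destination side.
    steps = sorted(int(p.name) for p in dest_root.iterdir() if p.name.isdigit())
    for old in steps[:-keep_last_n]:
        shutil.rmtree(dest_root / str(old))
    return target


# Demo: five saves with keep_last_n=2 leave only the two newest steps.
with tempfile.TemporaryDirectory() as scratch:
    remote = Path(scratch) / "remote"
    for step in range(1, 6):
        save_then_copy(
            lambda d, s=step: (d / "weights.txt").write_text(str(s)),
            remote,
            step,
            keep_last_n=2,
        )
    remaining = sorted(p.name for p in remote.iterdir())
```

Note that the copy itself is not atomic, so a reader could observe a partially copied step directory.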

@cpgaffney1
Collaborator

There's a recent change to offer better support for this problem. Previously, S3 would not work correctly because it does not support atomic rename, but alternative atomicity logic can now be configured using checkpoint/orbax/checkpoint/path/atomicity.py.
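
For readers unfamiliar with the problem, one common way to get atomic-looking saves without rename is a commit marker: write all the data first, then write a small marker file last, and treat any directory without the marker as incomplete. The sketch below shows that general pattern on a local directory. It is not the code in atomicity.py, and the marker name and helper functions are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional

COMMIT_MARKER = "COMMITTED"  # hypothetical marker name, not taken from Orbax


def write_checkpoint(ckpt_dir: Path, files: Dict[str, bytes]) -> None:
    """Write all data files, then the marker as the final step."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    for name, data in files.items():
        (ckpt_dir / name).write_bytes(data)
    # Written last: its presence signals that every file above is complete.
    (ckpt_dir / COMMIT_MARKER).write_bytes(b"")


def is_committed(ckpt_dir: Path) -> bool:
    return (ckpt_dir / COMMIT_MARKER).exists()


def latest_committed_step(root: Path) -> Optional[int]:
    """Return the highest step whose directory carries the marker, if any."""
    steps = [
        int(p.name)
        for p in root.iterdir()
        if p.name.isdigit() and is_committed(p)
    ]
    return max(steps, default=None)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_checkpoint(root / "1", {"weights.bin": b"\x00\x01"})
    # Simulate a crash mid-save: data is present but the marker is missing.
    (root / "2").mkdir()
    (root / "2" / "weights.bin").write_bytes(b"\x00")
    latest = latest_committed_step(root)
```

Step 2 is ignored on restore because it was never marked as committed, which is the property that atomic rename would otherwise provide.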
