Can we leverage Apache Arrow as a language-independent internal memory representation? #269
Comments
Hi, I'm not sure what "language-independent" means here, but if you are using Arrow to represent a dataset, you can pass an iterator to the source stage and it just works. I'm looking into ways to leverage Arrow or Pandas as a dataset representation for a specific use case, but SPDL itself needs no change for that.
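A minimal sketch of that point, assuming the `PipelineBuilder` API from `spdl.dataloader` (the method names below reflect one reading of the current API and may need adjusting to your SPDL version). The Arrow table's rows become a plain Python iterable, which is all the source stage requires:

```python
# Sketch only: feeding rows of a PyArrow table into an SPDL pipeline.
# The PipelineBuilder method names are an assumption about the current
# SPDL API; adjust the import and calls to match your version.
import pyarrow as pa
from spdl.dataloader import PipelineBuilder

# An Arrow table acting as the dataset.
table = pa.table({"path": ["a.jpg", "b.jpg"], "label": [0, 1]})

def process(row: dict):
    # Placeholder for real work (e.g. decoding row["path"]).
    return row["path"], row["label"]

pipeline = (
    PipelineBuilder()
    .add_source(table.to_pylist())   # any (async) iterable works here
    .pipe(process, concurrency=4)
    .add_sink(buffer_size=3)
    .build(num_threads=4)
)

with pipeline.auto_stop():
    for item in pipeline:
        print(item)
```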
Thanks for your reply. I'm not sure what the central representation used by SPDL is; maybe it's in the utilities SPDL provides. I found that the C++ code uses its own internal data structures.
I see. The data structure in the C++ code serves a different purpose than a Dataframe and the like. The custom structure is just a wrapper around the data structures of the underlying IO processing library (FFmpeg). It's introduced to minimize data copies while keeping composition flexible (and also to be independent of any particular deep learning framework like PyTorch or JAX, yet compatible with them). These structures are only used in the spdl.io module, and that entire module is completely independent from the spdl.dataloader module.

A data structure like Arrow is used to deal with multiple data points as a set (i.e. a dataset), and so far SPDL's Pipeline abstraction has no particular relationship with any such format, because, as I mentioned previously, SPDL's Pipeline only cares whether the input is an (async) iterator; the rest of the stages are just callables. If you convert Arrow, a Pandas Dataframe, a SQLite database, or anything else to an iterable, that's how SPDL understands the assignment (a sketch below illustrates the SQLite case).

Now, we are discussing adding a high-level API that looks more like the DataLoader class from PyTorch, built on top of the existing APIs. (Imagine something like the ImageNet dataset from torchvision, but a dataloader rather than a dataset.) That's where the choice of dataset representation comes in, and there will be pros and cons for each option. So far I don't have a strong opinion (though I have a slight preference for SQLite3 because it doesn't require any additional dependency). If you have particular technical reasons why Arrow is preferred, I'm interested to hear them.
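For illustration, a hypothetical sketch of the "anything iterable" point for the SQLite case; the table and column names here are made up:

```python
# Sketch only: streaming rows out of a SQLite database as a plain
# Python generator. The "samples" table and its columns are hypothetical.
import sqlite3
from typing import Iterator

def iter_samples(db_path: str) -> Iterator[tuple]:
    con = sqlite3.connect(db_path)
    try:
        # sqlite3 cursors are themselves iterable, row by row.
        yield from con.execute("SELECT path, label FROM samples")
    finally:
        con.close()

# The resulting generator can be handed directly to the Pipeline's
# source stage, with no SPDL-specific conversion code.
```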
As a high-level API, Arrow works well for this use case. Hugging Face datasets is an example of using shared, memory-mapped Arrow. Since SPDL works on the abstraction of iterables, I think there are some similarities here.
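A hedged sketch of that similarity, assuming the standard Hugging Face `datasets` API: a loaded dataset is backed by memory-mapped Arrow files on disk and is itself iterable, so it already satisfies the shape an SPDL source expects:

```python
# Sketch assuming the Hugging Face `datasets` API: the loaded dataset
# is backed by memory-mapped Arrow files, and iterating it yields
# dict-like rows, which is the shape an SPDL source can consume.
from datasets import load_dataset

ds = load_dataset("beans", split="train")  # Arrow-backed, memory-mapped

for row in ds:           # plain Python iteration over Arrow data
    print(row.keys())    # dict-like row; column names depend on the dataset
    break
```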