
Can we leverage Apache Arrow as a language-independent internal memory representation? #269

Open
npuichigo opened this issue Nov 8, 2024 · 4 comments

Comments

@npuichigo

No description provided.

@mthrok
Collaborator

mthrok commented Nov 8, 2024

Hi

I'm not sure what "language-independent" means here, but if you are using Arrow to represent a dataset, then you can pass an iterator to the source stage and it just works.
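For instance, a minimal sketch of that (assuming pyarrow, and SPDL's `PipelineBuilder` roughly as documented around this time — the builder method names are recalled, not verified, so check the current docs):

```python
# Sketch: feed an Arrow table to SPDL as a plain iterator.
# PipelineBuilder / add_source / pipe / add_sink / build are assumed to
# match SPDL's builder API; verify against the current documentation.
import pyarrow as pa
from spdl.pipeline import PipelineBuilder

table = pa.table({"path": ["a.wav", "b.wav"], "label": [0, 1]})

pipeline = (
    PipelineBuilder()
    .add_source(table.to_pylist())   # any iterable works as a source
    .pipe(lambda row: row["path"])   # downstream stages are just callables
    .add_sink(3)                     # small output buffer
    .build(num_threads=2)
)
```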

I'm looking into ways to leverage Arrow or Pandas as a dataset representation for a specific use case, but SPDL itself needs no change for that.

@npuichigo
Author

Thanks for your reply. I'm not sure what the central data representation of SPDL is — maybe the utilities provided by SPDL. I noticed the C++ code uses structures like Buffer, CudaStorage and Tensor, which is why I asked whether we could leverage zero-copy Arrow as a universal in-memory representation.

@mthrok
Collaborator

mthrok commented Nov 9, 2024

I see. The data structures in the C++ code serve a different purpose than DataFrames and the like.

The custom structures are just wrappers around the data structures of the underlying I/O processing library (FFmpeg). They were introduced to minimize data copies while keeping composition flexible (and also to stay independent of any particular deep learning framework like PyTorch or JAX, yet remain compatible with them). They are only used in the spdl.io module, and that module is completely independent from the spdl.dataloader module.

Data structures like Arrow are used to deal with multiple data points as a set (i.e., a dataset), and so far SPDL's Pipeline abstraction has no particular relationship with such formats. As I mentioned previously, SPDL's Pipeline only cares whether the input is an (async) iterator, and the rest of the stages are just callables. If you convert an Arrow Table, a Pandas DataFrame, an SQLite database, or anything else to an iterable, that's all SPDL needs in order to understand it. For example:
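A minimal sketch of such conversions, assuming only pyarrow and pandas (the helper names are illustrative, not part of SPDL):

```python
# Sketch: turn tabular containers into plain iterables -- the only contract
# an iterator-driven source stage needs. Helper names are illustrative.
import pandas as pd
import pyarrow as pa

def iter_arrow(table: pa.Table):
    # Walk record batches, then rows, rather than materializing the table.
    for batch in table.to_batches():
        yield from batch.to_pylist()

def iter_pandas(df: pd.DataFrame):
    # itertuples() yields one lightweight named tuple per row.
    yield from df.itertuples(index=False)
```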

Now, we are discussing adding a high-level API that looks more like PyTorch's DataLoader class, built on top of the existing APIs (imagine something like torchvision's ImageNet, but a dataloader rather than a dataset). That's where the choice of dataset representation comes in, and there will be pros and cons to each option. So far I don't have a strong opinion (though I have a slight preference for SQLite3 because it won't require any additional dependency). If you have particular technical reasons why Arrow is preferable, I'm interested to hear them.
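For what it's worth, the SQLite3 option can be as small as a generator over a query, with no dependency beyond the standard library (the database file and table below are made-up names):

```python
# Sketch: an SQLite-backed source is just a generator over a query.
# "metadata.db" and the "samples" table are hypothetical names.
import sqlite3

def iter_sqlite(path: str = "metadata.db"):
    conn = sqlite3.connect(path)
    try:
        # The cursor is itself an iterator of row tuples.
        yield from conn.execute("SELECT path, label FROM samples")
    finally:
        conn.close()
```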

@npuichigo
Author

As a high-level API, Arrow works well for an IterableDataset, since it makes shuffling in distributed training work well. With an iterator like the one SPDL uses, shuffling can only be done with a sufficiently large buffer, but that alone is not enough; the data source itself also needs to be shuffled so it can be dispatched to different workers/GPUs. Arrow provides a good abstraction over the data source, supporting zero-copy concatenation and slicing along with memory mapping. Users can logically shard the data source into any number of pieces and make shuffling work well for streaming or iterable input. For example:
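A minimal sketch of that pattern with plain pyarrow (the file name and shard count are made up):

```python
# Sketch: memory-map an Arrow IPC file and carve zero-copy shards from it.
import pyarrow as pa
import pyarrow.ipc

source = pa.memory_map("train.arrow", "r")    # hypothetical file
table = pa.ipc.open_file(source).read_all()   # lazily paged in, not copied

num_shards = 4                                # e.g. one shard per worker/GPU
shard_len = len(table) // num_shards
shards = [
    table.slice(i * shard_len, shard_len)     # Table.slice is zero-copy
    for i in range(num_shards)
]
```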

Here's an example of how Hugging Face datasets shards an Arrow-backed dataset:
https://github.com/huggingface/datasets/blob/01f91bae037c98f2e05456287bab21470adb8f07/src/datasets/arrow_dataset.py#L5202-L5213

Since SPDL works on the abstraction of iterables, I think there are some similarities here.
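In that spirit, a sketch of the user-facing side (the dataset name is an arbitrary example):

```python
# Sketch: Dataset.shard() slices the underlying Arrow table without copying,
# and each shard remains a plain iterable for a worker to consume.
from datasets import load_dataset

ds = load_dataset("imdb", split="train")                  # arbitrary dataset
shard = ds.shard(num_shards=8, index=0, contiguous=True)  # e.g. worker 0 of 8
for example in shard:
    ...  # feed into an iterator-based pipeline
```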
