
Can we leverage Apache Arrow as a language-independent internal memory representation? #269

Open
npuichigo opened this issue Nov 8, 2024 · 4 comments

Comments

@npuichigo

No description provided.

@mthrok
Collaborator

mthrok commented Nov 8, 2024

Hi

I'm not sure what "language-independent" means here, but if you are using Arrow to represent a dataset, then you can pass an iterator to the source stage and it just works.
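For instance, a minimal sketch of that (assuming pyarrow, and SPDL's `PipelineBuilder` roughly as documented around this time — the builder method names are recalled, not verified, so check the current docs):

```python
# Sketch: feed an Arrow table to SPDL as a plain iterator.
# PipelineBuilder / add_source / pipe / add_sink / build are assumed to
# match SPDL's builder API; verify against the current documentation.
import pyarrow as pa
from spdl.pipeline import PipelineBuilder

table = pa.table({"path": ["a.wav", "b.wav"], "label": [0, 1]})

pipeline = (
    PipelineBuilder()
    .add_source(table.to_pylist())   # any iterable works as a source
    .pipe(lambda row: row["path"])   # downstream stages are just callables
    .add_sink(3)                     # small output buffer
    .build(num_threads=2)
)
```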

I'm looking into ways to leverage Arrow or Pandas as a dataset representation for a specific use case, but SPDL itself needs no change for that.

@npuichigo
Author

Thanks for your reply. I'm not sure what the central data representation of SPDL is — maybe the utilities provided by SPDL. I noticed the C++ code uses structures like Buffer, CudaStorage and Tensor, which is why I asked whether we could leverage zero-copy Arrow as a universal in-memory representation.

@mthrok
Collaborator

mthrok commented Nov 9, 2024

I see. The data structures in the C++ code serve a different purpose than DataFrames and the like.

The custom structures are just wrappers around the data structures of the underlying I/O processing library (FFmpeg). They were introduced to minimize data copies while keeping composition flexible (and also to stay independent of any particular deep learning framework like PyTorch or JAX, yet remain compatible with them). They are only used in the spdl.io module, and that module is completely independent from the spdl.dataloader module.

Data structures like Arrow are used to deal with multiple data points as a set (i.e., a dataset), and so far SPDL's Pipeline abstraction has no particular relationship with such formats. As I mentioned previously, SPDL's Pipeline only cares whether the input is an (async) iterator, and the rest of the stages are just callables. If you convert an Arrow Table, a Pandas DataFrame, an SQLite database, or anything else to an iterable, that's all SPDL needs in order to understand it. For example:
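A minimal sketch of such conversions, assuming only pyarrow and pandas (the helper names are illustrative, not part of SPDL):

```python
# Sketch: turn tabular containers into plain iterables -- the only contract
# an iterator-driven source stage needs. Helper names are illustrative.
import pandas as pd
import pyarrow as pa

def iter_arrow(table: pa.Table):
    # Walk record batches, then rows, rather than materializing the table.
    for batch in table.to_batches():
        yield from batch.to_pylist()

def iter_pandas(df: pd.DataFrame):
    # itertuples() yields one lightweight named tuple per row.
    yield from df.itertuples(index=False)
```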

Now, we are discussing adding a high-level API that looks more like PyTorch's DataLoader class, built on top of the existing APIs (imagine something like torchvision's ImageNet, but a dataloader rather than a dataset). That's where the choice of dataset representation comes in, and there will be pros and cons to each option. So far I don't have a strong opinion (though I have a slight preference for SQLite3 because it won't require any additional dependency). If you have particular technical reasons why Arrow is preferable, I'm interested to hear them.
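For what it's worth, the SQLite3 option can be as small as a generator over a query, with no dependency beyond the standard library (the database file and table below are made-up names):

```python
# Sketch: an SQLite-backed source is just a generator over a query.
# "metadata.db" and the "samples" table are hypothetical names.
import sqlite3

def iter_sqlite(path: str = "metadata.db"):
    conn = sqlite3.connect(path)
    try:
        # The cursor is itself an iterator of row tuples.
        yield from conn.execute("SELECT path, label FROM samples")
    finally:
        conn.close()
```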

@npuichigo
Author

As a high-level API, Arrow works well for an IterableDataset, since it makes shuffling in distributed training work well. With an iterator like the one SPDL uses, shuffling can only be done with a sufficiently large buffer, but that alone is not enough; the data source itself also needs to be shuffled so it can be dispatched to different workers/GPUs. Arrow provides a good abstraction over the data source, supporting zero-copy concatenation and slicing along with memory mapping. Users can logically shard the data source into any number of pieces and make shuffling work well for streaming or iterable input. For example:
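A minimal sketch of that pattern with plain pyarrow (the file name and shard count are made up):

```python
# Sketch: memory-map an Arrow IPC file and carve zero-copy shards from it.
import pyarrow as pa
import pyarrow.ipc

source = pa.memory_map("train.arrow", "r")    # hypothetical file
table = pa.ipc.open_file(source).read_all()   # lazily paged in, not copied

num_shards = 4                                # e.g. one shard per worker/GPU
shard_len = len(table) // num_shards
shards = [
    table.slice(i * shard_len, shard_len)     # Table.slice is zero-copy
    for i in range(num_shards)
]
```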

Here's an example of how Hugging Face datasets shards an Arrow-backed dataset:
https://github.com/huggingface/datasets/blob/01f91bae037c98f2e05456287bab21470adb8f07/src/datasets/arrow_dataset.py#L5202-L5213

Since SPDL works on the abstraction of iterables, I think there are some similarities here.
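In that spirit, a sketch of the user-facing side (the dataset name is an arbitrary example):

```python
# Sketch: Dataset.shard() slices the underlying Arrow table without copying,
# and each shard remains a plain iterable for a worker to consume.
from datasets import load_dataset

ds = load_dataset("imdb", split="train")                  # arbitrary dataset
shard = ds.shard(num_shards=8, index=0, contiguous=True)  # e.g. worker 0 of 8
for example in shard:
    ...  # feed into an iterator-based pipeline
```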
