This is not a problem with the current experimental design we have been discussing; nonetheless, I'd like to raise the following question:
We currently rely on Pandas for holding data, which is not efficient for extremely large datasets (i.e., those larger than available memory). Is it within our scope to additionally support other backends that would allow for easier distribution / parallelization?
I know we can scale on disk as long as the individual partitions fit in memory, but this is not computationally efficient, hence opening the discussion.
Alternatives would be, for example, Dask or Ray, possibly behind Modin (though that would grow our dependency tree); see the sketch below.
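For concreteness, here is a minimal sketch of what the two routes would look like. The file names (`data-0.csv`, `data-*.csv`) and the column name (`value`) are hypothetical placeholders, not anything from our codebase:

```python
# Option 1: Modin -- a drop-in replacement for the Pandas API,
# backed by Ray or Dask. In principle only the import changes.
import modin.pandas as pd  # instead of: import pandas as pd

df = pd.read_csv("data-0.csv")     # same API surface as Pandas
print(df.groupby("value").size())  # executed on the Ray/Dask backend

# Option 2: Dask DataFrame -- lazy and explicitly partitioned,
# but not 100% API-compatible with Pandas.
import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")        # one partition per matched file
result = ddf.groupby("value").size()   # builds a task graph, no work yet
print(result.compute())                # triggers parallel, out-of-core execution
```

The trade-off in a nutshell: Modin keeps our code unchanged but pulls in a heavier dependency tree (Modin plus an engine), while Dask is a lighter direct dependency but would require auditing our Pandas usage for API gaps.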