I have a huge table that does not fit into memory and I need to upsert into it. Only the data for the last 10 days needs to be revised, yet delta-rs seems to load the entire table into memory whenever merge() is called (#2573). Is that intended? Can merge ever be a streaming operation? I understand that a Parquet file cannot be modified in place (it's compressed, so changing a record means writing an entirely new file), but even with that in mind I don't see why the whole dataset has to be loaded into memory for a merge.
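To ground the question, here is roughly what I mean by a constrained upsert: a minimal sketch against the deltalake Python bindings. The id/event_date/value columns and the table path are placeholders for my schema, and the predicate is meant to limit the merge to the last 10 days (even so, per #2573 the whole target still seems to get materialized):

```python
from datetime import date, timedelta

import pyarrow as pa
from deltalake import DeltaTable

cutoff = date.today() - timedelta(days=10)

# The last-10-days batch to upsert (placeholder schema).
source = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "event_date": pa.array([date.today()] * 3),
    "value": pa.array([10.0, 20.0, 30.0]),
})

dt = DeltaTable("path/to/table")

(
    dt.merge(
        source=source,
        # Constrain the join to recent data; ideally only the files
        # overlapping this range would ever be touched or loaded.
        predicate=f"t.id = s.id AND t.event_date >= '{cutoff.isoformat()}'",
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```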
Also, a more general point about the rust engine, which I believe merge() runs on: it does not seem to write anything in streaming mode at all (#1984). The pyarrow engine can be used as a workaround, since it does stream writes correctly, but features such as append with schema evolution or merge are only available on the rust engine, which means they don't work with data larger than memory.
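To illustrate the split between the two engines, here is a hedged sketch of the workaround, assuming a deltalake 0.x release where write_deltalake still takes an engine argument (and schema_mode for schema evolution on the rust engine); the paths and batch size are placeholders:

```python
import pyarrow.dataset as ds
from deltalake import write_deltalake

# Stream record batches from a larger-than-memory source on disk.
reader = (
    ds.dataset("path/to/source_parquet", format="parquet")
    .scanner(batch_size=64_000)
    .to_reader()
)

# pyarrow engine: consumes the reader batch by batch, so memory stays bounded,
# but it offers neither schema evolution nor merge.
write_deltalake("path/to/table", reader, mode="append", engine="pyarrow")

# rust engine: required for schema evolution (schema_mode="merge"), but per
# #1984 it does not appear to write in a streaming fashion.
# write_deltalake("path/to/table", reader, mode="append",
#                 engine="rust", schema_mode="merge")
```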
Lastly, file skipping: it doesn't seem to work as the docs claim, either internally (as in the case of merge) or externally (as in the case of Polars, where pl.scan_delta() simply calls to_pyarrow_dataset(), which at best pushes predicates down to the PyArrow Dataset and therefore never uses Delta's file-level column stats, like those exposed by get_add_actions(), to skip Parquet files).
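For completeness, this is the kind of file skipping I would expect to happen under the hood; a sketch that does it by hand with the flattened Add-action stats. The event_date column and the exact "min.event_date"/"max.event_date" stat names are assumptions about my table's layout:

```python
from datetime import date, timedelta

import pyarrow.dataset as ds
from deltalake import DeltaTable

cutoff = date.today() - timedelta(days=10)
table_uri = "path/to/table"

dt = DeltaTable(table_uri)

# Flattened Add actions expose per-file stats as columns like "min.<col>" / "max.<col>".
actions = dt.get_add_actions(flatten=True).to_pandas()

# Keep only the parquet files whose stats can contain rows from the last 10 days.
recent = actions[actions["max.event_date"] >= cutoff]
file_uris = [f"{table_uri}/{path}" for path in recent["path"]]

# Read just those files instead of scanning the whole table.
subset = ds.dataset(file_uris, format="parquet").to_table(
    filter=ds.field("event_date") >= cutoff
)
```

This is essentially what I assumed merge (or pl.scan_delta) would do with the Add-action stats, rather than something the caller has to reimplement.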
Given the not-really-usable merge, the under-supported file skipping, and a non-streaming rust engine, delta-rs feels like it's in an odd state where, in practice, it cannot write or merge larger-than-memory data; append works fine, but only thanks to the pyarrow engine. Can anyone please shed some light on this?