I have a huge table that does not fit into memory and I need to upsert into it. Only the data for the last 10 days needs to be revised, yet delta-rs seems to load the entire table into memory whenever merge() is called (#2573). Is that intended? Can merge ever be a streaming operation? I understand that a Parquet file cannot be modified in place (it's compressed, so changing a record means writing an entirely new file), but even with that in mind I don't see why the whole dataset has to be loaded into memory for a merge.
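To ground the question, here is roughly what I mean by a constrained upsert: a minimal sketch against the deltalake Python bindings. The id/event_date/value columns and the table path are placeholders for my schema, and the predicate is meant to limit the merge to the last 10 days (even so, per #2573 the whole target still seems to get materialized):

```python
from datetime import date, timedelta

import pyarrow as pa
from deltalake import DeltaTable

cutoff = date.today() - timedelta(days=10)

# The last-10-days batch to upsert (placeholder schema).
source = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "event_date": pa.array([date.today()] * 3),
    "value": pa.array([10.0, 20.0, 30.0]),
})

dt = DeltaTable("path/to/table")

(
    dt.merge(
        source=source,
        # Constrain the join to recent data; ideally only the files
        # overlapping this range would ever be touched or loaded.
        predicate=f"t.id = s.id AND t.event_date >= '{cutoff.isoformat()}'",
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```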
Also, a more general point about the rust engine, which I believe merge() runs on: it does not seem to write anything in streaming mode at all (#1984). The pyarrow engine can be used as a workaround, since it does stream writes correctly, but features such as append with schema evolution or merge are only available on the rust engine, which means they don't work with data larger than memory.
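To illustrate the split between the two engines, here is a hedged sketch of the workaround, assuming a deltalake 0.x release where write_deltalake still takes an engine argument (and schema_mode for schema evolution on the rust engine); the paths and batch size are placeholders:

```python
import pyarrow.dataset as ds
from deltalake import write_deltalake

# Stream record batches from a larger-than-memory source on disk.
reader = (
    ds.dataset("path/to/source_parquet", format="parquet")
    .scanner(batch_size=64_000)
    .to_reader()
)

# pyarrow engine: consumes the reader batch by batch, so memory stays bounded,
# but it offers neither schema evolution nor merge.
write_deltalake("path/to/table", reader, mode="append", engine="pyarrow")

# rust engine: required for schema evolution (schema_mode="merge"), but per
# #1984 it does not appear to write in a streaming fashion.
# write_deltalake("path/to/table", reader, mode="append",
#                 engine="rust", schema_mode="merge")
```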
Lastly, file skipping: it doesn't seem to work as the docs claim, either internally (as in the case of merge) or externally (as in the case of Polars, where pl.scan_delta() simply calls to_pyarrow_dataset(), which at best pushes predicates down to the PyArrow Dataset and therefore never uses Delta's file-level column stats, like those exposed by get_add_actions(), to skip Parquet files).
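For completeness, this is the kind of file skipping I would expect to happen under the hood; a sketch that does it by hand with the flattened Add-action stats. The event_date column and the exact "min.event_date"/"max.event_date" stat names are assumptions about my table's layout:

```python
from datetime import date, timedelta

import pyarrow.dataset as ds
from deltalake import DeltaTable

cutoff = date.today() - timedelta(days=10)
table_uri = "path/to/table"

dt = DeltaTable(table_uri)

# Flattened Add actions expose per-file stats as columns like "min.<col>" / "max.<col>".
actions = dt.get_add_actions(flatten=True).to_pandas()

# Keep only the parquet files whose stats can contain rows from the last 10 days.
recent = actions[actions["max.event_date"] >= cutoff]
file_uris = [f"{table_uri}/{path}" for path in recent["path"]]

# Read just those files instead of scanning the whole table.
subset = ds.dataset(file_uris, format="parquet").to_table(
    filter=ds.field("event_date") >= cutoff
)
```

This is essentially what I assumed merge (or pl.scan_delta) would do with the Add-action stats, rather than something the caller has to reimplement.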
Given the not-really-usable merge, the under-supported file skipping, and a non-streaming rust engine, delta-rs feels like it's in an odd state where, in practice, it cannot write or merge larger-than-memory data; append works fine, but only thanks to the pyarrow engine. Can anyone please shed some light on this?