-
Going briefly over the Python code, I noticed that column filtering optimization appears to be done by PyArrow's external Dataset class. Does this mean that the column statistics embedded in the delta log are ignored, since pyarrow has no awareness of the delta log? If so, wouldn't it be an improvement if some column filtering could be handled by delta-rs to avoid unnecessary parquet file scans? I imagine this could improve performance in cloud environments when a table has many files.
Replies: 2 comments 1 reply
-
Never mind, I now see that the partition expression for each file is enriched with the statistics known to delta-rs, here: delta-rs/python/deltalake/table.py, lines 493 to 502 in e5dd8e2.
I have a simple test table that I partitioned on column A and Z-ordered over columns B and C, and I looked at what self._table.dataset_partitions returns for it.
So I conclude that the delta log statistics are embedded in the pyarrow dataset, and that pyarrow can skip files by applying predicate pushdown to these column statistics.
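To make the idea concrete, here is a minimal sketch of min/max file skipping in plain Python. It is not the delta-rs or pyarrow implementation. It only assumes that each "add" entry in the delta log carries a JSON "stats" string with "minValues" and "maxValues", as the Delta protocol describes. The file paths, the numbers, and the helper names may_contain and files_to_scan are made up for illustration.

```python
import json

# Hypothetical "add" actions shaped like delta log entries.
# Paths and statistics are invented for this example.
ADD_ACTIONS = [
    {
        "path": "A=1/part-0000.parquet",
        "partitionValues": {"A": "1"},
        "stats": json.dumps({
            "numRecords": 100,
            "minValues": {"B": 0, "C": 0},
            "maxValues": {"B": 9, "C": 49},
        }),
    },
    {
        "path": "A=1/part-0001.parquet",
        "partitionValues": {"A": "1"},
        "stats": json.dumps({
            "numRecords": 100,
            "minValues": {"B": 10, "C": 0},
            "maxValues": {"B": 19, "C": 49},
        }),
    },
    {
        "path": "A=1/part-0002.parquet",
        "partitionValues": {"A": "1"},
        "stats": json.dumps({
            "numRecords": 100,
            "minValues": {"B": 20, "C": 50},
            "maxValues": {"B": 29, "C": 99},
        }),
    },
    # An entry without statistics can never be skipped safely.
    {"path": "A=1/part-0003.parquet", "partitionValues": {"A": "1"}},
]


def may_contain(action, column, lo, hi):
    """Return False only if the stats prove no row has lo <= column <= hi."""
    raw = action.get("stats")
    if raw is None:
        return True  # no statistics: we have to scan the file
    stats = json.loads(raw)
    min_v = stats.get("minValues", {}).get(column)
    max_v = stats.get("maxValues", {}).get(column)
    if min_v is None or max_v is None:
        return True  # column not covered by the statistics
    # The file's [min_v, max_v] range must overlap the requested [lo, hi].
    return not (max_v < lo or min_v > hi)


def files_to_scan(actions, column, lo, hi):
    """List the paths of files that might hold matching rows."""
    return [a["path"] for a in actions if may_contain(a, column, lo, hi)]


# Filter 12 <= B <= 15: only the second file's range overlaps, and the
# file without statistics is kept because nothing rules it out.
print(files_to_scan(ADD_ACTIONS, "B", 12, 15))
```

The important property is that skipping is conservative: a file is dropped only when its statistics rule out a match, so missing statistics cost some extra scans but never produce wrong results.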
-
Hi, indeed: on the Python side the actual skipping is done by pyarrow, but the file fragments are generated from the entries in the delta log. There are some improvements we are looking into to harmonize this between the Python and Rust sides as well.