File skipping based on log metadata statistics #1486

iiLaurens · 2023-06-22T12:42:37Z

iiLaurens
Jun 22, 2023

Going briefly over the Python code, I noticed that column filtering optimizations appear to be done by PyArrow's external Dataset class. Does this mean that the column statistics embedded in the delta log are ignored, since pyarrow has no awareness of the delta log?

If so, wouldn't it be an improvement if delta-rs handled some of the column filtering itself, to avoid unnecessary parquet file scans? I imagine this could improve performance in cloud environments when a table has many files.
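For context, here is a minimal sketch of what the statistics in question look like. In the Delta protocol, each `add` action in a commit file can carry a `stats` field, a JSON-encoded string with `numRecords`, `minValues`, `maxValues` and `nullCount`. The log line below is made up for illustration (the path, size, timestamp and numbers are assumptions), and `column_bounds` is a hypothetical helper, not a delta-rs function.

```python
import json

# Hypothetical `add` action, shaped like one line of a _delta_log/*.json commit.
# All values here are invented for illustration.
log_line = json.dumps({
    "add": {
        "path": "A=2/part-00004-example.parquet",
        "partitionValues": {"A": "2"},
        "size": 1024,
        "modificationTime": 1687437757000,
        "dataChange": True,
        # The stats field is itself a JSON-encoded string.
        "stats": json.dumps({
            "numRecords": 100,
            "minValues": {"B": 2048, "C": 3072, "D": 0},
            "maxValues": {"B": 8191, "C": 8191, "D": 9999},
            "nullCount": {"B": 0, "C": 0, "D": 0},
        }),
    }
})


def column_bounds(line):
    """Return {column: (min, max)} from an add action's stats, or {} if absent."""
    add = json.loads(line).get("add")
    if not add or not add.get("stats"):
        return {}
    stats = json.loads(add["stats"])
    mins = stats.get("minValues", {})
    maxs = stats.get("maxValues", {})
    # Keep only columns that have both a minimum and a maximum recorded.
    return {col: (mins[col], maxs[col]) for col in mins if col in maxs}


bounds = column_bounds(log_line)
print(bounds)
```

A reader of the log that knows these bounds can rule a file out for a filter without opening the parquet file at all, which is the optimization being asked about.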

Answered by iiLaurens


iiLaurens · 2023-06-22T13:33:25Z

iiLaurens
Jun 22, 2023
Author

Never mind, I now see that the partition expression for each file is enriched with the statistics known to delta-rs, here:

delta-rs/python/deltalake/table.py

Lines 493 to 502 in e5dd8e2

    fragments = [
        format.make_fragment(
            file,
            filesystem=filesystem,
            partition_expression=part_expression,
        )
        for file, part_expression in self._table.dataset_partitions(
            self.schema().to_pyarrow(), partitions
        )
    ]

I have a simple test table which I partitioned on column A and Z-ordered over columns B and C. This is what self._table.dataset_partitions returns for it:

[('A=2/part-00001-1ebb693e-4e30-4840-8896-daed52a3f0ff-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 2) and (D >= 0)) and (C >= 0)) and (B >= 0)) and (C <= 9999)) and (D <= 9999)) and (B <= 9999)) and is_valid(B)) and is_valid(C)) and is_valid(D))>),
 ('A=2/part-00002-1ebb693e-4e30-4840-8896-daed52a3f0ff-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 2) and (D >= 0)) and (B >= 0)) and (C >= 0)) and (B <= 9999)) and (D <= 9999)) and (C <= 9999)) and is_valid(B)) and is_valid(D)) and is_valid(C))>),
 ('A=2/part-00003-1ebb693e-4e30-4840-8896-daed52a3f0ff-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 2) and (B >= 0)) and (C >= 2048)) and (D >= 0)) and (D <= 9999)) and (B <= 9999)) and (C <= 8191)) and is_valid(C)) and is_valid(B)) and is_valid(D))>),
 ('A=2/part-00004-1ebb693e-4e30-4840-8896-daed52a3f0ff-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 2) and (D >= 0)) and (B >= 2048)) and (C >= 3072)) and (B <= 8191)) and (C <= 8191)) and (D <= 9999)) and is_valid(D)) and is_valid(B)) and is_valid(C))>),
 ('A=0/part-00001-c7095c51-23fb-49d7-b16e-0c5ce1624b94-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 0) and (D >= 0)) and (B >= 0)) and (C >= 0)) and (C <= 9999)) and (D <= 9999)) and (B <= 9999)) and is_valid(D)) and is_valid(C)) and is_valid(B))>),
 ('A=0/part-00002-c7095c51-23fb-49d7-b16e-0c5ce1624b94-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 0) and (B >= 0)) and (C >= 0)) and (D >= 0)) and (B <= 9999)) and (C <= 9999)) and (D <= 9999)) and is_valid(B)) and is_valid(D)) and is_valid(C))>),
 ('A=0/part-00003-c7095c51-23fb-49d7-b16e-0c5ce1624b94-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 0) and (B >= 0)) and (D >= 0)) and (C >= 2048)) and (C <= 8191)) and (D <= 9999)) and (B <= 9999)) and is_valid(B)) and is_valid(D)) and is_valid(C))>),
 ('A=0/part-00004-c7095c51-23fb-49d7-b16e-0c5ce1624b94-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 0) and (B >= 2048)) and (D >= 0)) and (C >= 3072)) and (C <= 8191)) and (D <= 9999)) and (B <= 8191)) and is_valid(D)) and is_valid(C)) and is_valid(B))>),
 ('A=1/part-00001-8dc8f013-78b5-41cb-8c7a-ef6d604d49c2-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 1) and (B >= 0)) and (C >= 0)) and (D >= 0)) and (C <= 9999)) and (D <= 9999)) and (B <= 9999)) and is_valid(D)) and is_valid(B)) and is_valid(C))>),
 ('A=1/part-00002-8dc8f013-78b5-41cb-8c7a-ef6d604d49c2-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 1) and (D >= 0)) and (B >= 0)) and (C >= 0)) and (B <= 9999)) and (D <= 9999)) and (C <= 9999)) and is_valid(C)) and is_valid(D)) and is_valid(B))>),
 ('A=1/part-00003-8dc8f013-78b5-41cb-8c7a-ef6d604d49c2-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 1) and (D >= 0)) and (C >= 2048)) and (B >= 0)) and (C <= 8191)) and (D <= 9999)) and (B <= 9999)) and is_valid(D)) and is_valid(B)) and is_valid(C))>),
 ('A=1/part-00004-8dc8f013-78b5-41cb-8c7a-ef6d604d49c2-c000.zstd.parquet',
  <pyarrow.compute.Expression ((((((((((A == 1) and (D >= 0)) and (C >= 3072)) and (B >= 2048)) and (C <= 8191)) and (D <= 9999)) and (B <= 8191)) and is_valid(B)) and is_valid(C)) and is_valid(D))>)]

So I conclude that the delta log statistics are embedded in the pyarrow dataset as per-file partition expressions, and pyarrow can use them to skip files when a filter is pushed down.
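To make the skipping concrete, here is a rough pure-Python sketch of the idea, using the min/max bounds from the four A=2 expressions above. It is only a stand-in for the reasoning, not pyarrow's actual expression simplification; `may_match` and `files_to_scan` are hypothetical helpers, and the file names are shortened.

```python
# Per-file (min, max) bounds copied from the A=2 partition expressions above.
FILE_BOUNDS = {
    "part-00001": {"B": (0, 9999), "C": (0, 9999), "D": (0, 9999)},
    "part-00002": {"B": (0, 9999), "C": (0, 9999), "D": (0, 9999)},
    "part-00003": {"B": (0, 9999), "C": (2048, 8191), "D": (0, 9999)},
    "part-00004": {"B": (2048, 8191), "C": (3072, 8191), "D": (0, 9999)},
}


def may_match(bounds, column, op, value):
    """Return False only when the bounds prove no row can satisfy the filter."""
    lo, hi = bounds[column]
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "==":
        return lo <= value <= hi
    raise ValueError(f"unsupported operator: {op}")


def files_to_scan(filters):
    """Keep the files whose bounds are compatible with every (column, op, value)."""
    return [
        name
        for name, bounds in FILE_BOUNDS.items()
        if all(may_match(bounds, *f) for f in filters)
    ]


# part-00004 has B >= 2048, so it cannot contain rows with B < 1000.
print(files_to_scan([("B", "<", 1000)]))
# part-00003 and part-00004 have C <= 8191, so both are skipped.
print(files_to_scan([("C", ">", 9000)]))
```

Note that only the Z-ordered columns B and C have narrowed ranges in some files; D spans 0 to 9999 everywhere, so a filter on D alone would not skip anything in this table.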

0 replies

roeap · 2023-06-22T13:33:43Z

roeap
Jun 22, 2023
Collaborator

Hi - in fact, on the Python side the actual skipping is done by pyarrow, but the file fragments are generated from the entries in the delta log. There are some improvements we are looking into to harmonize that between the Python and Rust sides as well.

1 reply

roeap Jun 22, 2023
Collaborator

You just figured it out :)
