Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: upgrade to datafusion 43 #2886

Merged
merged 14 commits into from
Nov 21, 2024
Merged

Conversation

ion-elgreco
Copy link
Collaborator

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Sep 14, 2024
@rtyler
Copy link
Member

rtyler commented Sep 16, 2024

We need Datafusion to upgrade as well before we can do this. I have some patches to make the latest datafusion work but we're blocked on them to adopt the newest arrow and kernel

@rtyler rtyler self-assigned this Sep 16, 2024
@rtyler rtyler marked this pull request as draft September 16, 2024 16:24
@Xuanwo
Copy link
Contributor

Xuanwo commented Sep 18, 2024

Hi, datafusion has been released. Let's rock!

@alamb
Copy link

alamb commented Sep 20, 2024

Thank you @ion-elgreco -- let me know if there is anything I can to do help with this PR (e.g. I could work on updating the code to use non deprecated APIs in the core for parquet statistics...)

@ion-elgreco
Copy link
Collaborator Author

Hey @alamb, I am not working on this atm, but @rtyler has picked it up from here

@rtyler rtyler changed the title chore: bump kernel chore: upgrade to datafusion 42 Sep 20, 2024
@rtyler rtyler force-pushed the chore/bump_kernel branch 2 times, most recently from 8db59c2 to b325e27 Compare September 20, 2024 12:35
@ion-elgreco
Copy link
Collaborator Author

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

I had a quick glance at the Changelog but can't really pinpoint where this change may have occurred, do you have any ideas?

DFSchema { inner: 
Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
new schema: 
DFSchema { inner: Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },

@alamb
Copy link

alamb commented Sep 20, 2024

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

It sounds somewhat similar to a discussion happening here about nullability (but I think field name is similar in terms of how it is treated): apache/datafusion#11989 (comment)

Update; filed apache/datafusion#12560 to track the regression more

@rtyler
Copy link
Member

rtyler commented Sep 20, 2024

@alamb thanks for following up! I feel a little sheepish 🐑 insofar that my internal CI showed these errors weeks ago in my integration testing between datafusion main with delta-rs main, I just didn't have the time to triage them like @ion-elgreco has 😄

I've subscribed to the issue you created and can help testing if necessary! Depending on my availability this weekend I can try fixing it in datafusion too 🤔

@matthewmturner
Copy link
Contributor

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

@ion-elgreco
Copy link
Collaborator Author

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

It's not yet released though

@rtyler
Copy link
Member

rtyler commented Oct 7, 2024

I can confirm that the latest datafusion main passes tests in CI 🥳

@rtyler rtyler changed the title chore: upgrade to datafusion 42 chore: upgrade to datafusion 43 Oct 8, 2024
@alamb
Copy link

alamb commented Oct 8, 2024

fix

I filed apache/datafusion#12813 to consider releasing a patch version of DataFusion

@rtyler
Copy link
Member

rtyler commented Nov 17, 2024

@timsaucer I believe the error is introduced further up the optimization rules stack FWIW.

Going into optimization

Projection: target.id, target.temp, CASE __delta_rs_update_predicate WHEN Boolean(true) THEN CAST(make_parquet_array(Int64(100)) AS List(Field { name: "element", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, met
adata: {} })) ELSE target.items END AS items, __delta_rs_update_predicate

After invoking simplify_expressions

 DEBUG datafusion_optimizer::utils                                 > simplify_expressions:
Projection: target.id, target.temp, CASE __delta_rs_update_predicate WHEN Boolean(true) THEN List([100]) ELSE target.items END AS items, __delta_rs_update_predicate
  MetricObserver id=update_source_count
    Projection: target.id, target.temp, target.items, target.id = Int32(1) OR target.id != Int32(1) AND Boolean(NULL) AS __delta_rs_update_predicate
      TableScan: target

SimplifyExpressions appears to erase the defined field names which only causes problems later in OptimizeProjections.

I might not be qualified to fix this sticky of a bug 😆

@timsaucer
Copy link
Contributor

FYI this does correct the error in the unit test test_update_with_array. I need to add unit tests in DataFusion still.
apache/datafusion#13468

@alamb
Copy link

alamb commented Nov 20, 2024

We plan to make another DataFusion patch release to unblock this upgrade:

Today the make_array function from Datafusion uses "item" as the list
element's field name. With recent changes in delta-kernel-rs we have
switched to calling it "element" which is more conventional related to
how Apache Parquet handles things

This change introduces a test which helps isolate the behavior seen in
Python tests within the core crate for easier regression testing

Signed-off-by: R. Tyler Croy <[email protected]>
@rtyler rtyler disabled auto-merge November 21, 2024 00:52
@rtyler rtyler merged commit 25c6c6f into delta-io:main Nov 21, 2024
24 checks passed
@alamb
Copy link

alamb commented Nov 21, 2024

its-happening

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Schema metadata descriptions cause write failures in deltalake >= 0.19.0
8 participants