
[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525

Open
alamb opened this issue Nov 22, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Nov 22, 2024

Is your feature request related to a problem or challenge?

I recently watched the Biting the Bullet: Rebuilding GlareDB from the Ground Up (Sean Smith) video from the CMU database series by @scsmithr. I found it very informative (thanks for the talk)

My high level summary of the talk is "we are replacing DataFusion with our own engine, rewritten from the ground up". It sounds like GlareDB is still using DataFusion, as the new engine is not yet at feature parity.

From my perspective, it seems that GlareDB's decision is the result of their judgement of the relative benefits vs estimated costs of:

  1. Developing a new engine by themselves (and the associated testing, maintenance, etc)
  2. Working to integrate DataFusion / work with the community to help make it better

Of course, not every organization / project will have the same tradeoffs, but several of the challenges that @scsmithr described in the talk are, I think, worth discussing so we can improve them.

Here is my summary of those challenges:

Effort required / Regressions during upgrades

The general sentiment was that it took a long time to upgrade to DataFusion versions. This is something I have heard from others (@paveltiunov for example from Cube). We also experience this at InfluxData

He also mentioned they had hit issues where queries that used to work no longer did after upgrading.

Possible Improvements

Dependency management

GlareDB uses Lance and delta-rs, which both use DataFusion, and currently this requires the version of DataFusion used by GlareDB to match the versions those crates use (he specifically cited delta-io/delta-rs#2886, which just finally merged)

For what it is worth, @matthewmturner and I hit the same thing in dft (see datafusion-contrib/datafusion-dft#150)

Possible Improvements

WASM support

I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied

Thanks to the work from @jonmmease, @waynexia and others, DataFusion absolutely can be compiled to WASM (check out this cool example from @XiangpengHao: https://parquet-viewer.haoxp.xyz/), but maybe it needs to be better documented / explained in a blog

  • Blog about how to compile DataFusion for WASM??

Implicit assumptions in LogicalPlans

Another thing that was mentioned was that writing custom optimizer rules was challenging because there were implicit assumptions (e.g. that column names were unique)

Possible Improvements

SQL Features

Repeated column names came up as an example feature that was missing in DataFusion.

select * from (values (1), (2)) v(a, a)

Possible Improvements

Distributed Planning and Execution

Planning:

The talk mentioned that the DataFusion planner was linear in the way it resolved references, while the GlareDB system needed to resolve references from a remote catalog. Their solution seems to have been to fork the DataFusion SQL planner and make it async.

Another approach, taken by SessionContext::sql, is:

  1. Do an initial pass through the parse tree to find all references (non-async)
  2. Then fetch all references (can be async)
  3. Then do the planning (non-async) with all the relevant references

I don't think this is particularly well documented

Possible Improvements

  • Add an example of implementing a remote / async catalog (see the sketch below)
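
To make that pattern concrete, here is a minimal sketch of the three-phase flow. It leans on SessionState::sql_to_statement / resolve_table_references / statement_to_plan from recent DataFusion versions (exact signatures may differ between releases), and fetch_table is a hypothetical remote-catalog call:

```rust
use std::sync::Arc;

use datafusion::catalog::TableProvider;
use datafusion::common::{Result, TableReference};
use datafusion::prelude::SessionContext;

/// Hypothetical async lookup against a remote catalog
async fn fetch_table(_reference: &TableReference) -> Result<Arc<dyn TableProvider>> {
    unimplemented!("fetch the table definition over the network")
}

async fn plan_with_remote_catalog(ctx: &SessionContext, sql: &str) -> Result<()> {
    let state = ctx.state();

    // 1. Parse and collect all table references without doing any I/O
    let statement = state.sql_to_statement(sql, "generic")?;
    let references = state.resolve_table_references(&statement)?;

    // 2. Fetch each referenced table from the remote catalog (async)
    for reference in references {
        let provider = fetch_table(&reference).await?;
        let _prev = ctx.register_table(reference, provider)?;
    }

    // 3. Plan with every reference already resolved locally (non-async)
    let plan = state.statement_to_plan(statement).await?;
    println!("{}", plan.display_indent());
    Ok(())
}
```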

Execution:

The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the ExecutionPlan, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in

Another potential way to achieve a different number of threads per execution node is to do the distribution at the LogicalPlan level and then run the physical planner on each sub-part of the LogicalPlan.

Possible Improvements

  • Maybe we could document that better / show an example of how to split / run a distributed plan (see the sketch below)
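
As a hedged sketch of the LogicalPlan-level idea above (these are standard DataFusion building blocks, not an established distribution API): ship the LogicalPlan fragment to each node and only run the physical planner there, so the partition count reflects that node's hardware:

```rust
use std::sync::Arc;

use datafusion::common::Result;
use datafusion::logical_expr::LogicalPlan;
use datafusion::physical_plan::ExecutionPlan;
use datafusion::prelude::{SessionConfig, SessionContext};

/// Plan a LogicalPlan fragment locally so parallelism reflects this node
async fn plan_fragment_for_node(
    fragment: LogicalPlan,
    threads_on_this_node: usize,
) -> Result<Arc<dyn ExecutionPlan>> {
    // Each executor builds its own session with its own partition count,
    // instead of receiving an ExecutionPlan with parallelism baked in
    let config = SessionConfig::new().with_target_partitions(threads_on_this_node);
    let ctx = SessionContext::new_with_config(config);
    let physical = ctx.state().create_physical_plan(&fragment).await?;
    // The physical plan's partitioning now reflects this node's configuration
    println!(
        "planned with up to {} partitions",
        physical.properties().output_partitioning().partition_count()
    );
    Ok(physical)
}
```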

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Nov 22, 2024
@alamb alamb changed the title [DISCUSSION] Making it easier (lessons from GlareDB) [DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) Nov 22, 2024
@scsmithr
Contributor

Thanks for opening up this issue. I'll follow up with a longer-form comment describing more of what we were facing and ultimately why we switched off.

But I want to leave a quick comment on how this feedback was delivered. I think the talk could easily be seen as a bit toxic/inflammatory, and that's something I could've handled better. It should have been less derisive and more constructive. Running a startup and trying to hit tight deadlines is incredibly draining, and I think the frustration/exhaustion just boiled over a bit.

I think DataFusion is a great project with the unique goal of building a composable engine. It's pretty rare to see in this space and the approach taken here has been pretty innovative.

@alamb
Contributor Author

alamb commented Nov 22, 2024

But I want to leave a quick comment on how this feedback was delivered. I think the talk could easily be seen as a bit toxic/inflammatory, and that's something I could've handled better. It should have been less derisive and more constructive. Running a startup and trying to hit tight deadlines is incredibly draining, and I think the frustration/exhaustion just boiled over a bit.

Thank you for saying that.

I didn't think the tone of your talk was derisive or toxic or inflammatory. I think you brought up several very valid points, and I very much appreciate the need to innovate quickly in a startup and the near-term slowdown that results in working with a larger existing codebase / community.

While I probably would make different tradeoffs (like spending time on this post trying to make DataFusion better 😉), I totally understand the right tradeoffs are different for different teams.

I wish you nothing but the best at GlareDB. And of course, should you ever change your mind, or perhaps DataFusion has improved in some areas, we'll be waiting to welcome you back to the community.

@matthewmturner
Contributor

@scsmithr In your follow up I think it would be useful to know if there is anything the DataFusion project could do better that would have made you more likely to contribute the necessary improvements back as opposed to starting from scratch.

@findepi
Member

findepi commented Nov 22, 2024

Implicit assumptions in LogicalPlans

Another thing that was mentioned was that writing custom optimizer rules was challenging because there were implicit assumptions (e.g. that column names were unique)

There is more to it. We touched on this in the last community call.
In SQL, the output column names don't have to be unique, so we should lift this limitation for the output (#13476, #6543). Properly fixing this requires changing how plans are constructed and represented (#13489 (comment), #12723)

@findepi
Member

findepi commented Nov 22, 2024

Thank you @alamb for the list, this is great! I know GlareDB's perspective is important, and it feels like a "lost customer deal" for an OSS project. As with all deals, it's equally important to learn from those lost and also from those retained.

For example, my perspective from working on SDF is

having said the above, DataFusion is a GREAT library to build on.

@scsmithr
Contributor

Choosing DataFusion

We chose DataFusion as the base of our product around June 2022 and kept up with it until earlier this year.

When deciding, we looked at:

  • Community: DataFusion seemed to have quite a few contributors, and that didn't look like it was slowing down. And seeing the investment from Influx was a confidence builder.
  • Performance: At the time, DataFusion was already pretty fast, and again, it just felt like it'd be getting better over time.
  • Extensibility: The big thing for us; it was really cool to see that almost every part could be extended and customized.

The main alternative we looked at was DuckDB. But we opted against it since it's written in C++ (not a bad thing, just that I was much more familiar with Rust and was focused on building a Rust team) and it seemed a bit less extensible than DataFusion for our needs.

And as I mentioned in the talk, choosing DataFusion got us to a POC very quickly and let us build out the rest of the features we wanted.

Core challenges

Dependencies

Huge challenge for us. As a startup, we wanted to try to bring in libraries instead of trying to write our own. And since we were billing ourselves as a database to query across data sources, bringing in libraries like delta-rs and lance with already written TableProviders for those formats seemed like a no-brainer.

However, trying to keep all of these in sync with the right version of DataFusion (and Arrow) was incredibly challenging and time consuming. Sometimes there were pretty major blockers for either of those to upgrade. Some folks on our team helped out with a few upgrades (e.g. delta-io/delta-rs#2249), but some upgrades ended up taking a fair bit of time away from building the features we needed.

Did we have to upgrade? No, but we chose DataFusion not for where it is today, but for where it'll be in the future. And delaying the upgrades meant we would just be stacking up incompatibilities and missing out on fixes/features that were going in.

This issue only started to crop up when we decided to depend on delta-rs & lance; before then we didn't really have dependency issues.

Upgrades

This goes hand-in-hand with our dependency challenges. But I want to be more specific around upgrading DataFusion.

More testing via additional test coverage (it would be great to see some explicit investment in this area)

I accept that bugs will be introduced. It's software, it happens. The real challenge was actually upgrading to get the bug fixes because of the deps...

Discussion on being more deliberate with API changes / slowing down breaking

I actually don't really care about API breakages if there's a clear path to getting the equivalent functionality working.

If something's bad, change it/break it and make it better. As an end user, this is fine. "Better" is subjective and depends on the use case.

I'd sum this up as: DataFusion upgrades on their own were fine; our challenges came from combining upgrades with upgrading dependencies.

Component composability

We originally chose DataFusion for the composability, and I went in with the idea that I'd be able to have all components within DataFusion be composable. I tried to shoehorn that ideal into our early versions of GlareDB with mixed success.

For example, I can't just have an Optimizer struct configured with some parameters and rules; I need to go through a SessionContext to have the optimizer work right. For example, we had a bug where select now() would return the wrong time because the optimizer was using a field provided by the context when replacing the now() call with a constant. See: GlareDB/glaredb#1427

There were a few other issues where attempting to use the pieces separately ended up biting us because of these implicit dependencies. I really wanted to avoid the SessionContext so that we could have fine-grained control over planning and execution, but we were forced into using the SessionContext because the optimizer wouldn't work right for certain queries, or an ExecutionPlan would read a configuration var from the TaskContext that could only be set through the SessionContext and have it be properly threaded down.

Essentially it felt like we had to depend on the SessionContext in order to use anything with confidence that it would work as expected. Either because things were designed that way, or because we had limited visibility (pub(crate)) into the actual fields/types.

This has gotten better over time (e.g. the OptimizerConfig would have solved our select now() problem above), but we've also had changes go the other way that increased dependence on the SessionContext. One of our earliest issues was the change to FileScanConfig removing the ability to just provide an ObjectStore (#2668). Since we're building a multi-tenant database, we want very complete control over how ObjectStores are constructed and how they're used, since they'll typically be unique to the session. The above change dictated how we managed object stores if we wanted to use the file scan features from DataFusion. This has also introduced some weirdness when more "dynamic" object stores are needed, e.g. delta-rs needs to generate unique urls for the object stores it creates just to store them on the context: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/logstore/mod.rs#L257-L265.
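
For reference, the registration-based flow that replaced passing an ObjectStore directly looks roughly like this (a hedged sketch: the in-memory store and the s3://{tenant} URL are stand-ins for a real tenant-scoped setup):

```rust
use std::sync::Arc;

use datafusion::error::{DataFusionError, Result};
use datafusion::prelude::SessionContext;
use object_store::memory::InMemory;
use url::Url;

/// Register a per-tenant object store on the session; scans that reference
/// URLs under this scheme/host resolve to the tenant's store
fn register_tenant_store(ctx: &SessionContext, tenant: &str) -> Result<()> {
    let store = Arc::new(InMemory::new()); // stand-in for a tenant-scoped store
    let url = Url::parse(&format!("s3://{tenant}"))
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    // Returns any previously registered store for this URL, which we ignore
    let _prev = ctx.register_object_store(&url, store);
    Ok(())
}
```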

I'd summarize this as: We wanted to use all of DataFusion's features, but in a way where we could control how those features were used. I want full control over configuring the optimizer, or the listing table, or ... But in many cases, that configuration can only happen with the use of SessionContext.

A suggestion might be to open up explicitly configuring components in DataFusion without the use of a SessionContext, and just have the SessionContext provide the easy to use interface for everything.
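
As a rough illustration of that direction, the optimizer crate can already be driven directly with an OptimizerContext carrying the query start time, which is what the select now() bug above needed. A minimal sketch (module paths and signatures vary by DataFusion version; the observer closure is a no-op):

```rust
use chrono::Utc;
use datafusion::common::Result;
use datafusion::logical_expr::LogicalPlan;
use datafusion::optimizer::{Optimizer, OptimizerContext};

/// Drive the optimizer directly, without constructing a SessionContext
fn optimize_standalone(plan: LogicalPlan) -> Result<LogicalPlan> {
    // OptimizerContext carries the fields rules need, e.g. the start time
    // used when const-folding `now()` into a constant
    let config = OptimizerContext::new().with_query_execution_start_time(Utc::now());
    let optimizer = Optimizer::new(); // default rule set
    optimizer.optimize(plan, &config, |_plan, _rule| {}) // no-op observer
}
```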

Features (misc)

WASM

I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied

I didn't mean to imply it, just that it's not something actively tested/developed for in DataFusion. I can't recall the exact issue, but late last year (2023) I was working on getting stuff running in the browser and was facing a panic since something was blocking.

There are a lot of considerations when bringing in dependencies while trying to compile for WASM, and most of our issues around that were just that we didn't have a clean split between what works with WASM and what doesn't. That's not a DataFusion issue.

Column naming/aliasing

Repeated column names came up as an example feature that was missing in DataFusion.

select * from (values (1), (2)) v(a, a)

This came up quite a bit for us early on, since we would have folks try out joins across tables with very similar schemas (e.g. finding the intersection of users between service "A" and service "B"), and sometimes those queries would fail due to conflicting column names, even when the user wasn't explicitly selecting them.

This has gotten better in DataFusion, but I don't think DataFusion's handling of column names (and generating column names) is good. There's a lot of code to ensure string representations of expressions are the same across planning steps, and it's very prone to breaking. I did try to add more correlated subquery support but stopped because, frankly, I found the logical planning with column names to be very unpleasant.

Async planning

Another approach, taken by SessionContext::sql, is:

  1. Do an initial pass through the parse tree to find all references (non-async)
  2. Then fetch all references (can be async)
  3. Then do the planning (non-async) with all the relevant references

I don't think this is particularly well documented

Yes, this is what we needed. At the time, we just forked the planner since it was easier to make changes that way (since we were also adding table functions for reading from remote sources).

Execution

The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the ExecutionPlan, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in

Yes, we wanted to independently execute parts of the queries across heterogeneous machines.

We looked at Ballista quite a bit, but ultimately went with a more stream-based approach to try to fit in better with how DataFusion executes queries. There's a high-level comment here about what we were trying to accomplish (split execution across a local and a remote machine), but the code itself ended up being quite complex.

I don't think it makes sense for DataFusion to try to cater to this use case, but this ultimately ended up being a reason why we moved away and is very central to the design of the new system.


Happy to go deeper in any of this.

@scsmithr
Contributor

@scsmithr In your follow up I think it would be useful to know if there is anything the DataFusion project could do better that would have made you more likely to contribute the necessary improvements back as opposed to starting from scratch.

This is honestly pretty hard to answer.

I think a lot of this is just circumstance and needing to move very quickly. I felt like there were areas in DataFusion that I personally wanted to improve and work on, but I ultimately felt like trying to make those improvements would take way too long relative to the time I have.

We chose to build on DataFusion in order to move quickly. But we ended up being a bit encumbered by dep upgrades, and the context switching between our code base and upstream DataFusion definitely took a toll on productivity.

@waynexia
Member

I can relate (comparing with GreptimeDB) to the situation & challenges from @scsmithr's share: the dependencies, the somewhat painful upgrade procedure, etc. I'd like to share some of my experiences:

For workload reasons, we chose to upgrade DataFusion periodically instead of continuously (like Ubuntu vs. Arch). Hence (1) related dependencies are locked until the next upgrade and (2) we need to handle a bunch of accumulated API changes and do a regression test. Since we rarely need a new API from Arrow urgently, the first point is acceptable for us. But (2), in contrast, can be classified as the most painful part of the entire experience 🤣 I tried to pin this down, and it turns out that breaking changes to existing APIs are not the root cause, as they are always explicit and can be solved easily, especially since we (DataFusion) mark an API #[deprecated] for a while before removing it. However, those non-breaking changes at the API level are painful: they are implicit, so you won't notice them until something goes wrong. E.g.: a new property interface on the plan, and some optimizer rules using this property to rewrite the plan.

I haven't found a good solution or suggestion for this problem. I am not even sure if it's possible to maintain all those complex connections among so many plans and optimizers. Given that DataFusion is highly extensible and allows users to define their own plans, types, rules, etc., this becomes harder to handle. We can't write tests for things that don't exist..

For execution, I tried an approach (slides here: #10341 (comment)) that rewrote the plan to enable execution across multiple nodes, which seems to have a similar interface to the RemotePhysicalPlanner from your link. Though this is not yet complete and is still under (inactive 😢) development, it looks viable to me.

Blog about how to compile DataFusion for WASM??

Willing to draft one, including some small lessons learned from making a Rust-WASM object and the API considerations. I wrote a few notes but never had the motivation (lazy, in other words 🙈) to organize them.

@timsaucer
Contributor

I have felt the pain points of the version upgrades. I mentioned this in the discord server, but one thing I think we can do that will lessen this is to push hard to get the datafusion-python upgrade PR through as fast as we can once a datafusion release candidate is ready. Then we can document all the changes we had to make for it and add them to the release notes. It would be good to have a page similar to what pyo3 does with their migration guide. I know this might slow down the release process, but I think it would be a good thing to do since part of our goal is to have datafusion-python expose most of datafusion. It won't catch everything, so we might also want to ask that any PRs with breaking API changes document what users need to do to migrate.

@Rachelint
Contributor

Rachelint commented Nov 24, 2024

However, those non-breaking changes at the API level are painful: they are implicit, so you won't notice them until something goes wrong.

Agree with what @waynexia mentioned; when I upgraded datafusion in HoraeDB a few months ago, it was really painful due to the implicit changes.

Finally, after spending a lot of time debugging, I found it was caused by an implicit change to TableProvider::scan: columns used in filters were no longer included in the pushed-down projection.

I think maybe we should make similar important changes explicit; otherwise it may be very hard to find what went wrong in a complex system built on datafusion.
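
To illustrate the kind of defensive code this change pushed into implementations, here is a hedged sketch of a helper a custom TableProvider::scan might use to union the pushed-down filter columns back into the projection (Expr::column_refs and the exact scan signature vary across DataFusion versions):

```rust
use std::collections::BTreeSet;

use datafusion::arrow::datatypes::SchemaRef;
use datafusion::common::Result;
use datafusion::logical_expr::Expr;

/// Extend the projected column indices with any columns the pushed-down
/// filters reference, so the scan still reads everything the filters need
fn projection_with_filter_columns(
    schema: &SchemaRef,
    projection: Option<&Vec<usize>>,
    filters: &[Expr],
) -> Result<Option<Vec<usize>>> {
    let Some(projection) = projection else {
        return Ok(None); // no projection means "read all columns"
    };
    let mut indices: BTreeSet<usize> = projection.iter().copied().collect();
    for filter in filters {
        for column in filter.column_refs() {
            indices.insert(schema.index_of(&column.name)?);
        }
    }
    Ok(Some(indices.into_iter().collect()))
}
```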

@jonmmease
Contributor

I also relate to the pain/anxiety of updating DataFusion due to changes in implicit logic.

It's not really captured by semver, but it would be nice if there were a distinction between breaking changes that simply require fixing compilation errors in a straightforward way, vs those that are changes in behavior without API changes.

@Omega359
Contributor

On Dependency Management I have a suggestion that I think could help, but it would take actual and ongoing work on the DF community's part.

I would suggest that we file tickets as part of every release to test DF with the latest versions of common external dependencies (datafusion-python, iceberg, lance, delta, etc.) as part of the gating criteria for a release. Perhaps this would be as simple as changing the DF version in those projects and running their test suites. Any blockers found in DF would have tickets filed and fixed; other blockers in the respective projects would have tickets filed in those projects.
