[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525
Thanks for opening up this issue. I'll follow up with a longer-form comment describing more of what we were facing and ultimately why we switched off, but I want to leave a quick comment on how this feedback was delivered. I think the talk could easily be seen as a bit toxic/inflammatory, and that's something I could've handled better. It should have been less derisive and more constructive. Running a startup and trying to hit tight deadlines is incredibly draining, and I think the frustration/exhaustion just boiled over a bit. I think DataFusion is a great project with the unique goal of building a composable engine. That's pretty rare to see in this space, and the approach taken here has been pretty innovative.
Thank you for saying that. I didn't think the tone of your talk was derisive or toxic or inflammatory. I think you brought up several very valid points, and I very much appreciate the need to innovate quickly in a startup and the near-term slowdown that results from working with a larger existing codebase / community. While I probably would make different tradeoffs (like spending time on this post trying to make DataFusion better 😉 ), I totally understand that the right tradeoffs are different for different teams. I wish you nothing but the best at GlareDB. And of course, should you ever change your mind, or perhaps DataFusion has improved in some areas, we'll be waiting to welcome you back to the community.
@scsmithr In your follow-up I think it would be useful to know if there is anything the DataFusion project could have done better that would have made you more likely to contribute the necessary improvements back as opposed to starting from scratch.
There is more to it. We touched on this in the last community call.
Thank you @alamb for the list, this is great! I know GlareDB's perspective is important, and it feels like a "lost customer" deal for an OSS project. As with all deals, it's equally important to learn from those lost and also from those retained. For example, here is my perspective from working on SDF:
Having said the above, DataFusion is a GREAT library to build on.
Choosing DataFusion

We chose to use DataFusion as the base of our product around June 2022 and kept up with it until earlier this year. When deciding, we looked at a few alternatives.

The main alternative we looked at was DuckDB, but we opted not to go with it since it was written in C++ (not a bad thing, just that I was much more familiar with Rust and was focused on building a Rust team) and it seemed a bit less extensible than DataFusion for our needs. And as I mentioned in the talk, choosing DataFusion got us to a POC very quickly and let us build out the rest of the features we wanted.

Core challenges

Dependencies

This was a huge challenge for us. As a startup, we wanted to bring in libraries instead of writing our own. And since we were billing ourselves as a database to query across data sources, bringing in libraries like delta-rs and lance with already-written TableProviders for those formats seemed like a no-brainer. However, trying to keep all of these in sync with the right version of DataFusion (and Arrow) was incredibly challenging and time consuming. Sometimes there were pretty major blockers for either of those to upgrade. Some folks on our team helped out with a few upgrades (e.g. delta-io/delta-rs#2249), but some upgrades ended up taking a fair bit of time away from building the features we needed. Did we have to upgrade? No, but we chose DataFusion not for where it is today, but for where it'll be in the future, and delaying the upgrades meant we would just be stacking up incompatibilities and missing out on fixes/features that were going in. This issue only started to crop up when we decided to depend on delta-rs and lance; before then we didn't really have dependency issues.

Upgrades

This goes hand-in-hand with our dependency challenges, but I want to be more specific about upgrading DataFusion.
I accept that bugs will be introduced. It's software, it happens. The real challenge was actually upgrading to get the bug fixes because of the deps...
I actually don't really care about API breakages if there's a clear path to getting the equivalent functionality working. If something's bad, change it/break it and make it better. As an end user, this is fine. "Better" is subjective and depends on the use case. I'd sum this up as: upgrades themselves were fine; our challenges were with upgrades combined with upgrading dependencies.

Component composability

We originally chose DataFusion for the composability, and I went in with the idea that I'd be able to have all components within DataFusion be composable. I tried to shoehorn that ideal into our early versions of GlareDB with mixed success. For example, I couldn't just use a single component on its own. There were a few other issues where attempting to use the pieces separately ended up biting us because of these implicit dependencies. Essentially it felt like we had to depend on the full session machinery even when we only wanted one piece. This has gotten better over time.

I'd summarize this as: we wanted to use all of DataFusion's features, but in a way where we could control how those features were used. I want full control over configuring the optimizer, or the listing table, or ... But in many cases, that configuration can only happen through a full session context. A suggestion might be to open up explicit configuration of components in DataFusion without the use of a session context.

Features (misc)

WASM
I didn't mean to imply that, just that it's not something actively tested/developed for in DataFusion. I can't recall the exact issue, but late last year (2023) I was working on getting things running in the browser and was facing a panic since something was blocking. There are a lot of considerations when bringing in dependencies while trying to compile for WASM, and most of our issues around that were just that we didn't have a clean split between what works with WASM and what doesn't. That's not a DataFusion issue.

Column naming/aliasing
This came up quite a bit for us earlier on, since we would have folks try out joins across tables with very similar schemas (e.g. finding the intersection of users between service "A" and service "B"), and sometimes those queries would fail due to conflicting column names, even when the user wasn't explicitly selecting them. This has gotten better in DataFusion, but I don't think DataFusion's handling of column names (and generating column names) is good. There's a lot of code to ensure string representations of expressions are the same across planning steps, and it's very prone to breaking. I did try to add more correlated subquery support but stopped because, frankly, I found the logical planning with column names to be very unpleasant.

Async planning
Yes, this is what we needed. At the time, we just forked the planner since it was easier to make changes that way (since we were also adding table functions for reading from remote sources).

Execution
Yes, we wanted to independently execute parts of queries across heterogeneous machines. We looked at Ballista quite a bit, but ultimately went with a more stream-based approach to fit in better with how DataFusion executes queries. There's a high-level comment here about what we were trying to accomplish (split execution across a local and a remote machine), but the code itself ended up being quite complex. I don't think it makes sense for DataFusion to try to cater to this use case, but it ultimately ended up being a reason why we moved away, and it's very central to the design of the new system. Happy to go deeper into any of this.
This is honestly pretty hard to answer. I think a lot of this is just circumstance and needing to move very quickly. I felt like there were areas in DataFusion that I personally wanted to improve and work on, but I ultimately felt like trying to make those improvements would take way too long relative to the time I had. We chose to build on DataFusion in order to move quickly, but we ended up being a bit encumbered with dependency upgrades, and the context switching between our code base and upstream DataFusion definitely took a hit on productivity.
I can relate (comparing with GreptimeDB) to the situation and challenges from @scsmithr's share: the dependencies, the somewhat painful upgrade procedure, etc. I'd like to share some of my experiences. For workload reasons, we chose to upgrade DataFusion periodically instead of continuously (like Ubuntu vs. Arch). Hence (1) related dependencies are locked until the next upgrade and (2) we need to handle a bunch of accumulated API changes and do a regression test. Since we rarely need a new Arrow API eagerly, the first point is acceptable for us. But the second, in contrast, is the most painful part of the entire experience 🤣 I tried to get to the bottom of this, and it turns out the breaking changes to existing APIs are not the root cause, as they are always explicit and can be solved easily. I haven't found a good solution or suggestion for this problem. I'm not even sure if it's possible to maintain all those complex connections among so many plans and optimizers. Given that DataFusion is highly extensible and allows users to define their own plans, types, rules, etc., this becomes harder to handle. We can't write tests for things that don't exist. For execution, I tried an approach (slides here: #10341 (comment)) that rewrote the plan to enable execution across multiple nodes.
Willing to draft one, including some small lessons learned from making a Rust-WASM object and the API considerations. I wrote a few notes but never had the motivation (lazy, in other words 🙈) to organize them.
I have felt the pain points of the version upgrades. I mentioned this in the Discord server, but one thing I think we can do that would lessen this is to push hard on stabilizing the pieces that downstream crates depend on.
Agree with what @waynexia mentioned. When I upgraded, something broke, and finally, after spending a lot of time debugging, I found it was caused by an implicit behavior change. I think we should make similarly important changes explicit, otherwise it can be very hard to find what went wrong in a complex system built on DataFusion.
I also relate to the pain/anxiety of updating DataFusion due to changes in implicit logic. It's not really captured by semver, but it would be nice if there were a distinction between breaking changes that simply require fixing compilation errors in a straightforward way and those that change behavior without API changes.
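The distinction above can be illustrated with a toy example (the function names are hypothetical, not DataFusion APIs): a signature that stays identical across two versions while its default behavior silently changes.

```rust
// Toy illustration (not DataFusion code): two "versions" of a helper with
// identical signatures. Code written against v1 compiles unchanged against
// v2, but results differ -- the kind of break semver's API rules don't flag.
fn normalize_column_v1(name: &str) -> String {
    name.to_string() // v1: preserve case
}

fn normalize_column_v2(name: &str) -> String {
    name.to_lowercase() // v2: silently lowercases by default
}

fn main() {
    assert_eq!(normalize_column_v1("MyCol"), "MyCol");
    // Same call site after an "upgrade", different result:
    assert_eq!(normalize_column_v2("MyCol"), "mycol");
    println!("compiles against both versions; behavior differs");
}
```

A compiler catches the first kind of break; only a regression test suite catches the second, which is why these changes deserve loud callouts in release notes.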
On Dependency Management: I have a suggestion that I think could help but would take actual and ongoing work on the DF community's part. I would suggest that we file tickets as part of every release to test DF with the latest versions of common external dependencies (datafusion-python, iceberg, lance, delta, etc.) as part of the gating criteria for a release. Perhaps this would be as simple as changing the DF version in those projects and running their test suites. Any blockers found in DF get tickets filed and fixed; other blockers in the respective projects get tickets filed in those projects.
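As a rough sketch of what that gating could look like, a hypothetical CI job could check out a downstream crate, point it at the DataFusion release candidate, and run its tests. The repository name, version number, and override mechanism below are all illustrative, not an existing workflow:

```yaml
# Hypothetical release-gating job (illustrative names and versions).
name: downstream-smoke-test
on: workflow_dispatch
jobs:
  delta-rs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: delta-io/delta-rs
      - name: Pin the DataFusion release candidate under test
        run: cargo update -p datafusion --precise "$DF_RC_VERSION"
        env:
          DF_RC_VERSION: "43.0.0" # candidate version (illustrative)
      - name: Run the downstream test suite
        run: cargo test --all-features
```

Failures here would translate into tickets either against DataFusion (if the RC broke something) or against the downstream project.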
Is your feature request related to a problem or challenge?
I recently watched the Biting the Bullet: Rebuilding GlareDB from the Ground Up (Sean Smith) video from the CMU database series by @scsmithr. I found it very informative (thanks for the talk)
My high-level summary of the talk is "we are replacing DataFusion with our own engine, rewritten from the ground up". It sounds like GlareDB is still using DataFusion, as the new engine is not yet at feature parity.
From my perspective, it seems that GlareDB's decision is the result of their judgement of the relative benefits vs. estimated costs of continuing to build on DataFusion versus rewriting their own engine.
Of course, not every organization / project will have the same tradeoffs, but several of the challenges that @scsmithr described in the talk I think are worth discussing to make them better.
Here is my summary of those challenges:
Effort required / Regressions during upgrades
The general sentiment was that it took a long time to upgrade to DataFusion versions. This is something I have heard from others (@paveltiunov for example from Cube). We also experience this at InfluxData
Also, he mentioned they had hit issues where queries that used to work did not after upgrade.
Possible Improvements
Dependency management
GlareDB uses Lance and delta-rs, which both use DataFusion, and currently this requires the version of DataFusion used by GlareDB to match the versions used by those crates (he specifically cited delta-io/delta-rs#2886, which just finally merged)
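For illustration, the constraint looks something like this in a downstream `Cargo.toml` (the version numbers here are hypothetical, chosen only to show the shape of the problem):

```toml
# All three crates must agree on the same DataFusion (and, transitively,
# Arrow) major version. Otherwise cargo resolves two incompatible copies
# and trait implementations (e.g. TableProvider) stop lining up.
# Version numbers are illustrative.
[dependencies]
datafusion = "43"
deltalake = "0.21" # must be a release built against datafusion 43
lance = "0.19"     # likewise
```

Because DataFusion types appear in these crates' public APIs, an application cannot upgrade DataFusion until every such dependency has published a release against the same version.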
For what it is worth, @matthewmturner and I hit the same thing in dft (see datafusion-contrib/datafusion-dft#150)
Possible Improvements
WASM support
I don't think he said explicitly that DataFusion couldn't do WASM, but it seemed to be implied
Thanks to the work from @jonmmease @waynexia and others, DataFusion can absolutely be compiled to WASM (check out this cool example from @XiangpengHao: https://parquet-viewer.haoxp.xyz/), but maybe it needs to be better documented / explained in a blog
Implicit assumptions in LogicalPlans
Another thing that was mentioned was that writing custom optimizer rules was challenging because there were implicit assumptions (e.g. that column names were unique)
Possible Improvements
SQL Features
Repeated column names came up as an example feature that was missing in DataFusion.
Possible Improvements
Distributed Planning and Execution
Planning:
The talk mentioned that the DataFusion planner was linear in the way it resolved references, and the GlareDB system needed to resolve references from a remote catalog. Their solution seems to have been to fork the DataFusion SQL planner and make it async. Another approach is to resolve all table references asynchronously before planning (which is part of why SessionContext::sql is async). I don't think this is particularly well documented
Possible Improvements
Execution:
The talk mentioned several times that GlareDB runs distributed queries and found it challenging to use a different number of threads on different executors (it sounds like maybe they split the ExecutionPlan, which already has a target number of partitions baked in). It sounds like their solution was to write a new scheduler / execution engine that didn't have the parallelism baked in. Another potential way to achieve different thread counts per execution node is to do the distribution at the LogicalPlan level and then run the physical planner on each sub-part of the LogicalPlan.
Possible Improvements
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response