Python models: the pandas problem (and a possible solution) #5738
Replies: 3 comments 5 replies
-
related #6296 |
Beta Was this translation helpful? Give feedback.
-
There is one more package that might prove useful. Written in Rust with DataFrame API and supporting SQL: https://www.pola.rs/ |
Beta Was this translation helpful? Give feedback.
-
I think Ibis is a great candidate for a universal dataframe API and I would be very interested to see first-class support for it in dbt. As the discussion here has not been very active and as it did not make it onto the roadmap for 2023, I've decided to create dbt-ibis to bridge the time until dbt itself might include this functionality or offer a plugin system for other modelling languages. Compared to #6296, I decided on not connecting to the database directly but instead to compile Ibis expressions to SQL files which dbt can then pick up as normal models. This greatly simplified the development and works well for my use cases but has a few limitations which I state in the projects README file. The package is still in early development but the first release is out and I'd be curious to hear if this is useful for other people as well! |
Beta Was this translation helpful? Give feedback.
-
2023 edit: probably a better read than this on the topic from the creator of pandas and Ibis: https://wesmckinney.com/blog/looking-back-15-years/
The purpose of this discussion is to talk about the
elephantpanda in the room at the heart of usability issues with Python as a langua franca, alongside SQL, for data transformation in dbt.Before jumping into this discussion, first read:
The pandas problem
pandas is a great tool. It helps many people every day get their jobs done. But it simply does not scale. While there are options to help,
pandas
itself has fundamental flaws acknowledge by its creators and maintainers that prevent it from truly becoming a langua franca for data transformation at scale, a requirement for most organizations.A handwavy overview of the problem
You can read about what pandas aims to achieve in their documentation: https://pandas.pydata.org/docs/getting_started/overview.html
In short, the scalability issue is a result of (relatively inefficient) in-memory computation on a single thread. Aspects of the API are technically challenging or impossible to replicate at scale. In my personal opinion, tools that try to replicate the API and scale have largely failed or have serious usability issues.
You can take a look at the pandas documentation to see how they recommend you scale with the tool: Scaling to large datasets — pandas 1.4.3 documentation (pydata.org)
Notably, the third paragraph states:
Today's dataframe options that scale
What I don't want to get into is arguing that dataframes are an incorrect model for data -- it's way out of my depth.
I am also not arguing that the options I'm going to list below are bad. There are pros and cons to each. What I am arguing is that the user experience (UX) for Python-first data practitioners in general suffers from the fragmentation of the ecosystem and the nuance between APIs across options specific to certain data platforms.
That is, dealing with
groupby
vsgroup_by
and syntactical differences in therename
API. And then on top of slight syntax differences, dealing with scale and performance considerations.Some common tools recommended instead of pandas for scale include, in order of my personal familiarity:
I'm definitely missing others. Each has their own issues and, more fundamentally, don't provide an expressive API across platforms to solve the pandas problem.
Also note they're typically not mutually exclusive -- several can run on Dask backends, Dask can run on Spark backends, etc. Nearly all can convert to pandas. That's all cool technically, but still result in the same UX issues.
Ibis: a cross-platform, open-standard providing the solution?
Recommended reading:
So what does Ibis have over the previously mentioned packages? They're all well-run frameworks with source code (mostly) available on GitHub. Ibis's doc on Ibis for SQL programmers makes the case for its use in dbt well:
Effectively, it aims to be the Python standard for data transformations that follow some of the same best practices of SQL, benefits from the same performance optimizations (by compiling down to native code in the backend), but pulls in the benefits of Python. These all fit well into the needs of dbt users.
It's also created by the founder of pandas and worked on by others in the pydata ecosystem. If there is a good alternative to pandas that scales, this seems like it could be it.
Parting thoughts
This is a discussion, not a statement of intent or roadmap. I hope this can serve to spark some debate in the community and from people more familiar with some of the packages I've mentioned above, to make their case.
From dbt's perspective, it would be ideal to have a single cross-platform open-standard for data transformations in Python inline with the Zen of Python -- there should be one, and preferably only one, obvious way to do it. I've always found that statement a bit comical with the backdrop of the Python package ecosystem, which typically offers a multitude of ways to solve problems with none being particularly obvious.
Still, it's hard to argue the current Python ecosystem for data transforms offers an ideal UX. I hope we can find something better.
Comments, questions, concerns?
Let us know below!
Beta Was this translation helpful? Give feedback.
All reactions