Python models: the pandas problem (and a possible solution) #5738

lostmygithubaccount · 2022-08-31T02:56:29Z

lostmygithubaccount
Aug 31, 2022

2023 edit: probably a better read than this on the topic from the creator of pandas and Ibis: https://wesmckinney.com/blog/looking-back-15-years/

The purpose of this discussion is to talk about the ~~elephant~~ panda in the room at the heart of usability issues with Python as a lingua franca, alongside SQL, for data transformation in dbt.

Before jumping into this discussion, first read:

The pandas problem

pandas is a great tool. It helps many people every day get their jobs done. But it simply does not scale. While there are options to help, pandas itself has fundamental flaws, acknowledged by its creators and maintainers, that prevent it from truly becoming a lingua franca for data transformation at scale, a requirement for most organizations.

A handwavy overview of the problem

You can read about what pandas aims to achieve in their documentation: https://pandas.pydata.org/docs/getting_started/overview.html

In short, the scalability issue is a result of (relatively inefficient) in-memory computation on a single thread. Aspects of the API are technically challenging or impossible to replicate at scale. In my personal opinion, tools that try to replicate the API and scale have largely failed or have serious usability issues.

You can take a look at the pandas documentation to see how they recommend you scale with the tool: "Scaling to large datasets", pandas 1.4.3 documentation (pydata.org).

Notably, the third paragraph states:

> But first, it’s worth considering not using pandas. pandas isn’t the right tool for all situations. If you’re working with very large datasets and a tool like PostgreSQL fits your needs, then you should probably be using that. Assuming you want or need the expressiveness and power of pandas, let’s carry on.
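
To make the "use the database instead" advice concrete, here is a minimal sketch of pushing an aggregation down to a SQL engine so that only the small result set reaches Python. It is an illustration, not anything from the pandas docs: the standard-library `sqlite3` module stands in for PostgreSQL, and the `orders` table and its columns are made up for the example.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real
# warehouse such as PostgreSQL; the table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 7.5)],
)

# The GROUP BY runs inside the database engine; Python only receives
# one row per customer rather than the full table.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
totals = dict(rows)
conn.close()

print(totals)
```

With a dataframe library the same question is often answered by loading every row into memory first; here the engine does the work and the client-side footprint stays proportional to the number of groups.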

Today's dataframe options that scale

What I don't want to get into is arguing that dataframes are an incorrect model for data -- it's way out of my depth.

I am also not arguing that the options I'm going to list below are bad. There are pros and cons to each. What I am arguing is that the user experience (UX) for Python-first data practitioners in general suffers from the fragmentation of the ecosystem and the nuance between APIs across options specific to certain data platforms.

That is, dealing with `groupby` vs `group_by` and syntactic differences in the `rename` API. And then, on top of the slight syntax differences, dealing with scale and performance considerations.

Some common tools recommended instead of pandas for scale include, in order of my personal familiarity:

- Dask, a pure-Python distributed framework similar to Spark, with distributable dataframes built on pandas and the rest of the PyData ecosystem
- Modin, intended as a drop-in replacement for pandas, powered by Dask, Ray, or other backends
- Koalas and PySpark dataframes, built on Spark
- Snowpark for Snowflake, which compiles down to SQL
- Fugue, probably the most interesting of this group for this discussion

I'm definitely missing others. Each has its own issues and, more fundamentally, none provides an expressive API across platforms to solve the pandas problem.

Also note they're typically not mutually exclusive -- several can run on Dask backends, Dask can run on Spark backends, etc. Nearly all can convert to pandas. That's all cool technically, but it still results in the same UX issues.

Ibis: a cross-platform, open-standard providing the solution?

Parting thoughts

This is a discussion, not a statement of intent or roadmap. I hope this can serve to spark some debate in the community and from people more familiar with some of the packages I've mentioned above, to make their case.

From dbt's perspective, it would be ideal to have a single cross-platform open standard for data transformations in Python, in line with the Zen of Python -- there should be one, and preferably only one, obvious way to do it. I've always found that statement a bit comical against the backdrop of the Python package ecosystem, which typically offers a multitude of ways to solve a problem, none of them particularly obvious.

Still, it's hard to argue the current Python ecosystem for data transforms offers an ideal UX. I hope we can find something better.

Comments, questions, concerns?

Let us know below!

lostmygithubaccount · 2022-12-06T20:54:05Z

lostmygithubaccount
Dec 6, 2022
Author

related #6296

0 replies

asheone · 2022-12-12T10:48:18Z

asheone
Dec 12, 2022

There is one more package that might prove useful. Written in Rust with DataFrame API and supporting SQL: https://www.pola.rs/

1 reply

lostmygithubaccount Dec 19, 2022
Author

polars is very interesting! I didn't realize it supported SQL -- it looks like only for reading from? I definitely want to spend some time with it in the near future

binste · 2023-09-07T15:12:20Z

binste
Sep 7, 2023

I think Ibis is a great candidate for a universal dataframe API and I would be very interested to see first-class support for it in dbt.

As the discussion here has not been very active and as it did not make it onto the roadmap for 2023, I've decided to create dbt-ibis to bridge the time until dbt itself might include this functionality or offer a plugin system for other modelling languages.

Compared to #6296, I decided not to connect to the database directly but instead to compile Ibis expressions to SQL files which dbt can then pick up as normal models. This greatly simplified the development and works well for my use cases, but it has a few limitations, which I state in the project's README file.

The package is still in early development but the first release is out and I'd be curious to hear if this is useful for other people as well!

4 replies

max-sixty Sep 7, 2023

Nice work @binste ! I think your approach is good — having a dbt-ibis run CLI — rather than trying to make it work directly with dbt run, which was too ambitious and didn't really work with dbt-prql.

Let's keep in touch — I'm still really keen to have an integration, but haven't worked on this in a while.

Hopefully your effort will help us figure out the best design for these. I do think it would be great if these can eventually be plugins called with dbt run, rather than a different entrypoint which attempts to replicate the standard entrypoint.

lostmygithubaccount Sep 7, 2023
Author

I'd be vaguely interested in helping here (particularly on the Ibis side), though my time for random projects is pretty low right now. regardless, very cool to see and I'll definitely be following along!

binste Sep 10, 2023

Thank you both for the feedback!

@max-sixty Fully agree, it would be great if we could just use the normal dbt CLI. If this is not implemented, we could consider having a separate package that provides a framework for implementing support for more modelling languages. If you call dbt-ibis run, the following things happen at a very high level:

1. Run a customized dbt parse command to get the relevant information about the dbt project without having to parse it myself, e.g. project root, names of model and target folders, information on nodes from the yml files, etc. In addition to the manifest, it also returns the runtime config and suppresses some errors and does some other magic to not have side effects (e.g. back up and restore dbt artifacts)
2. Compile Ibis models to SQL
   a. Find all model files with the .ibis extension
   b. Compile them to .sql files with ref() and source() Jinja macros
3. Execute the original dbt run command
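
As a rough illustration of step 2 only, the discover-and-compile loop might look like the sketch below. This is not dbt-ibis's actual code: `compile_models` and `fake_compile` are hypothetical names, and `fake_compile` is a placeholder that treats the file content as a parent model name instead of evaluating a real Ibis expression.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def compile_models(models_dir: Path, compile_one: Callable[[str], str]) -> List[Path]:
    """Write a .sql file next to every .ibis file found under models_dir.

    compile_one is a hypothetical stand-in for the real Ibis-to-SQL step.
    """
    written = []
    for src in sorted(models_dir.rglob("*.ibis")):  # step 2.a: find model files
        dst = src.with_suffix(".sql")               # step 2.b: emit SQL beside them
        dst.write_text(compile_one(src.read_text()))
        written.append(dst)
    return written


def fake_compile(source: str) -> str:
    # Placeholder only: reads the file content as a parent model name
    # and selects from it through dbt's ref() Jinja macro.
    parent = source.strip()
    return f"SELECT * FROM {{{{ ref('{parent}') }}}}\n"


with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models"
    models.mkdir()
    (models / "orders.ibis").write_text("stg_orders")
    outputs = compile_models(models, fake_compile)
    output_names = [p.name for p in outputs]
    compiled_sql = outputs[0].read_text()

print(output_names, compiled_sql)
```

Passing the compile step in as a callable mirrors the point that another language would mainly need to swap out step 2.b; after this loop, step 3 hands over to the unmodified dbt run.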

Apart from the file extension in 2.a, other languages such as prql might only need to implement 2.b. But as you said, let's see if this is the best design and of course, first-class support in dbt would be better :)

@lostmygithubaccount Good to know! I will ping you in case I get stuck somewhere on the Ibis side and would appreciate another set of eyes.

binste Oct 2, 2023

I just released a new version of dbt-ibis with support for the following dbt adapters 🥳 :

DuckDB
Snowflake
Postgres
Redshift
Trino
MySQL
SQLite
Oracle
