-
I'm really excited about this! Thanks for all the detail!
-
This feels like a Pandora's box to me. We gain the ability to orchestrate Python alongside SQL, but at what cost?
-
Shower thoughts:
To be a contrarian to a recent contrarian post: it feels like dbt is still for me. 😄 Cheers! @jtcohen6
-
@jtcohen6 I love your work, amazing as always. I understand the dataframe-out approach, and that dbt will add boilerplate using that
-
@jtcohen6 @ChenyuLInx thank you all for writing this up, really exciting stuff. I am interested, as an adapter author (in this case, of dbt-duckdb) to understand how I can extend my adapter to support the python data models and would love to know when some docs and/or example code will be available for me to peruse. I should be up-front with you about the fact that, since DuckDB is an embedded DB that only runs in-process, I am going to totally abuse the hell out of this feature to run Python code locally, install whatever crazy packages I feel like, etc., etc. because I am very excited about what the combination of dbt, DuckDB, and Python packages like scikit-learn and (gulp) prophet could mean for orchestrating data science-centric workflows.
-
I like the idea a lot! A few years ago I implemented and maintained an internal jury-rigged version of this exact feature, and it worked very similarly to your description. The function had to return a dataframe, but the inputs were just rendered table names and the caller was responsible for accessing the source data (we had some helper libraries). That use case has since gone away and we no longer maintain our adapter modifications, but we get this ask occasionally. I agree with your decision to rely on system packages without adding an overloaded or opinionated package management layer. We build and package dbt in a somewhat customized way, and adding layers to package management could make our deployment harder. If you have questions about GCP implementation details I'd be happy to chat and tell you what we've done in the past.
-
Some quick things I'm thinking through.
-
Please add a link to this conversation in https://docs.getdbt.com/docs/running-a-dbt-project/dbt-api so people can know there is work in progress on this awesome feature!
-
Some reference material for the inevitable questions about how packaging in Python can be frustrating: I suspect there are going to be quite a few questions about why certain packages install one way locally (for Mac or Windows users) and install differently on Linux (typically in the cluster).
-
Would developers be able to import their custom pre-trained models and use them as a 'transformation' (i.e. batch prediction)? Example:

```python
def model(dbt, session):
    dbt.config(materialized="table")

    # assuming dbt.ref() returns a pandas-compatible dataframe here
    fct_sales_pipeline = dbt.ref("fct_sales_pipeline")

    import pickle
    with open("my_model.pkl", "rb") as f:
        trained_model = pickle.load(f)

    features = fct_sales_pipeline[["days_open", "stage", "credit_score", "company_headcount"]]
    y_output = trained_model.predict(features)

    # attach the predictions back onto the source rows by row index
    final_df = fct_sales_pipeline.assign(prediction=y_output)
    return final_df
```
-
Sorry, I am new to the conversation; someone just brought this up to me today. One thing I'm not quite clear on is where the computation actually happens. Is it in the DB or in the dbt process? If it's the former, then the DB engine needs to support this kind of operation. If it's in the dbt process, I really don't want to provision massive infrastructure outside of the main DB to run some in-memory Python process.
-
Thank you so much @jtcohen6 for the inspiring proposal. Just wanted to share my excitement. A lot of discussion in this proposal reminds me of so-called

There seems to be really useful synergy between the two ecosystems, whereby dbt covers

I'm yet to wrap my head around where the hand-off should occur, and what the integration point looks like. I found @drewbanin's recorded talk on the fusion of dbt and the Feature Store. If there are any prior discussions or thoughts on whether this direction could be accelerated with the introduction of Python in dbt, and how, I'm all 👂 to learn from everyone here.

P.S.
-
This is a super mind-blowing feature!

```python
def model(dbt, session):
    # Do nothing about data transformation
    important_table = dbt.ref("important_table")

    # send out a notification
    if data_check(important_table):
        send_to_stakeholders(important_table)
    else:
        notify_developers("important_table is corrupted!")

    important_table.save_to_some_place("/path/to/cloud/storage")
    return None
```

It can then overthrow Airflow 😱
-
Is there an option to have the Python run locally instead of only on the warehouse? I run dbt as part of a host of other tools; this feature sounds really great, but I don't love the idea of pushing my Python execution environment to the warehouse. There are a lot of development reasons for this as well (breakpoints, etc.). Right now

If we could make the execution environment be the same as the dbt instance, that would be amazing for me! Still a great feature :D
-
BigQuery supports remote functions now via Cloud Functions and Cloud Run. I wonder if that would be a better fit than Dataproc for the intended use cases. https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions
-
In addition to what @rpedela-recurly suggested, I would like to put on the radar that BigQuery supports stored procedures for Apache Spark in preview.
-
Hello All,
-
@jtcohen6 - First, wanted to echo what others have said - the feature is very exciting. I want to ask how you are thinking about static analysis of dependencies and schema. Said another way, does the code necessarily have to be run (building the datasets) in order to detect its own relationships in the DAG?

Day one problem: When building the DAG for the first time (and no data exists), where does the Python script sit in the DAG?

Day two problem: What happens when the list of dependencies changes compared to prior executions? Imagining a script that reads data from one

At minimum, would dbt pause execution or throw an error if the

Comparison to other tools: Some other tools with similar DAG-sequencing challenges, like Terraform, take two passes through the code - one pass to walk the dependencies and build out the DAG, and a separate pass when actually 'building'. This process means certain types of dynamic inputs are disallowed if they can't be statically evaluated.
-
How do I load an .npy file in my Python model, given that it is saved in the same directory as the Python model?
-
Has anyone been able to successfully run Python incremental models on GCP? I have some Python models that work completely fine with
-
Does anyone intend to write an adapter that allows running a Python model on the same machine that runs dbt?
-
@shugybugy-assaf folks already have; dbt-duckdb and dbt-fal work this way.
-
Hi, what's the best way to run Python models, or at least Python scripts, on Redshift / AWS?
-
I was going through https://docs.getdbt.com/docs/build/python-models#specific-data-platforms, trying to understand how Python models can be submitted as PySpark jobs on a Dataproc cluster. Is there any good example of this?
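For what it's worth, the pattern described on that page (as of recent dbt-bigquery versions; worth double-checking the current docs) is that dbt submits the model as a PySpark job to Dataproc, either Serverless or an existing cluster, with connection details like `gcs_bucket` and `dataproc_region` set in the profile. A minimal, hedged sketch, where the cluster and model names are placeholders:

```python
# models/my_dataproc_model.py -- hypothetical model name

def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="cluster",                  # or "serverless" (assumption: per dbt-bigquery docs)
        dataproc_cluster_name="my-dataproc-cluster",  # placeholder cluster name
    )

    # on Dataproc, `session` is a SparkSession and dbt.ref() returns a PySpark DataFrame
    upstream = dbt.ref("upstream_model")  # placeholder upstream model

    return upstream
```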
-
How do I print a dataframe to the terminal from a dbt Core Python model?
-
Is support for Glue / vanilla Spark sources on the roadmap? Since Glue is Spark, would this be possible? Redshift? Is there any way someone could contribute a PR to enable this? I would be glad to do that if someone could point out whether such a PR is possible and might be considered. dbt is quite popular, and at times engineering teams want the edge cases to be handled using Python, or at least want that option available when deciding to adopt dbt.
-
Could this work with AWS Glue or Spark?
-
```python
import ...

def model(dbt, session):
    ...
```

Scratch that! I figured this out. The class literally does get created on the fly (at least on the Snowflake/Snowpark side). The stored procedure gets wrapped up with some fun stuff like this:

```python
config_dict = {}


class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)


class this:
    """dbt.this() or dbt.this.identifier"""
    database = "database"
    schema = "schema"
    identifier = "identifier"

    def __repr__(self):
        return 'database.schema.identifier'


class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = True
```

I do think an example of this generated code should get added to the docs though! Or maybe add a class that emulates this structure so we can add type hints to the model methods.
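On the type-hints idea: nothing like this ships with dbt today (as far as I know), but a small, hypothetical stub mirroring the generated `dbtObj` structure above could be enough for editors and type checkers to hint the `dbt` argument:

```python
# Hypothetical typing stub -- not part of dbt, just mirrors the generated wrapper above
from typing import Any, Callable, Protocol


class DbtThis(Protocol):
    database: str
    schema: str
    identifier: str


class DbtObj(Protocol):
    ref: Callable[..., Any]      # dbt.ref("model_name") -> dataframe
    source: Callable[..., Any]   # dbt.source("source_name", "table_name") -> dataframe
    config: Any                  # dbt.config(materialized="table", ...)
    this: DbtThis
    is_incremental: bool


def model(dbt: DbtObj, session: Any) -> Any:
    # hypothetical upstream model name
    return dbt.ref("some_upstream_model")
```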
-
Background & context
Since the dawn of time, dbt has been SQL first and SQL only. We believe that SQL is the most effective, accessible way to express critical business logic in modular, tested, well-documented data transformations. We still believe that. We also believe it's time to add Python into the mix.
Who is it for?
We're envisioning two target audiences, and two target use cases:
That is, our primary intent with Python models is to extend dbt’s reach into solving new types of problems, by opening up new kinds of modeling to people for whom it was previously inaccessible: people for whom the data warehouse is their principal tool. This is in addition to, not instead of, SQL. I envision the vast majority of dbt models being written in SQL, for the foreseeable future. Maybe not forever, but for some time to come.
By corollary: The ability to define vanilla data transformations via a fluent dataframe syntax, as an alternative to SQL, is a thing this feature will make possible for some, but it isn't our primary reason for building it.
Why build it? Why now?
dbt as open source standard for data transformation
SQL gets us a lot, but it cannot get us everything. There are real limits to what can be expressed in SQL, especially as you wade further into complex statistics. So long as dbt is SQL-only, those limits will be dbt’s limits, too.
There are lots of other languages that could expand our boundaries, many of them already supported by data warehouse vendors. Of all languages we could seek to add support for—R, Scala, JavaScript (!)—Python is the best mix of flexible, accessible, and popular. It’s another lingua franca for data, and the native tongue of many existing practitioners. It will unlock the most doors for the most people.
For teams that are doing analytics work today (all in SQL), and who have aspirations of doing data science / ML work (in Python) tomorrow, dbt can make the shift even easier. No new tools necessary.
dbt as participant in the ever-evolving modern data ecosystem
Databricks has long supported Python and PySpark transformations. Snowflake is adding this support via Snowpark. Google is building better integrations between BigQuery and its ML / Spark-running services.
When data warehouse vendors release exciting new functionality, we as dbt think hard about how best to expose that functionality. Sometimes we simply add support for the new capability as-is, as a new opt-in configuration, for those who know they need it. In other cases, we try to offer useful abstractions over the new capabilities, encoding our opinionated recommendations for how they ought to be used, in the hopes of making them more accessible to more people.
Insofar as cloud warehouses are seeing a need to support data science and ML workflows, dbt should seek to be the unifying interface, just as it has become for SQL-language analytical models.
dbt as a longer-term vision for data teams & organizations
In 2016, when we said that analysts should work more like software engineers, that statement was attempting to bridge a gulf that existed at many many organizations. Analysts sat on different teams, using different tools, even from data engineers. In the 5+ years since, dbt has done a lot to bridge the gap between analysts and data engineers—it’s the tool they can share, a site of collaboration.
Today, there exists a similar gap, at many organizations, between analytics teams and data scientists, ML Ops, etc. Many times these teams are using entirely different tools, and (worse) entirely different assumptions about their core data.
I don’t anticipate dbt Python models revolutionizing the day-to-day work of dedicated ML Ops engineers on the day we release the feature, or even 6 months afterward. But we believe that gap, in tooling and collaboration, should close. Let’s not underestimate the change we can accomplish within a few years.
What is it?
Users write `.py` model files. dbt parses that model to understand its configuration and DAG lineage (like Jinja, without Jinja). dbt ships that Python code off to the data warehouse, where that Python transformation is executed (just like SQL). The result is a table in the database, with the data resulting from that transformation (just like SQL models).
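For illustration, here is a minimal, hedged sketch of what such a model file could look like. The model and column names are made up, and the exact dataframe API depends on the platform (e.g. Snowpark vs. PySpark):

```python
# models/customer_order_counts.py -- hypothetical model name

def model(dbt, session):
    # configuration, analogous to {{ config(...) }} in a SQL model
    dbt.config(materialized="table", tags=["python"])

    # dbt.ref() resolves the upstream model and registers DAG lineage
    orders = dbt.ref("stg_orders")  # hypothetical upstream model

    # any transformation the platform's dataframe API supports
    final_df = orders.groupBy("customer_id").count()

    # the returned dataframe is materialized as a table in the warehouse
    return final_df
```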
Which adapters?
This capability will be available on data platforms that support running Python-language transformations at scale.
For starters, we're working on:
- Apache Spark, where Python models would be submitted via `spark-submit`. We'd welcome help and contribution from community members who know the intricacies of Apache Spark better than we do.

As with all dbt Core functionality, we believe the capability can be extended to other databases / engines that offer built-in support for batch Python transformations. We'd be excited to support maintainers of other adapters who want to add similar capabilities on other platforms.
It's likely that this functionality will differ across adapters, for a few reasons:
We'll do our best to offer shared abstractions where they're useful, and call out what needs to be different in each platform.
The first cut
All the things we're working to figure out now:
- Python models are defined as `.py` files in the `models/` directory.
- They can be materialized as `table` or `incremental`.
- They can be configured with `tags`, a custom `schema`, or otherwise. They can be described just like SQL models, run standard tests (written in SQL), and show up in the auto-generated documentation site with appropriate lineage.

For later:
What isn't it?
First, I don't see dbt as the ideal development experience for Python-language data exploration. At least, not initially. dbt is not an interactive notebook, and it won't be running Python code locally—it will, just like SQL, ship Python code off to the warehouse as its execution environment.
Even more than SQL, development in Python benefits from quick iteration: debugging, previewing, stepping through "fluent" transformations line-by-line. We approximate this in SQL by defining a "fluent" series of CTEs, and previewing the data in each CTE along the way, but there's no notion of sticking a breakpoint partway through and inspecting the attributes of an intermediate dataset.
We should expect this to get better over time, with companion tooling. Once someone has already developed Python code locally, and they’re ready to run it in production—reliably, in DAG order, with testing and documentation and highly visible lineage—they get to deploy it in their dbt project, rather than needing to spin up an external orchestration tool. That’s the feature.
Second, my goal with this feature is not to enable complex materializations / tasks / (web)hooks / operations written in Python. I want to draw a distinction between three types of logic that exist in dbt today:
With this initiative, our primary intent is to support the first type of logic only: transformation logic expressed in Python. The goal is not to add Python-language scripting to the operational toolkit of graph execution, database management, or workflow tooling. The aim isn't to support a more powerful form of `codegen`, or a way to orchestrate other services from dbt—we view those as standalone initiatives, which may intersect with Python-language transformation in interesting ways.

To that end, the initial version of Python models won't support more complex materializations. For now, we are aiming for parity with `table` and `incremental` (see the sketch below), and we're thinking about the right construct to support Python models that actually want to return functions, such as prediction functions. (That may be a `udf` model materialization, or a new node type dedicated to UDFs. This is work we plan to take up later this year.)

For now, that also means that one Python model returns one database object. We think this remains a useful abstraction, in SQL and in Python, with some acknowledged limitations. (Think: sharded tables, lambda views.) Adding support for models that return multiple database objects will remain a separate initiative, requiring other changes within `dbt-core`, and deserving of its own dedicated thinking. Python-language models may prove to be a good conduit for that functionality down the line.

I see this project as distinct from the initiative around adding support for "external" nodes, which might run Python-language transformations within external ML automation tools (e.g. Continual), or orchestrate another external service, tightly integrated with the dbt DAG via eventing + metadata. This is an initiative we're exploring separately, with active discussion in #5073, as an important way to support related functionality on databases that lack built-in capabilities for running Python.
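To make the `table`/`incremental` parity point a bit more concrete, here is a hedged sketch of what an incremental Python model might look like. Names are made up, the dataframe calls are PySpark/Snowpark-flavored, and how the existing relation is read back via `dbt.this` is adapter-specific (an assumption here):

```python
# models/events_incremental.py -- hypothetical model name

def model(dbt, session):
    dbt.config(materialized="incremental")

    events = dbt.ref("stg_events")  # hypothetical upstream model

    if dbt.is_incremental:
        # read back the already-materialized relation for this model;
        # assumption: the session can resolve dbt.this directly on this platform
        existing = session.table(str(dbt.this))
        max_loaded_at = existing.agg({"loaded_at": "max"}).collect()[0][0]

        # only keep rows newer than what is already in the table
        events = events.filter(events["loaded_at"] > max_loaded_at)

    return events
```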
We're not building a new way to run tests, although "unit tests" is always a hot topic of conversation, and does feel like even more of an expectation in Python-land. This is something that I'm continuing to discuss and think more about, in the context of the new `pytest` framework that we released for adapter maintainers in v1.1: #4455 (comment).
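For anyone curious, that adapter-facing framework looks roughly like this (a hedged sketch; fixture and model names are made up, and the exact imports are worth checking against the adapter testing docs):

```python
import pytest

# assumption: run_dbt and the `project` fixture come from the pytest framework
# released alongside dbt-core v1.1 for adapter maintainers
from dbt.tests.util import run_dbt

my_model_sql = """
select 1 as id
"""


class TestSimpleProject:
    @pytest.fixture(scope="class")
    def models(self):
        # a single hypothetical model for the test project
        return {"my_model.sql": my_model_sql}

    def test_run_succeeds(self, project):
        results = run_dbt(["run"])
        assert len(results) == 1
```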
Finally, this feature should never be a new barrier to using dbt. If you're someone who loves SQL, and who uses it every day to solve 100% of your problems—great! We're always going to keep investing in the experience of dbt SQL models. You do not need to know Python to use dbt, and you will not need to learn how to write Python dataframes if you don't want to. If, on the other hand, we manage to convince you that you can and you should, dbt will be ready to meet you where you are.
Very eager to hear your thoughts, concerns, ideas, questions, aspirations, and meditations. Thread them up below!