-
I'm really excited about this! Thanks for all the detail!
-
This feels like a Pandora's box to me. We gain the ability to orchestrate Python alongside SQL, but at what cost?
-
Shower thoughts:
To be a contrarian to a recent contrarian post: it feels like dbt is still for me. 😄 Cheers! @jtcohen6
-
@jtcohen6 I love your work, amazing as always. I understand the dataframe-out approach, and that dbt will add boilerplate using that
-
@jtcohen6 @ChenyuLInx thank you all for writing this up, really exciting stuff. I am interested, as an adapter author (in this case, of dbt-duckdb) to understand how I can extend my adapter to support the python data models and would love to know when some docs and/or example code will be available for me to peruse. I should be up-front with you about the fact that, since DuckDB is an embedded DB that only runs in-process, I am going to totally abuse the hell out of this feature to run Python code locally, install whatever crazy packages I feel like, etc., etc. because I am very excited about what the combination of dbt, DuckDB, and Python packages like scikit-learn and (gulp) prophet could mean for orchestrating data science-centric workflows.
-
I like the idea a lot! A few years ago I implemented and maintained an internal jury-rigged version of this exact feature, and it worked very similarly to your description. The function had to return a dataframe, but the inputs were just rendered table names and the caller was responsible for accessing the source data (we had some helper libraries). That use case has since gone away and we no longer maintain our adapter modifications, but we get this ask occasionally. I agree with your decision to rely on system packages without adding an overloaded or opinionated package management layer. We build and package dbt in a somewhat customized way, and adding layers to package management could make our deployment harder. If you have questions about GCP implementation details I'd be happy to chat and tell you what we've done in the past.
-
Some quick things I'm thinking through.
-
Please add a link to this conversation in https://docs.getdbt.com/docs/running-a-dbt-project/dbt-api so people can know there is work in progress on this awesome feature!
-
Some reference material for the inevitable questions about how packaging in Python can be frustrating: I suspect there are going to be quite a few questions about why certain packages install one way locally (for Mac or Windows users) and install differently on Linux (typically in the cluster).
-
Would developers be able to import their custom pre-trained models and use them as a 'transformation' (i.e. batch prediction)? Example:

```python
def model(dbt, session):
    dbt.config(materialized="table")

    # assuming dbt.ref() returns a pandas-compatible dataframe here
    fct_sales_pipeline = dbt.ref("fct_sales_pipeline")

    import pickle
    with open("my_model.pkl", "rb") as f:
        trained_model = pickle.load(f)

    features = fct_sales_pipeline[["days_open", "stage", "credit_score", "company_headcount"]]
    y_output = trained_model.predict(features)

    # attach the predictions back onto the source rows by row index
    final_df = fct_sales_pipeline.assign(prediction=y_output)
    return final_df
```
-
Sorry, I am new to the conversation; someone just brought this up to me today. One thing I'm not quite clear on is where the computation actually happens. Is it in the DB or in the dbt process? If it's the former, then the DB engine needs to support this kind of operation. If it's in the dbt process, I really don't want to provision massive infrastructure outside of the main DB to run some in-memory Python process.
-
Thank you so much @jtcohen6 for the inspiring proposal. Just wanted to share my excitement. A lot of discussion in this proposal reminds me of so-called

There seems to be really useful synergy between the two ecosystems, whereby dbt covers

I'm yet to wrap my head around where the hand-off should occur, and what the integration point looks like. I found @drewbanin's recorded talk on the fusion of dbt and the Feature Store. If there are any prior discussions or thoughts on whether this direction could be accelerated with the introduction of Python in dbt, and how, I'm all 👂 to learn from everyone here.

P.S.
-
This is a super mind-blowing feature!

```python
def model(dbt, session):
    # Do nothing about data transformation
    important_table = dbt.ref("important_table")

    # send out a notification
    if data_check(important_table):
        send_to_stakeholders(important_table)
    else:
        notify_developers("important_table is corrupted!")

    important_table.save_to_some_place("/path/to/cloud/storage")
    return None
```

It can then overthrow Airflow 😱
-
Is there an option to have the Python run locally instead of only on the warehouse? I run dbt as part of a host of other tools; this feature sounds really great, but I don't love the idea of pushing my Python execution environment to the warehouse. There are a lot of development reasons for this as well (breakpoints, etc.). Right now

If we could make the execution environment be the same as the dbt instance, that would be amazing for me! Still a great feature :D
-
BigQuery supports remote functions now via Cloud Functions and Cloud Run. I wonder if that would be a better fit than Dataproc for the intended use cases. https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions
-
In addition to what @rpedela-recurly suggested, I would like to put on the radar that BigQuery supports stored procedures for Apache Spark in preview.
-
Hello All,
-
@jtcohen6 - First, wanted to echo what others have said - the feature is very exciting. I want to ask how you are thinking about static analysis of dependencies and schema. Said another way, does the code necessarily have to be run (building the datasets) in order to detect its own relationships in the DAG?

Day one problem: When building the DAG for the first time (and no data exists), where does the Python script sit in the DAG?

Day two problem: What happens when the list of dependencies changes compared to prior executions? Imagining a script that reads data from one

At minimum, would dbt pause execution or throw an error if the

Comparison to other tools: Some other tools with similar DAG-sequencing challenges, like Terraform, take two passes through the code - one pass to walk the dependencies and build out the DAG, and a separate pass when actually 'building'. This process means certain types of dynamic inputs are disallowed if they can't be statically evaluated.
-
How do I load an .npy file in my Python model, given that it is saved in the same directory as the Python model?
-
Has anyone been able to successfully run Python incremental models on GCP? I have some Python models that work completely fine with
-
Does anyone intend to write an adapter that allows running a Python model on the same machine that runs dbt?
-
@shugybugy-assaf folks already have; dbt-duckdb and dbt-fal work this way.
-
Hi, what's the best way to run Python models, or at least Python scripts, on Redshift / AWS?
-
I was going through https://docs.getdbt.com/docs/build/python-models#specific-data-platforms, trying to understand how Python models can be submitted as PySpark jobs on a Dataproc cluster. Is there any good example of this?
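For what it's worth, the pattern described on that page (as of recent dbt-bigquery versions; worth double-checking the current docs) is that dbt submits the model as a PySpark job to Dataproc, either Serverless or an existing cluster, with connection details like `gcs_bucket` and `dataproc_region` set in the profile. A minimal, hedged sketch, where the cluster and model names are placeholders:

```python
# models/my_dataproc_model.py -- hypothetical model name

def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="cluster",                  # or "serverless" (assumption: per dbt-bigquery docs)
        dataproc_cluster_name="my-dataproc-cluster",  # placeholder cluster name
    )

    # on Dataproc, `session` is a SparkSession and dbt.ref() returns a PySpark DataFrame
    upstream = dbt.ref("upstream_model")  # placeholder upstream model

    return upstream
```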
-
How do I print a dataframe to the terminal from a dbt Core Python model?
-
Is support for Glue / vanilla Spark sources on the roadmap? Since Glue is Spark, would this be possible? Redshift? Is there any way someone could contribute a PR to enable this? I would be glad to do that if someone could point out whether such a PR is possible and might be considered. dbt is quite popular, and at times engineering teams want the edge cases to be handled using Python, or at least want that option available when deciding to adopt dbt.
-
Could this work with AWS Glue or Spark?
-
```python
import ...

def model(dbt, session):
    ...
```

Scratch that! I figured this out. The class literally does get created on the fly (at least on the Snowflake/Snowpark side). The stored procedure gets wrapped up with some fun stuff like this:

```python
config_dict = {}


class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)


class this:
    """dbt.this() or dbt.this.identifier"""
    database = "database"
    schema = "schema"
    identifier = "identifier"

    def __repr__(self):
        return 'database.schema.identifier'


class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = True
```

I do think an example of this generated code should get added to the docs though! Or maybe add a class that emulates this structure so we can add type hints to the model methods.
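On the type-hints idea: nothing like this ships with dbt today (as far as I know), but a small, hypothetical stub mirroring the generated `dbtObj` structure above could be enough for editors and type checkers to hint the `dbt` argument:

```python
# Hypothetical typing stub -- not part of dbt, just mirrors the generated wrapper above
from typing import Any, Callable, Protocol


class DbtThis(Protocol):
    database: str
    schema: str
    identifier: str


class DbtObj(Protocol):
    ref: Callable[..., Any]      # dbt.ref("model_name") -> dataframe
    source: Callable[..., Any]   # dbt.source("source_name", "table_name") -> dataframe
    config: Any                  # dbt.config(materialized="table", ...)
    this: DbtThis
    is_incremental: bool


def model(dbt: DbtObj, session: Any) -> Any:
    # hypothetical upstream model name
    return dbt.ref("some_upstream_model")
```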
-
Background & context
Since the dawn of time, dbt has been SQL first and SQL only. We believe that SQL is the most effective, accessible way to express critical business logic in modular, tested, well-documented data transformations. We still believe that. We also believe it's time to add Python into the mix.
Who is it for?
We're envisioning two target audiences, and two target use cases:
That is, our primary intent with Python models is to extend dbt’s reach into solving new types of problems, by opening up new kinds of modeling to people for whom it was previously inaccessible: people for whom the data warehouse is their principal tool. This is in addition to, not instead of, SQL. I envision the vast majority of dbt models being written in SQL, for the foreseeable future. Maybe not forever, but for some time to come.
By corollary: The ability to define vanilla data transformations via a fluent dataframe syntax, as an alternative to SQL, is a thing this feature will make possible for some, but it isn't our primary reason for building it.
Why build it? Why now?
dbt as open source standard for data transformation
SQL gets us a lot, but it cannot get us everything. There are real limits to what can be expressed in SQL, especially as you wade further into complex statistics. So long as dbt is SQL-only, those limits will be dbt’s limits, too.
There are lots of other languages that could expand our boundaries, many of them already supported by data warehouse vendors. Of all languages we could seek to add support for—R, Scala, JavaScript (!)—Python is the best mix of flexible, accessible, and popular. It’s another lingua franca for data, and the native tongue of many existing practitioners. It will unlock the most doors for the most people.
For teams that are doing analytics work today (all in SQL), and who have aspirations of doing data science / ML work (in Python) tomorrow, dbt can make the shift even easier. No new tools necessary.
dbt as participant in the ever-evolving modern data ecosystem
Databricks has long supported Python and PySpark transformations. Snowflake is adding this support via Snowpark. Google is building better integrations between BigQuery and its ML / Spark-running services.
When data warehouse vendors release exciting new functionality, we as dbt think hard about how best to expose that functionality. Sometimes we simply add support for the new capability as-is, as a new opt-in configuration, for those who know they need it. In other cases, we try to offer useful abstractions over the new capabilities, encoding our opinionated recommendations for how they ought to be used, in the hopes of making them more accessible to more people.
Insofar as cloud warehouses are seeing a need to support data science and ML workflows, dbt should seek to be the unifying interface, just as it has become for SQL-language analytical models.
dbt as a longer-term vision for data teams & organizations
In 2016, when we said that analysts should work more like software engineers, that statement was attempting to bridge a gulf that existed at many many organizations. Analysts sat on different teams, using different tools, even from data engineers. In the 5+ years since, dbt has done a lot to bridge the gap between analysts and data engineers—it’s the tool they can share, a site of collaboration.
Today, there exists a similar gap, at many organizations, between analytics teams and data scientists, ML Ops, etc. Many times these teams are using entirely different tools, and (worse) entirely different assumptions about their core data.
I don’t anticipate dbt Python models revolutionizing the day-to-day work of dedicated ML Ops engineers on the day we release the feature, or even 6 months afterward. But we believe that gap, in tooling and collaboration, should close. Let’s not underestimate the change we can accomplish within a few years.
What is it?
Users write `.py` model files. dbt parses that model to understand its configuration and DAG lineage (like Jinja, without Jinja). dbt ships that Python code off to the data warehouse, where that Python transformation is executed (just like SQL). The result is a table in the database, with the data resulting from that transformation (just like SQL models).
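For illustration, here is a minimal, hedged sketch of what such a model file could look like. The model and column names are made up, and the exact dataframe API depends on the platform (e.g. Snowpark vs. PySpark):

```python
# models/customer_order_counts.py -- hypothetical model name

def model(dbt, session):
    # configuration, analogous to {{ config(...) }} in a SQL model
    dbt.config(materialized="table", tags=["python"])

    # dbt.ref() resolves the upstream model and registers DAG lineage
    orders = dbt.ref("stg_orders")  # hypothetical upstream model

    # any transformation the platform's dataframe API supports
    final_df = orders.groupBy("customer_id").count()

    # the returned dataframe is materialized as a table in the warehouse
    return final_df
```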
Which adapters?
This capability will be available on data platforms that support running Python-language transformations at scale.
For starters, we're working on:
- Apache Spark, where Python models would be submitted via `spark-submit`. We'd welcome help and contribution from community members who know the intricacies of Apache Spark better than we do.

As with all dbt Core functionality, we believe the capability can be extended to other databases / engines that offer built-in support for batch Python transformations. We'd be excited to support maintainers of other adapters who want to add similar capabilities on other platforms.
It's likely that this functionality will differ across adapters, for a few reasons:
We'll do our best to offer shared abstractions where they're useful, and call out what needs to be different in each platform.
The first cut
All the things we're working to figure out now:
- Python models are defined as `.py` files in the `models/` directory.
- They can be materialized as `table` or `incremental`.
- They can be configured with `tags`, a custom `schema`, or otherwise. They can be described just like SQL models, run standard tests (written in SQL), and show up in the auto-generated documentation site with appropriate lineage.

For later:
What isn't it?
First, I don't see dbt as the ideal development experience for Python-language data exploration. At least, not initially. dbt is not an interactive notebook, and it won't be running Python code locally—it will, just like SQL, ship Python code off to the warehouse as its execution environment.
Even more than SQL, development in Python benefits from quick iteration: debugging, previewing, stepping through "fluent" transformations line-by-line. We approximate this in SQL by defining a "fluent" series of CTEs, and previewing the data in each CTE along the way, but there's no notion of sticking a breakpoint partway through and inspecting the attributes of an intermediate dataset.
We should expect this to get better over time, with companion tooling. Once someone has already developed Python code locally, and they’re ready to run it in production—reliably, in DAG order, with testing and documentation and highly visible lineage—they get to deploy it in their dbt project, rather than needing to spin up an external orchestration tool. That’s the feature.
Second, my goal with this feature is not to enable complex materializations / tasks / (web)hooks / operations written in Python. I want to draw a distinction between three types of logic that exist in dbt today:
With this initiative, our primary intent is to support the first type of logic only: transformation logic expressed in Python. The goal is not to add Python-language scripting to the operational toolkit of graph execution, database management, or workflow tooling. The aim isn't to support a more powerful form of `codegen`, or a way to orchestrate other services from dbt—we view those as standalone initiatives, which may intersect with Python-language transformation in interesting ways.

To that end, the initial version of Python models won't support more complex materializations. For now, we are aiming for parity with `table` and `incremental` (see the sketch below), and we're thinking about the right construct to support Python models that actually want to return functions, such as prediction functions. (That may be a `udf` model materialization, or a new node type dedicated to UDFs. This is work we plan to take up later this year.)

For now, that also means that one Python model returns one database object. We think this remains a useful abstraction, in SQL and in Python, with some acknowledged limitations. (Think: sharded tables, lambda views.) Adding support for models that return multiple database objects will remain a separate initiative, requiring other changes within `dbt-core`, and deserving of its own dedicated thinking. Python-language models may prove to be a good conduit for that functionality down the line.

I see this project as distinct from the initiative around adding support for "external" nodes, which might run Python-language transformations within external ML automation tools (e.g. Continual), or orchestrate another external service, tightly integrated with the dbt DAG via eventing + metadata. This is an initiative we're exploring separately, with active discussion in #5073, as an important way to support related functionality on databases that lack built-in capabilities for running Python.
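To make the `table`/`incremental` parity point a bit more concrete, here is a hedged sketch of what an incremental Python model might look like. Names are made up, the dataframe calls are PySpark/Snowpark-flavored, and how the existing relation is read back via `dbt.this` is adapter-specific (an assumption here):

```python
# models/events_incremental.py -- hypothetical model name

def model(dbt, session):
    dbt.config(materialized="incremental")

    events = dbt.ref("stg_events")  # hypothetical upstream model

    if dbt.is_incremental:
        # read back the already-materialized relation for this model;
        # assumption: the session can resolve dbt.this directly on this platform
        existing = session.table(str(dbt.this))
        max_loaded_at = existing.agg({"loaded_at": "max"}).collect()[0][0]

        # only keep rows newer than what is already in the table
        events = events.filter(events["loaded_at"] > max_loaded_at)

    return events
```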
We're not building a new way to run tests, although "unit tests" is always a hot topic of conversation, and does feel like even more of an expectation in Python-land. This is something that I'm continuing to discuss and think more about, in the context of the new `pytest` framework that we released for adapter maintainers in v1.1: #4455 (comment).
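For anyone curious, that adapter-facing framework looks roughly like this (a hedged sketch; fixture and model names are made up, and the exact imports are worth checking against the adapter testing docs):

```python
import pytest

# assumption: run_dbt and the `project` fixture come from the pytest framework
# released alongside dbt-core v1.1 for adapter maintainers
from dbt.tests.util import run_dbt

my_model_sql = """
select 1 as id
"""


class TestSimpleProject:
    @pytest.fixture(scope="class")
    def models(self):
        # a single hypothetical model for the test project
        return {"my_model.sql": my_model_sql}

    def test_run_succeeds(self, project):
        results = run_dbt(["run"])
        assert len(results) == 1
```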
Finally, this feature should never be a new barrier to using dbt. If you're someone who loves SQL, and who uses it every day to solve 100% of your problems—great! We're always going to keep investing in the experience of dbt SQL models. You do not need to know Python to use dbt, and you will not need to learn how to write Python dataframes if you don't want to. If, on the other hand, we manage to convince you that you can and you should, dbt will be ready to meet you where you are.
Very eager to hear your thoughts, concerns, ideas, questions, aspirations, and meditations. Thread them up below!