[CT-1408] [Feature] Support lineages between sources #6284
Replies: 2 comments
@andrewcstewart Thanks for opening!
Ah! It didn't click for me until this line. Hm. Hm! Do you have a concrete example you'd be willing to share? I could imagine documenting entire organizational processes and flows leading up to a single source. For instance, data cleaning & consolidation that takes place between separate tabs in a Google Sheet, where only the "final" tab is synced into the data warehouse:
graph LR;
gsheets.finance_team_budget-->gsheets.budget_consolidated;
gsheets.marketing_team_budget-->gsheets.budget_consolidated;
gsheets.budget_consolidated-->stg_gsheet_budget;
style gsheets.finance_team_budget fill:#5fb825,color:white;
style gsheets.marketing_team_budget fill:#5fb825,color:white;
style gsheets.budget_consolidated fill:#5fb825,color:white;
style stg_gsheet_budget fill:#0094b3,color:white;
Is dbt really the right place to model that process, though? And is there much merit in modeling it via the DAG, versus just documenting it as a
There have also been asks, in the wide-ranging discussion about how to improve exposures (#5106), for the ability to define one exposure as depending on another. Users could thereby represent, within dbt, an entire DAG of relationships that exist outside of dbt.
My hesitation here borders on the ontological: What is a source? What is an exposure? There are two ways of answering these questions:
I've wavered toward the second answer in the past. I think that has changed. With the discussion afoot in #5073, and our interest in supporting "active" sources and exposures (which can trigger their own ingest and sync/sink, respectively, within the dbt DAG), we shouldn't be thinking about sources as pre-dbt and exposures as post-dbt, but rather as what they are: how source data gets into the data platform / analytical universe (first mover), and where it goes / how it's used downstream. Until, of course, it all comes back around, from DAG end to DAG start: data begets decisions begets behavior begets data.
Is this your first time submitting a feature request?
Describe the feature
Sources are a great way to bootstrap a dbt project by documenting 'references' to existing tables, but source tables often have lineage relationships which cannot currently be represented in dbt. The source layer is essentially 'flat'. It would be valuable, primarily from a documentation perspective, if any known lineage between sources could be encoded in their configuration so as to be rendered in the DAG.
The syntax could probably borrow the `depends_on` property from `exposures`, as follows:
Instead of rendering the DAG like this:
...it could be rendered instead like this:
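To make the proposed syntax concrete, here is a hypothetical sketch only. It assumes `depends_on` were accepted on source tables in the same form exposures use today; dbt does not support this on sources, which is the point of this request. The source and table names are borrowed from the Google Sheets budget example in the reply and are purely illustrative.

```yaml
# Hypothetical: `depends_on` is NOT a supported property on sources.
# This only illustrates the syntax being requested, modeled on exposures.
version: 2

sources:
  - name: gsheets
    tables:
      - name: finance_team_budget
      - name: marketing_team_budget
      - name: budget_consolidated
        # Proposed: declare the upstream sources this table is derived from
        depends_on:
          - source('gsheets', 'finance_team_budget')
          - source('gsheets', 'marketing_team_budget')
```

Under this assumption, the DAG would presumably draw edges from the two team budget tables into `budget_consolidated`, rather than showing all three as unconnected root nodes.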
Describe alternatives you've considered
One could potentially use the `meta` property to at least document the relationships between sources:
Of course this wouldn't have any functional value.
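As a rough illustration of the `meta` workaround described above, a sketch might look like the following. `meta` accepts arbitrary keys, so `upstream_sources` is a made-up key name chosen for this example, and the table names are again illustrative.

```yaml
version: 2

sources:
  - name: gsheets
    tables:
      - name: finance_team_budget
      - name: marketing_team_budget
      - name: budget_consolidated
        meta:
          # `upstream_sources` is an arbitrary, hypothetical key.
          # dbt stores it as metadata but does not use it to build lineage.
          upstream_sources:
            - gsheets.finance_team_budget
            - gsheets.marketing_team_budget
```

The relationship would then be recorded alongside the source's other metadata, but no edges would appear in the DAG.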
Who will this benefit?
dbt users who wish to document upstream data lineages that are beyond their jurisdiction to reimplement with dbt.
Are you interested in contributing this feature?
Maybe?
Anything else?
No response