[CT-1408] [Feature] Support lineages between sources #6284

andrewcstewart · 2022-10-25T22:59:17Z


Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Sources are a great way to bootstrap a dbt project by documenting 'references' to existing tables, but source tables often have lineage relationships of their own, which cannot currently be represented in dbt. The source layer is essentially 'flat'. It would be valuable, primarily from a documentation perspective, if any known lineage between sources could be encoded in their configuration and rendered in the DAG.

The syntax could probably borrow the depends_on property from exposures, as follows:

```yaml
version: 2

sources:
  - name: my_sources
    tables:
      - name: first_upstream_source
      - name: second_upstream_source
      - name: downstream_source
        depends_on:
          - first_upstream_source
          - second_upstream_source
```

Instead of rendering the DAG like this:

```mermaid
flowchart LR
   a[first_upstream_source]
   c[downstream_source]
   b[second_upstream_source]
   c --> s1[stg_downstream]
   s1 --> m1[my_model]

   style a fill:lightgreen
   style b fill:lightgreen
   style c fill:lightgreen
   style s1 fill:cyan
   style m1 fill:cyan
```

...it could be rendered instead like this:

```mermaid
flowchart LR
   a[first_upstream_source] --> c[downstream_source]
   b[second_upstream_source] --> c
   c --> s1[stg_downstream]
   s1 --> m1[my_model]

   style a fill:lightgreen
   style b fill:lightgreen
   style c fill:lightgreen
   style s1 fill:cyan
   style m1 fill:cyan
```
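
For comparison, here is a rough sketch of how `depends_on` looks on an exposure today, which is the property the proposal above would borrow. Exposures reference their parents with `ref()` and `source()` rather than bare table names, so a source-level `depends_on` would presumably need some equivalent convention. The exposure name, type, and owner are made up for illustration; only `my_sources`, `downstream_source`, and `my_model` come from the example above.

```yaml
version: 2

exposures:
  - name: budget_dashboard          # hypothetical exposure name
    type: dashboard
    owner:
      name: Data Team               # hypothetical owner
      email: data-team@example.com  # hypothetical address
    depends_on:
      # parents are referenced with ref() / source(), not bare names
      - ref('my_model')
      - source('my_sources', 'downstream_source')
```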

Describe alternatives you've considered

One could potentially use the meta property to at least document the relationships between sources:

```yaml
version: 2

sources:
  - name: my_sources
    tables:
      - name: first_upstream_source
      - name: second_upstream_source
      - name: downstream_source
        meta:
          depends_on:
            - first_upstream_source
            - second_upstream_source
```

Of course this wouldn't have any functional value.

Who will this benefit?

dbt users who need to document upstream data lineages that are beyond their jurisdiction to reimplement with dbt but still wish to document.

Are you interested in contributing this feature?

Maybe?

Anything else?

No response

jtcohen6 · 2022-10-31T03:34:17Z

Maintainer

@andrewcstewart Thanks for opening!

dbt users who need to document upstream data lineages that are beyond their jurisdiction to reimplement with dbt but still wish to document.

Ah! It didn't click for me until this line.

Hm. Hm!

Do you have a concrete example you'd be willing to share? I could imagine documenting entire organizational processes and flows leading up to a single source. For instance, data cleaning & consolidation that takes place between separate tabs in a google sheet, where only the "final" tab is synced into the data warehouse:

```mermaid
graph LR;
    gsheets.finance_team_budget-->gsheets.budget_consolidated;
    gsheets.marketing_team_budget-->gsheets.budget_consolidated;
    gsheets.budget_consolidated-->stg_gsheet_budget;
    style gsheets.finance_team_budget fill:#5fb825,color:white;
    style gsheets.marketing_team_budget fill:#5fb825,color:white;
    style gsheets.budget_consolidated fill:#5fb825,color:white;
    style stg_gsheet_budget fill:#0094b3,color:white;
```

Is dbt really the right place to model that process, though? And is there much merit in modeling it via the DAG, versus just documenting it as a description on the gsheets.budget_consolidated source? (Wouldn't it be slick if mermaid were supported in the dbt-docs markdown...)
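
A minimal sketch of that description-only alternative, assuming a source named `gsheets` with a `budget_consolidated` table, as in the diagram above. The description wording is invented for illustration. The upstream tabs are recorded only as prose here, so they would not appear as nodes in the DAG.

```yaml
version: 2

sources:
  - name: gsheets
    tables:
      - name: budget_consolidated
        # illustrative wording; the lineage lives in prose, not in the DAG
        description: >
          Final tab of the budget sheet, and the only tab synced into the
          data warehouse. Consolidated from the finance_team_budget and
          marketing_team_budget tabs.
```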

There have also been asks, in the wide-ranging discussion about how to improve exposures (#5106), for the ability to define one exposure as depending on another. Users could thereby represent, within dbt, an entire DAG of relationships that exist outside of dbt.

My hesitation here borders on the ontological: What is a source? What is an exposure? There are two ways of answering these questions:

1. A source is and can only be a root node, with no parents of its own. An exposure is a leaf node, with no children. A source is the earliest known origination of data; an exposure is its ultimate use. Anything in between — any modification, reformulation, or enrichment of that data — is, by definition, a model.
2. A source is data before dbt, and an exposure is data after dbt. A model is during dbt; its definition lives in dbt.

I've wavered toward the second answer in the past. I think that has changed. With the discussion afoot in #5073, and our interest in supporting "active" sources and exposures — which can trigger their own ingest and sync/sink, respectively, within the dbt DAG — we shouldn't be thinking about sources as pre-dbt and exposures as post-dbt, but rather as what they are. How source data gets into the data platform / analytical universe (first mover), and where it goes / how it's used downstream — until, of course, it all comes back around, from DAG end to DAG start, data begets decisions begets behavior begets data.
