Replies: 3 comments 5 replies
-
I personally think that schemas/datasets are enough of a thing to deserve to be represented by a fully fledged object. But we need to clearly articulate whether or not this object has any kind of model-collection aspect in its semantics in dbt. Here I'm mostly concerned about the inheritance story between schemas and tables (in the database), versus schemas and models (as could be added in the dbt project), versus folders and models (the current dbt way). I'm not concerned about the implementation, nor do I have much doubt about how it should work. I'm concerned about the expectations that users will have, and how it can make the mental model (hu hu) of models more complex to grasp.

Or do you see it happening like this? I'd really prefer not to keep overloading this file, though:

```yaml
# dbt_project.yml
databases: # 'projects'? 'catalogs'? what about data platforms that don't support this third level?
  - name: my_database
    schemas: # or 'datasets'
      - name: important_schema # actual schema name? or schema config --> generate_schema_name?
        tables:
          - name: myTable1
```
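For context on the `generate_schema_name` question in the comment above: dbt's documented default is to use the target schema as-is when a model has no custom schema configured, and to append the custom schema to the target schema otherwise. A rough Python paraphrase of that default logic (the function name here is illustrative, not a dbt internal):

```python
from typing import Optional

def resolve_schema_name(target_schema: str, custom_schema: Optional[str]) -> str:
    """Paraphrase of dbt's default generate_schema_name macro:
    no custom schema -> use the target schema as-is;
    custom schema    -> append it to the target schema."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"

# In dev (target schema "dbt_jane"), a model configured with
# schema: important_schema lands in "dbt_jane_important_schema";
# in prod (target schema "analytics"), in "analytics_important_schema".
```

This is why "the configured schema" and "the actual schema name in the database" can legitimately differ by environment, which is exactly the ambiguity flagged in the YAML sketch.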
-
I am thinking about integrating the
@jtcohen6 how do you envision this?
-
Flagging that one aspect of schema management,
-
A more specific version of #5099
Philosophical premise
I am less inclined nowadays to say dbt should aspire to be "Terraform for data warehouses." Rather than a fully generic and extensible framework for managing data warehouse objects, where we take the existence of those object types (tables, schemas, tasks, streams, roles, UDFs, ...) as our starting point, I think we should always take jobs-to-be-done in data as our starting point (models, sources, exposures, tests, descriptions, granting access, ...), and treat data warehouse objects as implementation details / backend mechanisms for achieving those jobs.
Ok, but we need schemas
paraphrasing from dbt-labs/dbt-bigquery#289 (comment)
We're missing a mechanism for users to truly configure schemas/datasets in dbt today. We can agree on that, without necessarily implying that dbt should first / eventually become a totally generic and extensible framework for managing all types of database objects.
dbt is already in the business of creating schemas/datasets, as soon as you define and run a model in a schema/dataset that does not yet exist:
(see `dbt-core/core/dbt/task/run.py`, line 423 at `e95e36d`)
This has come up in the context of labels (dbt-labs/dbt-bigquery#22), persisting descriptions/comments (#1714), grants (dev blog), and managing "orphaned" objects (#4957).
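To make the "orphaned objects" job concrete: it reduces to a set difference between the relations that exist in a warehouse schema and the relations the project knows about. A minimal sketch (the function name is hypothetical; dbt's eventual interface may well differ):

```python
def find_orphans(in_database: set, in_project: set) -> set:
    """Relations present in the warehouse schema but not produced by any
    model/seed/snapshot in the project: candidates for a warning, or for
    dropping, under a schema-management task like the one proposed in #4957."""
    return in_database - in_project

# A leftover table from a since-renamed model shows up as an orphan:
orphans = find_orphans(
    {"dim_customers", "fct_orders", "fct_orders_old"},
    {"dim_customers", "fct_orders"},
)
# orphans == {"fct_orders_old"}
```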
Big ideas:
- When dbt runs `create schema`, it would take adapter-specific configs into account, persist the description as a comment, and so on.
- When dbt runs `create schema` for the first time, it adds every grant. If the schema already exists, it runs an introspective query to figure out the current grants, calculates the diffs, and applies them if needed.
- Via a `dbt manage` task (as proposed in "[CT-426] [Feature] Manage full schema content with dbt by dropping tables/views not maintained in models" #4957 / "Feature CT-426 manage schemas" #5392), it would take action to manage (warn about or drop) orphaned objects in that schema.

Questions:
Should the `database` + `schema` names be the actual database/schema names in the database? Or should they match the schema config which is passed into `generate_schema_name`, and naturally differ by environment? I lean toward the latter, though it could be tricky!