DBT + SodaSQL: how to manage your data at scale

Repository for Club Cloud workshop on dbt + SodaSQL. If you want to checkout the session, here's the youtube recording: link

Runbook

This runbook will guide you through the setup required to make this repo run in your own environments.

GCP setup

Create a GCP project (if you don't have one already)
Enable the BigQuery API
Create a Service Account with at least the following two roles:
- BigQuery Data Editor
- BigQery User
Download the json key file for later

Additionally, if you want to try this repository out using the exact same data, you can:

Enable the BigQuery Transfer Service
Create a destination dataset for your BigQuery Public Dataset (keep the same name if possible)
Then run the following command

bq mk --transfer_config --project_id=your-project-id --data_source=cross_region_copy --target_dataset=your-target-dataset --display_name='description' --params='{"source_dataset_id":"your-source-dataset","source_project_id":"your-project-id","overwrite_destination_table":"true"}'

Optionally you can use the UI for this copy job, but it did not work for me. Link here

dbt setup

Install dbt in your python environment (you can use conda or venv to manage this). If you want to follow the same specs used in this demo, then just run the following:

pip install -r requirements.txt

If you copied this repo, there's already a profiles.yml file that will work on the CI/CD pipeline under .github/workflows/ad_hoc.yml
Locally, you can add a new profile to your dbt default profiles.yml, which is usually located in the .dbt folder. Mine is located in /Users/guillermosanchez/.dbt. The profile should be the same as the one here, but changing the keyfile path to your local keyfile.json path.
If you run dbt debug you can confirm that the setup is working.

Soda-sql setup

Installation should already be ok if you run the first command of the previous block of this README
Authentication is provided in every warehouse.yml file (in our case this is <name_of_dataset>.yml). At the time me and my colleague Jovan made this demo, the only stable way we found was to provide Environment variables with the API_KEY_ID and API_KEY_SECRET, which you can get from the same keyfile.json used in the dbt setup.
You can generate basic scan files out of each <name_of_dataset>.yml by running:

soda analyze <name_of_dataset>.yml

This files have already been generated with the current datasets created by the dbt project in this repository.

If you have already created a dataset, and transfer some data (maybe even dbt run the dbt project), then you can test running a scan:

soda scan path/to/warehouse.yml path/to/scan.yml

or

python soda/trigger_soda.py --dataset <name_of_dataset> --tables <table_names_separated_by_comma>

Soda-cloud setup

Open a trial account at https://cloud.soda.io/
Go to Profile>API Keys and click on the + button to add a new API key.
You can set these API keys locally as environment variables if you want to run soda locally and collect the results in the cloud!

export API_KEY_ID=<soda_key_id>
export API_KEY_SECRET=<soda_key_secret>

CI/CD pipeline

If you want to run the end to end set-up, you can do so from the .github/workflows/ad_hoc.yml pipeline. For this, you need to make sure to setup the following Github Secrets:

BIGQUERY_SERVICE_ACCOUNT: which contains the full content of the keyfile.json
API_KEY_ID: your soda cloud API Key
API_KEY_SECRET: your soda cloud API secret

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
.github/workflows		.github/workflows
dbt		dbt
soda		soda
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DBT + SodaSQL: how to manage your data at scale

Runbook

GCP setup

dbt setup

Soda-sql setup

Soda-cloud setup

CI/CD pipeline

About

Releases

Packages

Languages

godatadriven/clubcloud_dbt_soda

Folders and files

Latest commit

History

Repository files navigation

DBT + SodaSQL: how to manage your data at scale

Runbook

GCP setup

dbt setup

Soda-sql setup

Soda-cloud setup

CI/CD pipeline

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages