Synthetic data generation with DuckDB
Decoy is a synthetic data generator with DuckDB at its core. The main aim of the project is to reduce the overhead of creating synthetic data and to provide the tools to easily test your synthetic data queries.
It integrates well-known Python-based synthetic data generators such as Faker, Mimesis and Random.
The power of this application comes from three things: scripts that repeatably generate data, interactive tools for testing your queries, and tables that can interact with one another, allowing you to build full relational databases.
This tool generates data randomly: it does not require access to real data and it does not learn from real data. This is a deliberate design decision, ensuring that the data it generates can be used safely in non-production environments.
The only prior information it requires to make analogous synthetic data is the names of the tables in the database, the column names and any relationships between the tables.
If you require synthetic data which mimics real data distributions to the point where you can do analytics or machine learning, this is not the tool you want!
#Requires Python 3.10+
pip install --editable .
OR
#Requires Python 3.10+
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
Decoy exposes multiple command-line tools for generating data and for testing your data-generation queries.
The core of Decoy is the ability to repeatably generate synthetic data from a .sql file.
decoy exec examples/basic.sql
There are multiple examples in the ./examples folder.
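To give a feel for what such a script contains, here is a minimal sketch (not necessarily the contents of examples/basic.sql; it uses the faker_name and xeger functions described later in this README):

-- people.sql: a hypothetical script that could be run with `decoy exec people.sql`
CREATE OR REPLACE TABLE people AS
SELECT
    range AS id,                             -- sequential id
    faker_name() AS full_name,               -- random name from Faker
    xeger('07[0-9]{3} [0-9]{6}') AS mobile   -- random UK-style mobile number
FROM range(100);

-- Export the generated table to CSV
COPY people TO 'people.csv' WITH (HEADER 1, DELIMITER ',');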
In addition to the CLI for running .sql files, Decoy also provides interactive environments for easily building your queries and quickly testing that they work correctly.
Jupyter Lab is a web-based interactive environment for which we have created a Decoy kernel, allowing you to run Decoy SQL queries and quickly observe their output.
#install the kernel for first-time use
decoy kernel-install
#Run the Jupyter Lab web server
jupyter lab
Then select decoy as the kernel when starting a new notebook instance.
Some of the files in ./examples are *.ipynb notebooks; these are meant to be opened with the Jupyter kernel. Jupyter also supports Markdown, so if required for documentation you can explain your thought processes alongside the queries.
Decoy also includes a command-line REPL, which provides the same interactive testing environment on the command line.
decoy repl
Please note that each query must end with a ;. Because multi-line queries are supported, press <Alt/Opt><Enter> to execute the query.
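For example, a multi-line query such as the following can be typed into the REPL and executed with <Alt/Opt><Enter> (faker_name is one of the registered functions described below):

-- Generate five random names interactively
SELECT
    faker_name() AS name
FROM range(5);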
./cli sqlparse './examples/test_schema.sql' './examples/schema_parse_output.sql' 5
./cli exec ./examples/schema_parse_output.sql
** At the moment the kernel only works if the notebook is in the root of this project. Something needs fixing with the kernel path.
The Python Faker library is exposed for the en-GB locale; the registration function can easily be modified to include more locales.
select faker_name() from range(20);
The Python Mimesis library is exposed for the en-GB locale; the registration function can easily be modified to include more locales.
select mimesis_person_full_name() from range(20);
There is a reverse-regex function, xeger, for generating text that matches a given pattern.
SELECT xeger('\(\+0[0-9]{2}\) 07[0-9]{3} [0-9]{6}') as tel FROM range(20);
Functions are also available for intra-table sampling, column shuffling and oversampling from other tables.
-- Create a names table
CREATE TABLE name_table AS SELECT range as id, faker_name() as name FROM range(1000);
-- Select 10k names randomly from the name table.
SELECT oversample('name_table', 'name') from range(10000);
SELECT name, shuffle(name) as name_shuffled, intratable_sample(name) as name_sampled FROM name_table;
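These functions can be combined to build tables that reference one another. The sketch below is an assumption based on the oversample signature shown above: it samples the id column of name_table to populate a hypothetical orders table, giving a simple parent/child relationship.

-- Create an orders table whose customer ids are drawn from name_table
CREATE TABLE orders AS
SELECT
    range AS order_id,
    oversample('name_table', 'id') AS customer_id,   -- sampled from the parent table
    xeger('[A-Z]{3}-[0-9]{4}') AS order_ref          -- random order reference
FROM range(5000);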
You can also generate imperfect data by using the data-messing functions, which will either insert NULL values or junk data. These are currently hardcoded; the functions are in ./decoy/udf_arrow.py
CREATE OR REPLACE TABLE names_full AS (
SELECT
faker_name() as full_name
FROM range(30)
);
UPDATE names_full SET full_name = messy_data_nullifier(full_name);
DuckDB supports multiple export formats, including CSV and Parquet.
COPY (SELECT faker_name() FROM range(5)) TO 'export_test.csv' WITH (HEADER 1, DELIMITER ',');
COPY (SELECT faker_name() FROM range(5)) TO 'export_parquet.parquet';
PYTHONPATH=. pytest