Thoughts on respecting missing_rate in DataGenerator #25

raids · 2020-07-20T10:29:48Z

Hi, firstly, I haven't said it so far, but thanks for creating and maintaining DataSynthesizer! It's a useful tool.

DataDescriber records a missing_rate value in each attribute description. What are your thoughts on having DataGenerator use these values, alongside the distribution bins it already uses?

My use case is pretty simple: I want to create a synthesised data set for non-production use that is as representative of the original data set as possible. Here are the two extremes of the problem I'm having:

- Columns that are mostly populated but contain some null records end up with no null values at all in the synthesised data set.
- Columns that are mostly null, with only a very small number of populated records, end up with every record set to a populated value and no null values in the synthesised data set.

In the cases that matter most to me, I have handled this myself in pre- and post-processing steps. But since DataDescriber already collects this metric, I was wondering whether it would be reasonable to implement this in DataSynthesizer itself, perhaps as an option passed to the relevant generator methods.
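
To make the idea concrete, here is a minimal sketch of what such a post-processing step might look like. It is not DataSynthesizer code: the function name `apply_missing_rates`, the list-of-dicts row format, and the plain column-to-rate mapping are all assumptions made for illustration, and the rates would have to be read out of the missing_rate values that DataDescriber reports.

```python
import random


def apply_missing_rates(rows, missing_rates, seed=None):
    """Return a copy of `rows` with values nulled out per column.

    rows: list of dicts, one per synthesised record (assumed format).
    missing_rates: hypothetical mapping of column name -> fraction in [0, 1],
        e.g. taken from the missing_rate values in the data description.
    """
    rng = random.Random(seed)
    result = [dict(row) for row in rows]  # shallow copies; input is left untouched
    for column, rate in missing_rates.items():
        # Null out a fixed number of rows so the realised rate tracks the target.
        n_missing = round(rate * len(result))
        for index in rng.sample(range(len(result)), n_missing):
            result[index][column] = None
    return result


# Example covering both extremes: a mostly populated and a mostly null column.
synthetic = [{"age": 30 + i, "nickname": "x"} for i in range(100)]
masked = apply_missing_rates(synthetic, {"age": 0.1, "nickname": 0.95}, seed=0)
null_counts = {c: sum(r[c] is None for r in masked) for c in ("age", "nickname")}
```

With 100 rows, this leaves 10 nulls in `age` and 95 in `nickname`. Choosing a fixed number of rows per column, rather than nulling each value independently with the given probability, keeps the realised rate close to the target even for small data sets. It also treats columns independently, so any correlation between missingness in different columns is not preserved.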

Cheers!


raids changed the title ~~Thought on respecting missing_rate in DataGenerator~~ Thoughts on respecting missing_rate in DataGenerator Jul 20, 2020
