Thoughts on respecting missing_rate in DataGenerator #25

Open
raids opened this issue Jul 20, 2020 · 0 comments

raids commented Jul 20, 2020

Hi, firstly, I haven't said it so far, but thanks for creating and maintaining DataSynthesizer! It's a useful tool.

DataDescriber records a missing_rate value for each attribute in the attribute descriptions. I was wondering what your thoughts are on making use of these values in DataGenerator, alongside the distribution bins which are already used.

My use case is pretty simple: I want to create a synthesised data set for non-production use that is as representative of the original data set as possible. Two extremes of the problem I'm having:

  • Columns that are mostly populated but contain some null records end up with no null values at all in the synthesised data set
  • Columns that are mostly null, with only a very small number of records populated with a value, end up with every record set to that value and no null values in the synthesised data set

In the instances where it matters most to me, I have addressed this myself in pre- and post-processing steps, but as DataDescriber already collects this metric, I was wondering if it would be reasonable to implement this in DataSynthesizer itself, perhaps as an option passed to the relevant generator methods.
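
To illustrate the kind of post-processing step I mean, here is a minimal sketch, not anything DataSynthesizer provides today. The function name `apply_missing_rates` and its arguments are hypothetical. It takes the rates as a plain dict of column name to fraction, so it makes no assumption about how the description file is laid out; in practice the rates would come from the missing_rate values DataDescriber collects.

```python
import numpy as np
import pandas as pd


def apply_missing_rates(df, missing_rates, seed=None):
    """Return a copy of df with values nulled out column by column.

    Hypothetical helper, not part of DataSynthesizer.
    missing_rates maps a column name to a fraction in [0, 1].
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for column, rate in missing_rates.items():
        if column not in out.columns:
            continue  # ignore rates for columns that were not synthesised
        # Each row is nulled independently with probability `rate`,
        # so the observed null fraction only approximates the target.
        null_mask = rng.random(len(out)) < rate
        out[column] = out[column].mask(null_mask)
    return out


# Toy stand-in for a synthesised data set.
synthetic = pd.DataFrame({"age": [31, 45, 27, 60], "city": ["a", "b", "c", "d"]})
masked = apply_missing_rates(synthetic, {"age": 0.0, "city": 1.0}, seed=0)
```

Because rows are masked independently, a small data set will not hit the target rate exactly; sampling a fixed number of rows per column would be an alternative if an exact match matters. Note also that pandas upcasts integer columns to float once nulls are introduced.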

Cheers!

@raids raids changed the title Thought on respecting missing_rate in DataGenerator Thoughts on respecting missing_rate in DataGenerator Jul 20, 2020