Hi, firstly, I haven't said it so far, but thanks for creating and maintaining DataSynthesizer! It's a useful tool.
`DataDescriber` creates a value `missing_rate` in the attribute descriptions. I was wondering what your thoughts are on making use of these values in `DataGenerator` along with the distribution bins which are already used.
My use case is pretty simple: I want to create a synthesised data set for non-production use that is as representative of the original data set as possible. Two extremes of the problem I'm having:
- Columns which are mostly populated with values but have some null records end up with no null values at all in the synthesised data set.
- Columns which are mostly null but have a very small number of records populated with a value end up with every record set to that value, and no null values, in the synthesised data set.
In the instances where this matters most to me, I have addressed it in pre- and post-processing steps myself, but as `DataDescriber` already collects this metric, I was wondering if it would be reasonable to implement this in DataSynthesizer itself, perhaps as an option passed to the relevant generator methods.
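For illustration, the kind of post-processing step I mean looks roughly like the sketch below. It is only a sketch under assumptions, not anything DataSynthesizer provides: the function names `missing_rates_from_description` and `apply_missing_rates` are made up, the rows are plain dicts (for example read from the synthetic CSV with `csv.DictReader`), and I am assuming the description JSON keeps per-column metadata under an `attribute_description` key with a `missing_rate` entry.

```python
import random


def missing_rates_from_description(description):
    """Collect per-column missing rates from a parsed description file.

    Assumption: the description dict has an "attribute_description" mapping
    of column name -> metadata, where "missing_rate" is a fraction in [0, 1].
    Columns without the entry are treated as having no missing values.
    """
    return {
        name: attr.get("missing_rate", 0.0)
        for name, attr in description["attribute_description"].items()
    }


def apply_missing_rates(rows, missing_rates, seed=None):
    """Return a copy of `rows` with values nulled out to match `missing_rates`.

    rows: list of dicts, one per synthetic record.
    missing_rates: column name -> fraction of records to set to None.
    """
    rng = random.Random(seed)
    # Copy each record so the caller's data is left untouched.
    result = [dict(row) for row in rows]
    for column, rate in missing_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"missing rate for {column!r} must be in [0, 1]")
        # Null out a fixed number of randomly chosen records per column.
        n_missing = round(rate * len(result))
        for index in rng.sample(range(len(result)), n_missing):
            result[index][column] = None
    return result
```

Nulls are placed independently per column here, so any correlation between missingness in different columns is lost; that is one reason I would prefer the generator to handle it.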
Cheers!
raids changed the title from "Thought on respecting missing_rate in DataGenerator" to "Thoughts on respecting missing_rate in DataGenerator" on Jul 20, 2020