Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: partial distribution produced by get_noisy_distribution_of_attributes #26

Open
zjroth opened this issue Sep 9, 2020 · 2 comments

Comments

@zjroth
Copy link

zjroth commented Sep 9, 2020

  • DataSynthesizer version: 0.1.2

Description

The function get_noisy_distribution_of_attributes only gets a partial distribution. This bug was introduced in commit 1abe702. Here is the relevant code as it appears in master (currently commit be8b65a):

full_space = None
for item in grouper_it(products, 1000000):
    if full_space is None:
        full_space = DataFrame(columns=attributes, data=list(item))
    else:
        data_frame_append = DataFrame(columns=attributes, data=list(item))
        full_space.append(data_frame_append)

In particular, full_space.append does not modify full_space; instead, it returns a new object. (This seems to be true for all versions of pandas.) As a result, full_space does not store all of the intended rows but, rather, only at most the first 1000000.

haoyueping added a commit that referenced this issue Sep 14, 2020
@haoyueping
Copy link
Collaborator

Hi zjroth, thanks for your feedback. This bug is now fixed by commit 1ced27c

@zjroth
Copy link
Author

zjroth commented Sep 22, 2020

Thanks for the quick response and fix. It's worth noting that the bug was introduced in an attempt to fix memory issues (commit 1abe702). It's clear why this initial "fix" would have reduced memory consumption: No more than two million rows would ever be loaded at the same time. However, if I'm reading the code correctly, that is no longer the case with this new commit (1ced27c). As such, the attempt to reduce memory consumption may need to be revisited. (To be clear, this is not currently an issue for me.)

I just wanted to mention this in case a new issue needs to be created to resolve potential memory issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants