describe_dataset_in_correlated_attribute_mode doesn't work in Python 3.11 #40

Open
artemgur opened this issue Sep 4, 2023 · 1 comment


artemgur commented Sep 4, 2023

  • DataSynthesizer version: 0.1.11 (latest)
  • Python version: 3.11
  • Operating System: Windows
  • Pandas version: 1.5.3

Description

In Python 3.11, describe_dataset_in_correlated_attribute_mode raises a ValueError. In Python 3.10, the same code with the same dependency versions works correctly.

At the same time, describe_dataset_in_independent_attribute_mode and describe_dataset_in_random_mode work correctly in Python 3.11.

I am using Pandas 1.5.3 rather than the latest 2.0.3 because describe_dataset_in_correlated_attribute_mode also fails with Pandas 2.0.3 (I will open a separate issue for that later).

What I Did

from DataSynthesizer.DataDescriber import DataDescriber

describer = DataDescriber()
describer.describe_dataset_in_correlated_attribute_mode(dataset_file=input_data, k=2, epsilon=0)
describer.save_dataset_description_to_file(description_file)

When the code is run, the following happens:

  1. "================ Constructing Bayesian Network (BN) ================" is printed (at least in Jupyter Notebook)
  2. The following exception is raised: "ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

Traceback:

ValueError                                Traceback (most recent call last)
Cell In[22], line 8
      6 describer = DataDescriber()
      7 #TODO k parameter
----> 8 describer.describe_dataset_in_correlated_attribute_mode(dataset_file=input_data,
      9                                                         k=2,
     10                                                         epsilon=0)
     11                                                         #seed=random_state,
     12                                                         #attribute_to_is_categorical=categorical_attributes)
     13 describer.save_dataset_description_to_file(description_file)

File ~\.virtualenvs\DataSynthesizerTest311\Lib\site-packages\DataSynthesizer\DataDescriber.py:177, in DataDescriber.describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
    174 if self.df_encoded.shape[1] < 2:
    175     raise Exception("Correlated Attribute Mode requires at least 2 attributes(i.e., columns) in dataset.")
--> 177 self.bayesian_network = greedy_bayes(self.df_encoded, k, epsilon / 2, seed=seed)
    178 self.data_description['bayesian_network'] = self.bayesian_network
    179 self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
    180     self.bayesian_network, self.df_encoded, epsilon / 2)

File ~\.virtualenvs\DataSynthesizerTest311\Lib\site-packages\DataSynthesizer\lib\PrivBayes.py:145, in greedy_bayes(dataset, k, epsilon, seed)
    142 attr_to_is_binary = {attr: dataset[attr].unique().size <= 2 for attr in dataset}
    144 print('================ Constructing Bayesian Network (BN) ================')
--> 145 root_attribute = random.choice(dataset.columns)
    146 V = [root_attribute]
    147 rest_attributes = list(dataset.columns)

File C:\Python311\Lib\random.py:369, in Random.choice(self, seq)
    367 def choice(self, seq):
    368     """Choose a random element from a non-empty sequence."""
--> 369     if not seq:
    370         raise IndexError('Cannot choose from an empty sequence')
    371     return seq[self._randbelow(len(seq))]

File ~\.virtualenvs\DataSynthesizerTest311\Lib\site-packages\pandas\core\indexes\base.py:3188, in Index.__nonzero__(self)
   3186 @final
   3187 def __nonzero__(self) -> NoReturn:
-> 3188     raise ValueError(
   3189         f"The truth value of a {type(self).__name__} is ambiguous. "
   3190         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   3191     )

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
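
Reading the traceback above, the failure appears to come from truth-testing the column index: random.choice in this Python 3.11 build evaluates `not seq`, and pandas Index.__nonzero__ raises ValueError for any truth test. The sketch below illustrates one possible workaround under stated assumptions: AmbiguousIndex is a hypothetical stand-in for pandas.Index (so the example runs without pandas), and choose_root_attribute is an illustrative helper, not part of DataSynthesizer. It converts the columns to a plain list before choosing, as line 147 of PrivBayes.py already does for rest_attributes.

```python
import random


class AmbiguousIndex:
    """Hypothetical stand-in for pandas.Index: a sequence that rejects truth testing."""

    def __init__(self, labels):
        self._labels = list(labels)

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, position):
        return self._labels[position]

    def __bool__(self):
        # Mirrors the behaviour shown in the traceback for pandas Index.
        raise ValueError("The truth value of a AmbiguousIndex is ambiguous.")


def choose_root_attribute(columns):
    """Pick one column label at random (illustrative helper, not library code).

    A plain list supports truth testing, so random.choice can evaluate
    `not seq` on it without raising ValueError.
    """
    candidates = list(columns)
    return random.choice(candidates)


if __name__ == "__main__":
    columns = AmbiguousIndex(["age", "income", "sex"])
    print(choose_root_attribute(columns))
```

With a real DataFrame, the analogous call would be random.choice(list(dataset.columns)). Whether the maintainers adopt such a change is not stated in this thread.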

haoyueping (Collaborator) commented

Hi @artemgur, I cannot replicate this error. In your error message, line 145 only raises errors when dataset.columns is empty, i.e., there are no categorical or numerical columns in the input dataset.

--> 145 root_attribute = random.choice(dataset.columns)

Please double-check if this is the case.
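
One way to double-check, offered as a rough sketch rather than an official diagnostic: an empty sequence passed to random.choice raises IndexError, which is a different exception type from the ValueError in the reported traceback. The helper name classify_choice_failure is hypothetical and exists only for this illustration.

```python
import random


def classify_choice_failure(seq):
    """Report which exception, if any, random.choice raises for seq."""
    try:
        random.choice(seq)
    except IndexError:
        # Raised when the sequence is empty.
        return "IndexError"
    except ValueError:
        # Raised, for example, by a container that rejects truth testing.
        return "ValueError"
    return "ok"


if __name__ == "__main__":
    print(classify_choice_failure([]))
    print(classify_choice_failure(["age", "sex"]))
```

Passing dataset.columns from the failing environment to such a helper, or simply printing len(dataset.columns), would show whether the index is actually empty.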

DataSynthesizer has just been updated to 0.1.12. Please feel free to test it out.
