Question on PrivBayes.py #16

Open · pandakky opened this issue Sep 12, 2019 · 4 comments

@pandakky
Thanks for the good work!!
I am looking at the Bayesian network construction code and have a question about lines 153-155 and line 111 in PrivBayes.py:

```python
num_parents = min(len(V), k)
tasks = [(child, V, num_parents, split, dataset) for child, split in
         product(rest_attributes, range(len(V) - num_parents + 1))]
```

What is the rationale behind generating tasks with a different split point into V for each attribute in the rest_attributes list?

It seems like the worker function at line 111 could account for all combinations of (attribute, parents) pairs just by looking at the entire V for each attribute, instead of iterating over every possible V[split:].
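For illustration, the alternative I have in mind would be something like this (a hypothetical sketch, not code from the repo; `all_candidate_pairs` is a name I made up):

```python
from itertools import combinations

def all_candidate_pairs(rest_attributes, V, k):
    """Enumerate every (child, parents) pair exactly once from the entire V."""
    num_parents = min(len(V), k)
    for child in rest_attributes:
        for parents in combinations(V, num_parents):
            yield child, list(parents)
```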

@haoyueping
Collaborator

Hi, these tasks are executed in parallel by multiprocessing.Pool.

@pandakky
Author

pandakky commented Sep 15, 2019

Sorry, I was not clear in my question.
I understand that the tasks will be executed in parallel.
But I cannot see the rationale behind giving each task a different split point into V, as that would lead to the same (child, [parents]) pair being generated multiple times across tasks.
For example, with k = 2, V = [age, nationality, income], and rest_attributes = [race], you get the following parameters for each task:

task 1: race, [age, nationality, income]
task 2: race, [nationality, income]

and each task's worker process then produces these (child, [parents]) pairs:

task 1: (race, [age, nationality]), (race, [age, income]), (race, [nationality, income])
task 2: (race, [nationality, income])

The last pair in task 1 overlaps with the only pair in task 2.

Wouldn't it be more efficient to build the whole list of (child, [parents]) pairs first, before splitting it across tasks for parallel processing? Or have I missed something?

@haoyueping
Collaborator

In task 1, (race, [nationality, income]) won't be generated, since one parent must be 'age' due to parents.append(V[split]).
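To make that concrete, the per-task enumeration in worker looks roughly like this (a simplified sketch; the actual PrivBayes.py also computes mutual information for each pair, which is omitted here):

```python
from itertools import combinations

def worker_pairs(child, V, num_parents, split):
    # V[split] is always included as a parent; the remaining
    # num_parents - 1 parents are drawn from V[split + 1:] only.
    for others in combinations(V[split + 1:], num_parents - 1):
        yield child, [V[split], *others]

# With V = ['age', 'nationality', 'income'] and num_parents = 2:
#   split = 0 -> (race, [age, nationality]), (race, [age, income])
#   split = 1 -> (race, [nationality, income])
# so no (child, parents) pair is generated twice across tasks.
```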

In terms of generating tasks efficiently, the number of (child, parents) pairs is exponential in k (the number of parents), so pre-computing all pairs may cost too much time or memory.

Let m = |rest_attributes|, n = |V|, and k = the number of parents. There are O(m·n^k) (child, parents) pairs in total, while the current implementation generates only m(n - k + 1) tasks.
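This can be sanity-checked numerically: each task with split s yields C(n - s - 1, k - 1) pairs, and summing over all splits recovers C(n, k) by the hockey-stick identity, so the tasks together cover every pair exactly once (a quick check, assuming the simplified enumeration above):

```python
from math import comb

n, k, m = 10, 3, 5
per_task = [comb(n - s - 1, k - 1) for s in range(n - k + 1)]
assert sum(per_task) == comb(n, k)       # hockey-stick identity
print(per_task)                          # [36, 28, 21, 15, 10, 6, 3, 1] -> uneven
print(m * comb(n, k), m * (n - k + 1))   # total pairs vs. number of tasks
```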

The drawback of the current implementation is that the tasks have significantly different workloads, as shown in your example. It would be better to have more balanced tasks. Please feel free to make suggestions.
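One possible direction (just a sketch, untested against the real code; score_pair stands in for the mutual-information computation): keep the enumeration lazy with a generator and let multiprocessing.Pool.imap_unordered hand out fixed-size chunks, which avoids materializing all O(m·n^k) pairs while keeping workers evenly loaded:

```python
from itertools import combinations
from multiprocessing import Pool

def score_pair(pair):
    child, parents = pair
    return child, parents, 0.0  # placeholder for the mutual information score

def lazy_pairs(rest_attributes, V, num_parents):
    # Same pairs as the split-based tasks, yielded one at a time.
    for split in range(len(V) - num_parents + 1):
        for others in combinations(V[split + 1:], num_parents - 1):
            for child in rest_attributes:
                yield child, [V[split], *others]

if __name__ == '__main__':
    V = ['age', 'nationality', 'income']
    with Pool() as pool:
        for result in pool.imap_unordered(score_pair, lazy_pairs(['race'], V, 2),
                                          chunksize=64):
            print(result)
```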

@hamzanaeem1999
@haoyueping Is it possible for the DataSynthesizer library to automatically choose the value of k, epsilon, and all the other necessary hyperparameters on its own, so that we do not have to tune them?
