Question on PrivBayes.py #16

Open · pandakky opened this issue Sep 12, 2019 · 4 comments

@pandakky
Thanks for the good work!!
I am looking at the Bayesian network construction code and have a question about lines 153-155 and line 111 in PrivBayes.py:

```python
num_parents = min(len(V), k)
tasks = [(child, V, num_parents, split, dataset) for child, split in
         product(rest_attributes, range(len(V) - num_parents + 1))]
```

What is the rationale behind generating tasks with a different split point into V for each attribute in the rest_attributes list?

It seems like the worker function at line 111 could account for all combinations of (attribute, parents) pairs just by looking at the entire V for each attribute, instead of iterating over every possible V[split:].
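For illustration, the alternative I have in mind would be something like this (a hypothetical sketch, not code from the repo; `all_candidate_pairs` is a name I made up):

```python
from itertools import combinations

def all_candidate_pairs(rest_attributes, V, k):
    """Enumerate every (child, parents) pair exactly once from the entire V."""
    num_parents = min(len(V), k)
    for child in rest_attributes:
        for parents in combinations(V, num_parents):
            yield child, list(parents)
```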

@haoyueping
Collaborator

Hi, these tasks are executed in parallel by multiprocessing.Pool.

@pandakky
Author

pandakky commented Sep 15, 2019

Sorry, I was not clear in my question.
I understand that the tasks will be executed in parallel.
But I cannot see the rationale behind giving each task a different split point into V, as that would lead to the same (child, [parents]) pair being generated multiple times across tasks.
For example, with k = 2, V = [age, nationality, income], and rest_attributes = [race], you get the following parameters for each task:

task 1: race, [age, nationality, income]
task 2: race, [nationality, income]

and each task's worker process then produces these (child, [parents]) pairs:

task 1: (race, [age, nationality]), (race, [age, income]), (race, [nationality, income])
task 2: (race, [nationality, income])

The last pair in task 1 overlaps with the only pair in task 2.

Wouldn't it be more efficient to build the whole list of (child, [parents]) pairs first, before splitting it across tasks for parallel processing? Or have I missed something?

@haoyueping
Collaborator

In task 1, (race, [nationality, income]) won't be generated, since one parent must be 'age' due to parents.append(V[split]).
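To make that concrete, the per-task enumeration in worker looks roughly like this (a simplified sketch; the actual PrivBayes.py also computes mutual information for each pair, which is omitted here):

```python
from itertools import combinations

def worker_pairs(child, V, num_parents, split):
    # V[split] is always included as a parent; the remaining
    # num_parents - 1 parents are drawn from V[split + 1:] only.
    for others in combinations(V[split + 1:], num_parents - 1):
        yield child, [V[split], *others]

# With V = ['age', 'nationality', 'income'] and num_parents = 2:
#   split = 0 -> (race, [age, nationality]), (race, [age, income])
#   split = 1 -> (race, [nationality, income])
# so no (child, parents) pair is generated twice across tasks.
```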

In terms of generating tasks efficiently, the number of (child, parents) pairs is exponential in k (the number of parents), so pre-computing all pairs may cost too much time or memory.

Let m = |rest_attributes|, n = |V|, and k = the number of parents. There are O(m·n^k) (child, parents) pairs in total, while the current implementation generates only m(n - k + 1) tasks.
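This can be sanity-checked numerically: each task with split s yields C(n - s - 1, k - 1) pairs, and summing over all splits recovers C(n, k) by the hockey-stick identity, so the tasks together cover every pair exactly once (a quick check, assuming the simplified enumeration above):

```python
from math import comb

n, k, m = 10, 3, 5
per_task = [comb(n - s - 1, k - 1) for s in range(n - k + 1)]
assert sum(per_task) == comb(n, k)       # hockey-stick identity
print(per_task)                          # [36, 28, 21, 15, 10, 6, 3, 1] -> uneven
print(m * comb(n, k), m * (n - k + 1))   # total pairs vs. number of tasks
```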

The drawback of the current implementation is that the tasks have significantly different workloads, as shown in your example. It would be better to have more balanced tasks. Please feel free to make suggestions.
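One possible direction (just a sketch, untested against the real code; score_pair stands in for the mutual-information computation): keep the enumeration lazy with a generator and let multiprocessing.Pool.imap_unordered hand out fixed-size chunks, which avoids materializing all O(m·n^k) pairs while keeping workers evenly loaded:

```python
from itertools import combinations
from multiprocessing import Pool

def score_pair(pair):
    child, parents = pair
    return child, parents, 0.0  # placeholder for the mutual information score

def lazy_pairs(rest_attributes, V, num_parents):
    # Same pairs as the split-based tasks, yielded one at a time.
    for split in range(len(V) - num_parents + 1):
        for others in combinations(V[split + 1:], num_parents - 1):
            for child in rest_attributes:
                yield child, [V[split], *others]

if __name__ == '__main__':
    V = ['age', 'nationality', 'income']
    with Pool() as pool:
        for result in pool.imap_unordered(score_pair, lazy_pairs(['race'], V, 2),
                                          chunksize=64):
            print(result)
```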

@hamzanaeem1999
@haoyueping Is it possible for the DataSynthesizer library to automatically choose the value of k, epsilon, and all the other necessary hyperparameters on its own, so that we do not have to tune them?
