A question about the 'adj_mats_orig' #7

Open
bbjy opened this issue Apr 3, 2019 · 3 comments

bbjy commented Apr 3, 2019

@marinkaz Thank you so much for your work! Could you please explain why adj_mats_orig contains both gene_adj and gene_adj.transpose (and likewise for drug_drug_adj_list)? I think gene_adj alone is enough, as in the second snippet below.
Looking forward to your reply. Thank you!

Current data representation:

    adj_mats_orig = {
        (0, 0): [gene_adj, gene_adj.transpose(copy=True)],
        (0, 1): [gene_drug_adj],
        (1, 0): [drug_gene_adj],
        (1, 1): drug_drug_adj_list + [x.transpose(copy=True) for x in drug_drug_adj_list],
    }

In my view, the data representation should be as follows:

    adj_mats_orig = {
        (0, 0): [gene_adj],
        (0, 1): [gene_drug_adj],
        (1, 0): [drug_gene_adj],
        (1, 1): drug_drug_adj_list,
    }

hurleyLi commented Aug 8, 2019

I have the same question. I think this causes a serious data leakage problem in training: the validation and test sets are created independently for gene_adj and gene_adj.transpose(copy=True), so the edges held out for validation / test in gene_adj are actually included in the training set of gene_adj.transpose(copy=True).

The same problem applies to the train / validation split between gene_drug_adj and drug_gene_adj. The validation edges from gene_drug_adj are actually used for training in drug_gene_adj, and vice versa.
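
To make the concern concrete, here is a minimal sketch on a toy symmetric matrix. It is not the repository's code: `split_edges` and `leaked_fraction` are hypothetical helpers, and the random graph only stands in for gene_adj. It splits the matrix and its transpose independently, then measures how many validation edges from one split show up (in either orientation) among the training edges of the other.

```python
import numpy as np
import scipy.sparse as sp

def split_edges(adj, val_frac, rng):
    """Hold out a random fraction of the nonzero entries of adj as validation edges."""
    coo = sp.coo_matrix(adj)
    edges = np.column_stack([coo.row, coo.col])
    idx = rng.permutation(len(edges))
    num_val = int(np.floor(len(edges) * val_frac))
    return edges[idx[num_val:]], edges[idx[:num_val]]  # (train, val)

def leaked_fraction(val_edges, other_train_edges):
    """Fraction of validation edges (i, j) present as (i, j) or (j, i) in another training set."""
    train_set = {(int(i), int(j)) for i, j in other_train_edges}
    hits = sum(
        (int(i), int(j)) in train_set or (int(j), int(i)) in train_set
        for i, j in val_edges
    )
    return hits / len(val_edges)

rng = np.random.default_rng(0)

# Toy symmetric adjacency matrix (an assumption standing in for gene_adj).
upper = sp.triu(sp.random(200, 200, density=0.05, random_state=0), k=1)
gene_adj = ((upper + upper.T) > 0).astype(int)

# Independent splits for the matrix and for its transposed copy.
train_a, val_a = split_edges(gene_adj, 0.1, rng)
train_b, val_b = split_edges(gene_adj.transpose(copy=True), 0.1, rng)

print("val edges of gene_adj seen in training of its transpose:",
      leaked_fraction(val_a, train_b))
```

With independent random splits, most held-out pairs end up in the other split's training set, which is the overlap described above.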

Could you please clarify?
Thanks!

zch42 commented Aug 19, 2019

@bbjy I guess the author uses adj and adj.transpose together to represent undirected graphs for the PPI network and the drug-drug network. For the interactions between proteins and drugs, the information flow is represented with bipartite graphs.
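
Whether the transposed copy adds anything depends on whether the matrix is already symmetric. A quick way to check, sketched here with a hypothetical `is_symmetric` helper and two tiny example matrices (neither is from the repository):

```python
import numpy as np
import scipy.sparse as sp

def is_symmetric(adj):
    """Return True if the sparse matrix equals its own transpose."""
    adj = sp.csr_matrix(adj)
    if adj.shape[0] != adj.shape[1]:
        return False  # a non-square (bipartite) matrix cannot be symmetric
    # The elementwise != of two sparse matrices is sparse; no stored entries means equal.
    return (adj != adj.T).nnz == 0

# Undirected path graph 0 - 1 - 2: every edge is stored in both directions.
undirected = sp.csr_matrix(np.array([[0, 1, 0],
                                     [1, 0, 1],
                                     [0, 1, 0]]))
# Directed path 0 -> 1 -> 2: each edge is stored once.
directed = sp.csr_matrix(np.array([[0, 1, 0],
                                   [0, 0, 1],
                                   [0, 0, 0]]))

print(is_symmetric(undirected))  # True
print(is_symmetric(directed))    # False
```

If gene_adj passes this check, gene_adj.transpose(copy=True) holds exactly the same entries as gene_adj.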

zch42 commented Aug 19, 2019

@hurleyLi I am also confused about the problem you mentioned.
Here is the source code for splitting the training / validation / test edges:

    def mask_test_edges(self, edge_type, type_idx):
        edges_all, _, _ = preprocessing.sparse_to_tuple(self.adj_mats[edge_type][type_idx])
        num_test = max(50, int(np.floor(edges_all.shape[0] * self.val_test_size)))
        num_val = max(50, int(np.floor(edges_all.shape[0] * self.val_test_size)))

        all_edge_idx = list(range(edges_all.shape[0]))
        np.random.shuffle(all_edge_idx)

        val_edge_idx = all_edge_idx[:num_val]
        val_edges = edges_all[val_edge_idx]

        test_edge_idx = all_edge_idx[num_val:(num_val + num_test)]
        test_edges = edges_all[test_edge_idx]

        train_edges = np.delete(edges_all, np.hstack([test_edge_idx, val_edge_idx]), axis=0)

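It seems that the author splits the edges independently for each type_idx, so an edge held out for validation or testing under one type_idx can still appear in the training edges of another (for example, its transposed copy). This would make the training and validation sets overlap.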
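
One possible way to avoid the overlap, sketched under the assumption that the adjacency matrix is symmetric, is to split each undirected pair once and then mirror the subsets, so both directions of an edge always land in the same set. `split_undirected_edges` is a hypothetical helper, not part of the repository, and it ignores self-loops.

```python
import numpy as np
import scipy.sparse as sp

def split_undirected_edges(adj, val_frac, test_frac, seed=0):
    """Split unique pairs (i < j) once, then add the reversed copy of each subset."""
    upper = sp.triu(sp.csr_matrix(adj), k=1).tocoo()  # one entry per undirected edge
    pairs = np.column_stack([upper.row, upper.col])
    rng = np.random.default_rng(seed)
    rng.shuffle(pairs)  # shuffles the rows in place

    num_val = int(np.floor(len(pairs) * val_frac))
    num_test = int(np.floor(len(pairs) * test_frac))
    val = pairs[:num_val]
    test = pairs[num_val:num_val + num_test]
    train = pairs[num_val + num_test:]

    def mirror(edges):
        # Keep (i, j) and (j, i) together in the same subset.
        return np.vstack([edges, edges[:, ::-1]])

    return mirror(train), mirror(val), mirror(test)

# Toy symmetric adjacency matrix (an assumption, not the real data).
upper = sp.triu(sp.random(50, 50, density=0.2, random_state=1), k=1)
adj = ((upper + upper.T) > 0).astype(int)

train, val, test = split_undirected_edges(adj, val_frac=0.1, test_frac=0.1)
print(len(train), len(val), len(test))
```

The mirrored train subset could then serve both the matrix and its transposed copy, so no held-out pair is seen during training in either direction.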
