
Get SCITE Most-Likely Tree #141

Open
gordonkoehn opened this issue Aug 6, 2023 · 10 comments

Comments

@gordonkoehn
Collaborator

gordonkoehn commented Aug 6, 2023

SCITE is not complete yet. Currently, the final step of selecting the most likely (ML) tree, i.e. the most frequent tree in the MCMC samples, is missing.
To my understanding, this requires comparing each of the N trees in the MCMC samples with all N trees, i.e. $N^2$ comparisons. This should return a symmetric N x N boolean matrix. One can then simply compute the row (or column) sums; the index with the highest sum corresponds to the most frequent, and hence most likely, tree.

Is there a smarter way?

Ideally, this would be encapsulated in a function that takes the MCMC samples and returns the ML tree along with its log-probability.

Later, one may devise a Snakemake rule to add this to the MCMC workflows, and also compute distances and produce plots.

This is not required for any of the currently devised experiments, but it should be added at some point for a full version of SCITE!

@gordonkoehn
Collaborator Author

The function to compare N x N trees is already implemented; see distances.calculate_distance_matrix.
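
For reference, here is a minimal sketch of how the pieces could fit together. It assumes that distances.calculate_distance_matrix returns a dense N x N array of pairwise tree distances (its exact signature may differ) and that a distance of zero means two trees are identical:

import numpy as np

def most_frequent_tree(samples, log_probs, distance_matrix):
    # `distance_matrix` is the N x N array of pairwise tree distances, e.g. as
    # produced by distances.calculate_distance_matrix (signature assumed here);
    # a distance of zero is taken to mean that two trees are identical.
    identical = distance_matrix == 0       # symmetric N x N boolean matrix
    counts = identical.sum(axis=1)         # row sums = multiplicity of each sampled tree
    idx = int(np.argmax(counts))           # index of the most frequent tree
    return samples[idx], log_probs[idx]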

@gordonkoehn
Collaborator Author

@pawel-czyz I just realized we may want to complete SCITE at some point :)

@pawel-czyz
Member

To get the maximum likelihood (ML) tree from the samples, the simplest solution would be to just take:

import numpy as np  # samples and loglikelihood_array as produced by the MCMC run

ml_tree_index = np.argmax(loglikelihood_array)
ml_tree = samples[ml_tree_index]

However, taking this from the samples can be problematic: generally the samples don't contain the ML solution (this is especially evident in high-dimensional Euclidean spaces. For example, consider the normal distribution $\mathcal N(0, I_d)$ in $d=100$ dimensions. The ML solution is at 0, yet samples from perfect Monte Carlo will be quite far from there; the $\chi^2(d)$ distribution of the squared distance to the mode provides intuition why).
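
As a quick numerical illustration of that intuition (a standalone snippet, not part of the codebase): the distance of an exact sample from the mode of $\mathcal N(0, I_d)$ concentrates around $\sqrt{d}$, so for $d = 100$ the samples sit roughly at distance 10 from the ML point at 0.

import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100):
    x = rng.standard_normal(size=(10_000, d))   # 10k exact samples from N(0, I_d)
    norms = np.linalg.norm(x, axis=1)           # distance of each sample from the mode at 0
    print(d, norms.mean())                      # concentrates near sqrt(d), i.e. ~10 for d = 100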

For discrete spaces this behavior may be somewhat different and I remember that Jack had some encouraging results comparing MCMC to optimization methods like hill climbing, but I don't remember the details...

We can discuss it further, but my guess is that it's not that high a priority right now: we have seen evidence that the posterior can be multimodal, so an SMC sampler should be a better choice than a single ML tree. (Another option would be to try to find the different modes by plotting local likelihood maxima.)

@gordonkoehn
Collaborator Author

Yes, I think I understand the concern. How is this done in the current version of SCITE then? I am interested to understand this.

Absolutely, low priority.

@pawel-czyz
Member

pawel-czyz commented Aug 8, 2023

If I recall a conversation with Jack properly, they use simulated annealing for MAP estimation, but I'm not sure...

@gordonkoehn
Collaborator Author

Why would simulated annealing be better here? Isn't that used when finding the optimal solution would take too long?

@pawel-czyz
Member

The primary point here is that MCMC may never visit the MLE point (at least in continuous domains; for discrete ones I think the situation may be different), so optimization is preferable for finding it. One way of doing the optimization is simulated annealing; others are hill climbing, hill climbing + tempering, etc.

I like simulated annealing mostly because it works quite well even in multimodal problems. However, as we don't seem to have these issues, hill climbing (with multiple restarts) may be a viable strategy (or it's possible that MCMC will find the MLE tree as well).
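
For concreteness, a rough sketch of hill climbing with random restarts over a discrete tree space; random_tree, neighbors, and log_likelihood are hypothetical placeholders for whatever proposal and scoring functions we end up with, not existing functions in the codebase:

import numpy as np

def hill_climb(random_tree, neighbors, log_likelihood, n_restarts=10, seed=0):
    # Greedy hill climbing with random restarts. `random_tree(rng)` draws a
    # starting tree, `neighbors(tree)` yields candidate moves, and
    # `log_likelihood(tree)` scores a tree (all three are placeholders).
    rng = np.random.default_rng(seed)
    best_tree, best_ll = None, -np.inf
    for _ in range(n_restarts):
        tree = random_tree(rng)
        ll = log_likelihood(tree)
        improved = True
        while improved:
            improved = False
            for candidate in neighbors(tree):
                cand_ll = log_likelihood(candidate)
                if cand_ll > ll:               # greedily accept the first improving move
                    tree, ll, improved = candidate, cand_ll, True
                    break
        if ll > best_ll:
            best_tree, best_ll = tree, ll
    return best_tree, best_ll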

@gordonkoehn
Collaborator Author

Thank you for the context, but I still don't fully understand why the fact that the MLE point may not be visited is a direct argument for optimization over taking the arg max of the likelihood over the samples. Does the optimization case assume that states neighbouring the MLE are also highly likely, trying to avoid a single more likely outlier? Perhaps something for our meeting on Wednesday. Any Betancourt blogs on this?

@pawel-czyz
Member

pawel-czyz commented Sep 12, 2023

I remember a paper, but I can't find it. The first 15 minutes of this explain why the typical set (which MCMC explores) may not contain the MAP.

Also, there's a simple experiment: try sampling from $\mathcal N(0, I_K)$ for $K \in \{1, 10, 100\}$. How many sampled data points have a PDF of at least 99% of the value at 0?

(All of the above holds in continuous spaces. In tree spaces the behaviour may be different.)
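
A standalone sketch of that experiment: draw exact samples from $\mathcal N(0, I_K)$ and count how many have a density of at least 99% of the density at the mode (for the standard normal, the ratio is simply $\exp(-\lVert x \rVert^2 / 2)$).

import numpy as np

rng = np.random.default_rng(0)
for k in (1, 10, 100):
    x = rng.standard_normal(size=(100_000, k))       # exact samples from N(0, I_K)
    ratio = np.exp(-0.5 * (x ** 2).sum(axis=1))      # pdf(x) / pdf(0) for the standard normal
    print(k, (ratio >= 0.99).mean())                 # ~0.11 for K = 1, essentially 0 for K = 10 and 100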

@gordonkoehn
Collaborator Author

Watched it :) Thank you. Yes, I understand this argument: the typical set may not include the MAP. For a typical set in 2D: is the idea then that argmax will return a MAP estimate from the circle around the mode, while simulated annealing may venture to jump somewhere inside the circle, i.e. onto the mode, as the temperature lowers, giving us a chance to find the true mode? Let's draw this tomorrow, super interesting :)
