
Get SCITE Most-Likely Tree #141

Open
gordonkoehn opened this issue Aug 6, 2023 · 10 comments

Comments

@gordonkoehn
Collaborator

gordonkoehn commented Aug 6, 2023

SCITE is not complete yet. Currently, the final step of selecting the most likely (ML) tree, i.e. the most frequent tree in the MCMC samples, is missing.
To my understanding, this requires comparing each of the N trees in the MCMC samples with all N trees, i.e. $N^2$ comparisons. This should return a symmetric N x N boolean matrix. One can then simply compute the row (or column) sums; the index with the highest sum corresponds to the most frequent, and hence most likely, tree.

Is there a smarter way?

Ideally, this would be encapsulated in a function that takes the MCMC samples and returns the ML tree along with its log-probability.

Later, one may devise a Snakemake rule to add this to the MCMC workflows, and also compute distances and produce plots.

This is not required for any of the currently devised experiments, but it should be added at some point for a full version of SCITE!

@gordonkoehn
Collaborator Author

The function to compare N x N trees is already implemented; see distances.calculate_distance_matrix.
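
For reference, here is a minimal sketch of how the pieces could fit together. It assumes that distances.calculate_distance_matrix returns a dense N x N array of pairwise tree distances (its exact signature may differ) and that a distance of zero means two trees are identical:

import numpy as np

def most_frequent_tree(samples, log_probs, distance_matrix):
    # `distance_matrix` is the N x N array of pairwise tree distances, e.g. as
    # produced by distances.calculate_distance_matrix (signature assumed here);
    # a distance of zero is taken to mean that two trees are identical.
    identical = distance_matrix == 0       # symmetric N x N boolean matrix
    counts = identical.sum(axis=1)         # row sums = multiplicity of each sampled tree
    idx = int(np.argmax(counts))           # index of the most frequent tree
    return samples[idx], log_probs[idx]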

@gordonkoehn
Collaborator Author

@pawel-czyz I just realized we may want to complete SCITE at some point :)

@pawel-czyz
Member

To get the maximum likelihood (ML) tree from the samples, the simplest solution would be to just take:

import numpy as np  # samples and loglikelihood_array as produced by the MCMC run

ml_tree_index = np.argmax(loglikelihood_array)
ml_tree = samples[ml_tree_index]

However, taking this from the samples can be problematic: generally the samples don't contain the ML solution (this is especially evident in high-dimensional Euclidean spaces. For example, consider the normal distribution $\mathcal N(0, I_d)$ in $d=100$ dimensions. The ML solution is at 0, yet samples from perfect Monte Carlo will be quite far from there; the $\chi^2(d)$ distribution of the squared distance to the mode provides intuition why).
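
As a quick numerical illustration of that intuition (a standalone snippet, not part of the codebase): the distance of an exact sample from the mode of $\mathcal N(0, I_d)$ concentrates around $\sqrt{d}$, so for $d = 100$ the samples sit roughly at distance 10 from the ML point at 0.

import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100):
    x = rng.standard_normal(size=(10_000, d))   # 10k exact samples from N(0, I_d)
    norms = np.linalg.norm(x, axis=1)           # distance of each sample from the mode at 0
    print(d, norms.mean())                      # concentrates near sqrt(d), i.e. ~10 for d = 100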

For discrete spaces this behavior may be somewhat different and I remember that Jack had some encouraging results comparing MCMC to optimization methods like hill climbing, but I don't remember the details...

We can discuss it further, but my guess is that it's not that high a priority right now: we have seen evidence that the posterior can be multimodal, so an SMC sampler should be a better choice than a single ML tree. (Another option would be to try to find the different modes by plotting local likelihood maxima.)

@gordonkoehn
Collaborator Author

Yes, I think I understand the concern. How is this done in the current version of SCITE then? I am interested to understand this.

Absolutely, low priority.

@pawel-czyz
Member

pawel-czyz commented Aug 8, 2023

If I recall a conversation with Jack properly, they use simulated annealing for MAP estimation, but I'm not sure...

@gordonkoehn
Collaborator Author

Why would simulated annealing be better here? Isn't that used when finding the optimal solution would take too long?

@pawel-czyz
Member

The primary point here is that MCMC may never visit the MLE point (at least in continuous domains; for discrete ones I think the situation may be different), so optimization is preferable for finding it. One way of doing the optimization is simulated annealing; others are hill climbing, hill climbing + tempering, etc.

I like simulated annealing mostly because it works quite well even in multimodal problems. However, as we don't seem to have these issues, hill climbing (with multiple restarts) may be a viable strategy (or it's possible that MCMC will find the MLE tree as well).
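
For concreteness, a rough sketch of hill climbing with random restarts over a discrete tree space; random_tree, neighbors, and log_likelihood are hypothetical placeholders for whatever proposal and scoring functions we end up with, not existing functions in the codebase:

import numpy as np

def hill_climb(random_tree, neighbors, log_likelihood, n_restarts=10, seed=0):
    # Greedy hill climbing with random restarts. `random_tree(rng)` draws a
    # starting tree, `neighbors(tree)` yields candidate moves, and
    # `log_likelihood(tree)` scores a tree (all three are placeholders).
    rng = np.random.default_rng(seed)
    best_tree, best_ll = None, -np.inf
    for _ in range(n_restarts):
        tree = random_tree(rng)
        ll = log_likelihood(tree)
        improved = True
        while improved:
            improved = False
            for candidate in neighbors(tree):
                cand_ll = log_likelihood(candidate)
                if cand_ll > ll:               # greedily accept the first improving move
                    tree, ll, improved = candidate, cand_ll, True
                    break
        if ll > best_ll:
            best_tree, best_ll = tree, ll
    return best_tree, best_ll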

@gordonkoehn
Collaborator Author

Thank you for the context, but I still don't fully understand why the fact that the MLE point may not be visited is a direct argument for optimization over taking the arg max of the likelihood over the samples. Does the optimization case assume that states neighbouring the MLE are also highly likely, trying to avoid a single more likely outlier? Perhaps something for our meeting on Wednesday. Any Betancourt blogs on this?

@pawel-czyz
Member

pawel-czyz commented Sep 12, 2023

I remember a paper, but I can't find it. The first 15 minutes of this explain why the typical set (which MCMC explores) may not contain the MAP.

Also, there's a simple experiment: try sampling from $\mathcal N(0, I_K)$ for $K \in \{1, 10, 100\}$. How many sampled data points have a PDF of at least 99% of the value at 0?

(All of the above holds in continuous spaces. In tree spaces the behaviour may be different.)
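
A standalone sketch of that experiment: draw exact samples from $\mathcal N(0, I_K)$ and count how many have a density of at least 99% of the density at the mode (for the standard normal, the ratio is simply $\exp(-\lVert x \rVert^2 / 2)$).

import numpy as np

rng = np.random.default_rng(0)
for k in (1, 10, 100):
    x = rng.standard_normal(size=(100_000, k))       # exact samples from N(0, I_K)
    ratio = np.exp(-0.5 * (x ** 2).sum(axis=1))      # pdf(x) / pdf(0) for the standard normal
    print(k, (ratio >= 0.99).mean())                 # ~0.11 for K = 1, essentially 0 for K = 10 and 100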

@gordonkoehn
Collaborator Author

Watched it :) Thank you. Yes, I understand this argument: the typical set may not include the MAP. For a typical set in 2D: is the idea then that argmax will return a MAP estimate from the circle around the mode, while simulated annealing may venture to jump somewhere inside the circle, i.e. onto the mode, as the temperature lowers, giving us a chance to find the true mode? Let's draw this tomorrow, super interesting :)
