
Non-MP Node Support #4

Open
williamhowardsnyder opened this issue Feb 10, 2023 · 3 comments

Comments

@williamhowardsnyder
Contributor

williamhowardsnyder commented Feb 10, 2023

Node Support Definition

Node support is an estimate of the confidence that a clade appears in the true tree. Formally, it is the sum of the posterior probabilities of all trees in which the node appears. Let $v$ be our node of interest, and let the set of trees containing it be $T_v = \{t \in T : v \in t\}$, where $T$ is the set of all trees with non-zero posterior. Then the support can be written as
$$Pr(v \mid D) = \sum_{t \in T_v} Pr( t \mid D)$$

Our current assumption is that all MP trees have uniform probability (we also assume the hDAG contains all MP trees, or at least a good chunk of them). In that case,
$$Pr( t \mid D) = 1 / |T|,$$
where $T$ is now the set of trees in the DAG. The support then becomes
$$Pr(v \mid D) = |T_v| / |T|.$$
In other words, the support of $v$ is the proportion of trees in the DAG that contain $v$. The idea is that all MP trees are equally optimal candidates, so the posterior should reflect that.
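As a minimal sketch of this uniform-posterior support, we can represent each tree by the set of clades (leaf sets) it contains and count the fraction of trees containing a given clade (the trees and leaf names below are hypothetical, not drawn from a real DAG):

```python
# Minimal sketch of uniform node support: each tree is represented by the
# set of clades (frozensets of leaves) it contains. Hypothetical trees.
trees = [
    {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")},
    {frozenset("AB"), frozenset("ABD"), frozenset("ABCD")},
    {frozenset("AB"), frozenset("ABC"), frozenset("ABCD")},
]

def support(clade, trees):
    """Pr(v | D) = |T_v| / |T| under a uniform posterior on trees."""
    return sum(clade in t for t in trees) / len(trees)

print(support(frozenset("AB"), trees))   # in every tree -> 1.0
print(support(frozenset("ABD"), trees))  # in one of three trees
```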

However, there are two challenges we've identified with this approach:

  1. In simulation, we often find that the true tree is not MP, although usually it is close.
  2. There tend to be very few low-support nodes (i.e., most nodes appear in all MP trees). This is great if we only care about guessing which nodes are in the true tree. However, we'd also like to assign "accurate" node supports to nodes that are less common.

Is there a way to redefine the posterior on trees in the DAG to account for these challenges? A couple of ideas that might be worth exploring in this benchmarking effort:

  • Uniform distribution over trees (MP or not) in the DAG. We could trim the DAG to a parsimony threshold to try to remove trees that are really bad. We could also trim to MP trees, then re-run larch-usher, which is biased towards finding high parsimony trees.
  • Consider all trees, but with a posterior that is a function of parsimony. E.g., $Pr(t(p) \mid D) \propto e^{-p}$ where $p$ is the parsimony score of $t$. Computing this would require decomposing tree probabilities over edges, which would be a project in the DAG realm.
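The second idea can be sketched as a parsimony-weighted posterior truncated to an explicit set of trees (the tree names and parsimony scores below are made up for illustration):

```python
import math

# Sketch of Pr(t | D) proportional to exp(-p), truncated to a list of
# trees; hypothetical parsimony scores.
parsimony = {"t1": 10, "t2": 10, "t3": 12}

weights = {t: math.exp(-p) for t, p in parsimony.items()}
Z = sum(weights.values())
posterior = {t: w / Z for t, w in weights.items()}

# Each MP tree (score 10) gets exp(2) ~ 7.4x the weight of the score-12 tree.
```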
@matsen
Contributor

matsen commented Feb 10, 2023

There is some precedent for your option 2, but how are you going to get these sub-parsimonious trees? My hope is that the fitch-staring that @willdumm is doing will allow us to understand these implicitly.

Or can we have a neural network find the signal for us? 😎

@willdumm
Contributor

willdumm commented Feb 21, 2023

For option 2, I'd like to do something (e.g. what William suggested in 1, or the fruits of my tree-staring project) to make the DAG include all the trees that we want to consider, so we can truncate the underlying distribution (like $e^{-p}$ that William suggests) to the collection of trees in the DAG.

I'd like to understand how this truncated distribution decomposes over DAG edges, and how it can be represented as a collection of downward-conditional edge probability distributions on sets of edges descending from each node-clade of the dag. For example, the distribution $Pr(t(p) \mid D) \propto e^{-p}$ decomposes as a product over edges, since parsimony score $p$ decomposes as a sum over edges. However, we don't want this distribution, but rather this distribution conditioned on a history being in the set $T$ of histories in the DAG: $Pr(t(p) \mid D, t\in T)$.

In order to compute downward conditional distributions representing this conditional distribution on histories in $T$, we could start by computing the probability of each node in the DAG. This is easy in a DAG containing all possible histories on the data: we can use the following decomposition of a history's probability:

[Equation image: a history's probability is written as a sum over history fragments containing the node, which factors into a downward part and an upward part (equations (1) and (2)).]

... where in (2) the sum is over all such history fragments, and the dot represents a node whose probability we're trying to compute. The farthest right equality in (2) decomposes a node's probability into a downward part and an upward part. Each of these can be computed in the DAG efficiently by dynamic programming. Note that (2) gives node support. The issue is what happens when the sum is restricted to history fragments which are in a DAG which does not contain all possible histories.
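The downward/upward dynamic program can be sketched on a toy DAG-like structure. This is not the historydag API; the node names, the "clade slot" representation (each internal node has one slot per child clade, each slot listing alternative weighted edges), and all weights are invented for illustration, and the example assumes the DAG contains every history we care about:

```python
import math

# Toy DAG: a UA node above two alternative roots r1 ((xy)z) and r2 (x(yz)).
# Each internal node maps to a list of clade slots; each slot is a list of
# (child, edge_weight) alternatives. A history picks one edge per slot.
dag = {
    "UA": [[("r1", 1.0), ("r2", 1.0)]],
    "r1": [[("u1", 1.0)], [("z", 1.0)]],
    "r2": [[("u2", 1.0)], [("x", 1.0)]],
    "u1": [[("x", 1.0)], [("y", 1.0)]],
    "u2": [[("y", 1.0)], [("z", 1.0)]],
}
leaves = {"x", "y", "z"}

below = {leaf: 1.0 for leaf in leaves}
def compute_below(node):
    # Downward pass: sum of F over sub-histories rooted at node.
    if node not in below:
        below[node] = math.prod(
            sum(w * compute_below(child) for child, w in slot)
            for slot in dag[node]
        )
    return below[node]

Z = compute_below("UA")  # sum of F over all histories in the DAG

above = {"UA": 1.0}
for node in ["UA", "r1", "r2", "u1", "u2"]:  # preorder over internal nodes
    for slot in dag[node]:
        # Contribution of the sibling clade slots, outside this child's subtree.
        others = math.prod(
            sum(w * below[c] for c, w in s) for s in dag[node] if s is not slot
        )
        for child, w in slot:
            above[child] = above.get(child, 0.0) + w * above[node] * others

def support(node):
    # Node support as in (2): downward part times upward part, normalized.
    return above[node] * below[node] / Z

print(support("u1"))  # in 1 of 2 equally weighted histories -> 0.5
```

With all edge weights equal this reduces to the uniform-over-trees support; unequal weights (e.g. $e^{-H(e)}$) give the parsimony-weighted version.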

@williamhowardsnyder @marybarker at some point (I guess next week when I'm back in person, but this week could also work) I'd like to work through this with you. Mary, I know we thought about this a lot, and maybe you totally get it now, but it would be great for me to finally understand clearly. This is similar to stuff that's already been figured out for the sDAG, which is more complicated since in the hDAG we're considering probabilities which decompose over edges, without marginalizing over possible parent/child sequences, etc.

@marybarker

marybarker commented Mar 1, 2023

Here's a brief summary of a more updated version of parsimony-weighted node support, based on our discussions this past week:

Notation

Given an edge $e = (p, c)$ between parent node $p$ and child node $c$, define the following functions of $e$:

  • $cl(e)$ the clade from which $e$ descends (a clade here is considered as a set of edges)
  • $p(e)$ the parent node of $e$
  • $c(e)$ the child node of $e$
  • $T(e)$ the set of all subtrees rooted at $p(e)$ that pass through $c(e)$, i.e., all of the subtrees containing the edge $e$.

Tree probabilities that decompose over edges

Let $F(t)$ be a non-negative function that assigns values to the trees $t$ in the DAG.
We say that $F$ decomposes over edges if for all trees $t$,
$$F(t) = \prod\limits_{e \in t}F(e)$$
That is, we can write the function $F$ as a product over the edges in the tree $t$.
(This is a slight abuse of notation: we use $F$ for both trees and edges.)

Given this edge-weight function, we can then sample trees from the DAG with the probability distribution
$$P(t) = \frac{F(t)}{\sum\limits_{t'\in DAG}F(t')}$$
And these probabilities can be precomputed for all of the trees in the DAG in a dynamic programming traversal of the edges in the DAG.
In the historydag project, this is done using the postorder_history_accum function to aggregate subtree information in a single postorder traversal of the DAG edges.
In this traversal, we define the downward conditional edge probability for each edge $e$ as
$$P(e) := \frac{\sum\limits_{t\in T(e)} F(t)}{\sum\limits_{e'\in cl(e)}\sum\limits_{t'\in T(e')}F(t')}$$
This is the probability that the edge $e$ is in a tree sampled from the distribution, given that $p(e)$ is in the sampled tree.
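For a single clade, this downward conditional probability can be sketched directly. The edge weights $F(e)$ and the subtree sums (the $\sum_{t \in T(e)} F(t)$ terms, which factor as $F(e)$ times the sum over subtrees below $c(e)$) are hypothetical numbers here, standing in for values a postorder traversal would compute:

```python
# One clade slot offering two alternative edges, as (child, F(e)) pairs.
edges = [("c1", 0.5), ("c2", 2.0)]
# Hypothetical sums of F over subtrees rooted at each child, as a postorder
# pass would provide.
below = {"c1": 3.0, "c2": 1.0}

denom = sum(w * below[c] for c, w in edges)
P = {c: w * below[c] / denom for c, w in edges}
# Sampling one edge per clade with these probabilities yields whole trees
# distributed as F(t) / sum over trees of F(t').
```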

Example: parsimony weighting

Let $F(e) = \exp(-H(e))$, where $H(e)$ is the Hamming distance between the node labels of $p(e)$ and $c(e)$.
Then
$$F(t) = \prod\limits_{e\in t}F(e) = \prod\limits_{e\in t}\exp(-H(e)) = \exp(-p),$$
where $p$ is the parsimony score of $t$.
Thus we can weight the trees in the DAG by their parsimony scores, so that trees with better (lower) parsimony scores are assigned greater weight.
(Sorry for using $\exp$, but we need to keep edges and exponentials separate.)
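A small sketch of this edge weighting (the sequences are made-up examples, not real node labels):

```python
import math

# F(e) = exp(-H(e)), so a tree's weight is the product over its edges,
# which equals exp(-parsimony score).
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def edge_weight(parent_seq, child_seq):
    return math.exp(-hamming(parent_seq, child_seq))

# A two-edge path with edge distances 1 and 2: total weight exp(-3),
# i.e. exp(-parsimony score) of the path.
w = edge_weight("ACGT", "ACGA") * edge_weight("ACGA", "TCGC")
```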

Caveat

Note that a probability distribution on the DAG based on a numeric value that varies between trees does not ensure that we sample the highest-value trees with the greatest aggregate frequency, since the distribution is a weighted frequency distribution.
Thus, in the parsimony example, if the DAG contains a high proportion of subparsimonious trees, the distribution will reflect this proportion, although it is skewed to favor maximally parsimonious trees.
If the best parsimony score of any tree in the DAG is $p$, and the number of maximally parsimonious trees is $MP$, then the probability of a given maximally parsimonious tree $t$ is
$$P(t) = \frac{\exp(-p)}{\sum\limits_{t'\in DAG}F(t')}$$
The numerator may be higher for $t$ than for any single subparsimonious tree, but the overall probability of choosing a maximally parsimonious tree is
$$\frac{\exp(-p)\cdot MP}{\sum\limits_{t'\in DAG}F(t')}$$
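A numeric illustration of this caveat, with entirely hypothetical counts: 2 MP trees of score 10 versus 100 subparsimonious trees of score 12. Each MP tree individually outweighs each subparsimonious tree by $e^2 \approx 7.4$, yet the aggregate probability of sampling an MP tree is well under one half:

```python
import math

# Hypothetical counts and parsimony scores.
n_mp, p_mp = 2, 10
n_sub, p_sub = 100, 12

Z = n_mp * math.exp(-p_mp) + n_sub * math.exp(-p_sub)
pr_one_mp = math.exp(-p_mp) / Z         # per-tree probability of one MP tree
pr_any_mp = n_mp * math.exp(-p_mp) / Z  # aggregate probability of any MP tree
# pr_any_mp is only about 0.13: the many subparsimonious trees dominate.
```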
