Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simulating gene trees with different ploidy and molecular clocks #16

Open
arleyc opened this issue Dec 11, 2024 · 5 comments
Open

simulating gene trees with different ploidy and molecular clocks #16

arleyc opened this issue Dec 11, 2024 · 5 comments

Comments

@arleyc
Copy link

arleyc commented Dec 11, 2024

I need to simulate gene trees and sequence data based on an empirical network.

I understand that I must convert branch lengths from coalescent units to substitutions per site using a distribution of rate variation among species, as described in the manual. Can I estimate the rate variation directly from the empirical data to set the parameters for the log-normal distribution?

The empirical data consist of several nuclear loci and one mitochondrial locus. How should I account for differences in effective population size (Ne)? Should I simulate the mitochondrial gene tree separately and adjust the branch lengths in the network accordingly?

When evaluating the simulated data, I need to compare strict vs. relaxed clock models. I was planning to use simclock (github.com/dosreislab/simclock), which assumes a timetree. I believe the gene tree in coalescent units can serve as a timetree, as these units represent the number of generations. Would you recommend a different approach?

Thanks to the team for the PCS package!

@cecileane
Copy link
Member

This is a great question! and I'm sure there are better ways than others to estimate rate variation from empirical data. Something that I've been thinking about actually (and @laufran too). I don't know of any published method that would do so, on a network, and accounting for the coalescent.

One basic idea might be to focus on genes whose tree matches the "major" tree in the network, and make the simplifying assumption that their branch lengths, in substitutions per sites, are equal to the scaled length of the major tree in the network, scaled by the rate of interest (substitutions / site / coalescent unit), with some residual variation. That would basically ignore the extra lengths in the gene trees that are due to the time taken to coalesce within a population.

For the mitochondrial locus, Ne is 1/4 of what it is for autosomal loci in diploids, so that should be an easy factor to account for.
On trees, there are a few methods that aim to estimate a rate specific to each species lineage, I think. I am most familiar with papers that we cite in this preprint. For example, ClockstaRX by Duchêne et al. 2024

Species trees and networks with edge lengths in number of generations seem like they are good "timetrees". But I don't think that coalescent units are generally good a good proxy for time, because of dependence on Ne. A bottleneck would make the coalescent units skyrocket (generations / very small number) compared to the background scaling of generations into coalescent units. Just my 2 cents here!

@arleyc
Copy link
Author

arleyc commented Dec 18, 2024

Thank you @cecileane for your feedback! It seems that there are no available methods to specify certain empirically estimated parameters in network-based simulations,... so I will need to make some simplifying assumptions!

@cecileane
Copy link
Member

Hmm, I do not understand how you got to that conclusion. Yes, PhyloCoalSimulations can specify certain empirically estimated parameters.
Here is an example, with 2 species A and B to make it simple. Let's say that we empirically estimated:

  • a species tree (no reticulation, to make the example simple)
  • 1 coalescent unit between the speciation event and A, also 1 coalescent unit between the speciation event and B (again, numbers to make things simple)
  • we also estimated the average # of substitutions along each species lineage, say 0.06 in A, 0.10 in B. That's 0.06/1 = 0.06 substitutions / coalescent unit in A, and 0.10/1 = 0.10 substitutions / coalescent units in B. And, say, a substitution rate in the lineage prior to their speciation that is equal to the average rate in A and B: 0.08.

We could use these empirically estimated parameters to simulate gene trees like this:

julia> using PhyloNetworks, PhyloCoalSimulations # version 1.0.0 of PhyloNetworks used here, and v1.0.0 of PhyloCoalSimulations

julia> net = readnewick("(A:1,B:1)root;")
HybridNetwork, Rooted Network
2 edges
3 nodes: 2 tips, 0 hybrid nodes, 1 internal tree nodes.
tip labels: A, B
(A:1.0,B:1.0)root;

julia> trees = simulatecoalescent(net, 2, 1; nodemapping=true); # simulate 2 gene trees, 1 ind/species

julia> writemultinewick(trees, stdout) # edge lengths are in coalescent units for now
((B:1.0)root:2.0122684387410628,(A:1.0)root:2.0122684387410628);
((A:1.0)root:1.2942200969881332,(B:1.0)root:1.2942200969881332);

Next we want to convert the lengths in gene trees from coalescent units to substitutions per sites. We first map edge numbers in the species network to substitution rates:

julia> printnodes(net) # node number 1=A, 2=B
node leaf  hybrid name i_cycle edges'numbers
1    true  false  A    -1      1   
2    true  false  B    -1      2   
-2   false false  root -1      1    2   

julia> printedges(net) # edge 1 goes to A, edge 2 goes to B
edge parent child  length  hybrid ismajor gamma   containroot i_cycle
1    -2     1      1.000   false  true    1       true        -1     
2    -2     2      1.000   false  true    1       true        -1     

julia> PhyloCoalSimulations.get_rootedgenumber(net) # edge above the root = edge 3
3

julia> networkedge_rate = Dict(  # map edge number to substitutions / coal units
         1 => 0.06/1, # edge root -> A
         2 => 0.10/1, # edge root -> B
         3 => 0.08,   # edge above the root
      )
Dict{Int64, Float64} with 3 entries:
  2 => 0.1
  3 => 0.08
  1 => 0.06

We can now scale each edge in each gene tree by the rate of the population this gene edge evolved in:

julia> for genetree in trees
         for e in genetree.edge
           e.length *= networkedge_rate[population_mappedto(e)]
         end
       end

julia> writemultinewick(trees, stdout) # edge lengths in substitutions now
((B:0.1)root:0.16098147509928504,(A:0.06)root:0.16098147509928504);
((A:0.06)root:0.10353760775905066,(B:0.1)root:0.10353760775905066);

julia> removedegree2nodes!.(trees, true); # clean gene trees: removed degree-2 nodes that tracked speciation time

julia> writemultinewick(trees, stdout) # lengths still in number of substitutions
(B:0.26098147509928504,A:0.22098147509928504);
(A:0.16353760775905066,B:0.20353760775905066);

@arleyc
Copy link
Author

arleyc commented Dec 18, 2024

Yes, my conclusion was too general. I realize that the estimated topology and branch lengths can be used to simulate gene trees within a network or a tree. I was specifically referring to the issue of lineage-specific rate variation for simulating relaxed clock models. However, as you pointed out, tools like ClockstaRX can be used to simulate this within species trees.

@cecileane
Copy link
Member

I would say that ClockstaRX can estimate empirical parameters from data (but not simulate), and then PhyloCoalSimulations can use these parameters to simulate within species trees and networks (but not estimate the empirical parameters).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants