-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
simulating gene trees with different ploidy and molecular clocks #16
Comments
This is a great question! and I'm sure there are better ways than others to estimate rate variation from empirical data. Something that I've been thinking about actually (and @laufran too). I don't know of any published method that would do so, on a network, and accounting for the coalescent. One basic idea might be to focus on genes whose tree matches the "major" tree in the network, and make the simplifying assumption that their branch lengths, in substitutions per sites, are equal to the scaled length of the major tree in the network, scaled by the rate of interest (substitutions / site / coalescent unit), with some residual variation. That would basically ignore the extra lengths in the gene trees that are due to the time taken to coalesce within a population. For the mitochondrial locus, Ne is 1/4 of what it is for autosomal loci in diploids, so that should be an easy factor to account for. Species trees and networks with edge lengths in number of generations seem like they are good "timetrees". But I don't think that coalescent units are generally good a good proxy for time, because of dependence on Ne. A bottleneck would make the coalescent units skyrocket (generations / very small number) compared to the background scaling of generations into coalescent units. Just my 2 cents here! |
Thank you @cecileane for your feedback! It seems that there are no available methods to specify certain empirically estimated parameters in network-based simulations,... so I will need to make some simplifying assumptions! |
Hmm, I do not understand how you got to that conclusion. Yes,
We could use these empirically estimated parameters to simulate gene trees like this: julia> using PhyloNetworks, PhyloCoalSimulations # version 1.0.0 of PhyloNetworks used here, and v1.0.0 of PhyloCoalSimulations
julia> net = readnewick("(A:1,B:1)root;")
HybridNetwork, Rooted Network
2 edges
3 nodes: 2 tips, 0 hybrid nodes, 1 internal tree nodes.
tip labels: A, B
(A:1.0,B:1.0)root;
julia> trees = simulatecoalescent(net, 2, 1; nodemapping=true); # simulate 2 gene trees, 1 ind/species
julia> writemultinewick(trees, stdout) # edge lengths are in coalescent units for now
((B:1.0)root:2.0122684387410628,(A:1.0)root:2.0122684387410628);
((A:1.0)root:1.2942200969881332,(B:1.0)root:1.2942200969881332); Next we want to convert the lengths in gene trees from coalescent units to substitutions per sites. We first map edge numbers in the species network to substitution rates: julia> printnodes(net) # node number 1=A, 2=B
node leaf hybrid name i_cycle edges'numbers
1 true false A -1 1
2 true false B -1 2
-2 false false root -1 1 2
julia> printedges(net) # edge 1 goes to A, edge 2 goes to B
edge parent child length hybrid ismajor gamma containroot i_cycle
1 -2 1 1.000 false true 1 true -1
2 -2 2 1.000 false true 1 true -1
julia> PhyloCoalSimulations.get_rootedgenumber(net) # edge above the root = edge 3
3
julia> networkedge_rate = Dict( # map edge number to substitutions / coal units
1 => 0.06/1, # edge root -> A
2 => 0.10/1, # edge root -> B
3 => 0.08, # edge above the root
)
Dict{Int64, Float64} with 3 entries:
2 => 0.1
3 => 0.08
1 => 0.06 We can now scale each edge in each gene tree by the rate of the population this gene edge evolved in: julia> for genetree in trees
for e in genetree.edge
e.length *= networkedge_rate[population_mappedto(e)]
end
end
julia> writemultinewick(trees, stdout) # edge lengths in substitutions now
((B:0.1)root:0.16098147509928504,(A:0.06)root:0.16098147509928504);
((A:0.06)root:0.10353760775905066,(B:0.1)root:0.10353760775905066);
julia> removedegree2nodes!.(trees, true); # clean gene trees: removed degree-2 nodes that tracked speciation time
julia> writemultinewick(trees, stdout) # lengths still in number of substitutions
(B:0.26098147509928504,A:0.22098147509928504);
(A:0.16353760775905066,B:0.20353760775905066);
|
Yes, my conclusion was too general. I realize that the estimated topology and branch lengths can be used to simulate gene trees within a network or a tree. I was specifically referring to the issue of lineage-specific rate variation for simulating relaxed clock models. However, as you pointed out, tools like ClockstaRX can be used to simulate this within species trees. |
I would say that |
I need to simulate gene trees and sequence data based on an empirical network.
I understand that I must convert branch lengths from coalescent units to substitutions per site using a distribution of rate variation among species, as described in the manual. Can I estimate the rate variation directly from the empirical data to set the parameters for the log-normal distribution?
The empirical data consist of several nuclear loci and one mitochondrial locus. How should I account for differences in effective population size (Ne)? Should I simulate the mitochondrial gene tree separately and adjust the branch lengths in the network accordingly?
When evaluating the simulated data, I need to compare strict vs. relaxed clock models. I was planning to use simclock (github.com/dosreislab/simclock), which assumes a timetree. I believe the gene tree in coalescent units can serve as a timetree, as these units represent the number of generations. Would you recommend a different approach?
Thanks to the team for the PCS package!
The text was updated successfully, but these errors were encountered: