Feature idea: calcNucleotideDiversity() #409
Comments
Hi Chase! As I wrote on slim-discuss, I find all the different definitions of pi/heterozygosity quite confusing, and I have to admit I glaze over a bit when people try to explain them to me – I'm just not very mathy. So before I even attempt to dig in to this, I'm going to do two things. (1) Ask you whether you've also looked at the doc for I am basically asking for somebody else to do the work to implement this new API as a complete, polished user-defined Eidos function, with doc; and I will add it to SLiM and to the manual once that design work is finished. :-> Maybe I'll also tag @mnavascues, since I suspect he might also have input on this. This is an area where I really prefer to lean on the community for help; the design of |
Hey Ben and all! Thanks so much for the feedback and ideas. I too find the mathy popgen head-spinning — I only have a special interest in π, partly because I want to model it for some shall we say atypical (e.g. low-N or high-diversity) parameter space. So thanks for entertaining these questions and ideas — even if nothing is changed in SLiM, this discussion is incredibly valuable to me! My understanding of Eidos is very weak, but here's my stab at (1): if I understand it correctly, Related, I wonder if expected heterozygosity might be called something other than π, e.g., h? In my experience π usually refers specifically to nucleotide diversity (without replacement version) while h (or some version of the letter) refers specifically to expected heterozygosity (with replacement version). Even though they're estimating the same population parameter, maybe it's worth distinguishing the two to avoid confusion? (This view might just be artifact of my academic tradition, but wanted to suggest in case it's useful!) A final thought, probably most relevant to nucleotide-based models: I wonder if it's worth considering a version of these functions that only considers differences by state? For example, suppose N = 100 and the ancestral genome has T at a site. Suppose the site mutates as T>C and later back-mutates as C>T, such that at a particular time there are:
The allele frequencies by state are 75% T and 25% C, and π = (75 * 25) / ([100² – 100] / 2) = 1875 / 4950 = 0.38. However, treating the alleles as distinct by descent (as current calculations do) would instead yield π = (50 * 25 + 50 * 25 + 25 * 25) / ([100² – 100] / 2) = 3125 / 4950 = 0.63. Just a thought! Thanks so much for this discussion!
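The by-state versus by-descent arithmetic in this comment can be checked with a short Python sketch. This is not SLiM or Eidos code; `pairwise_pi` is a hypothetical helper, and the class sizes of 50, 25 and 25 are an assumption inferred from the products in the calculation (two T lineages that are identical by state, plus one C lineage).

```python
from math import comb

def pairwise_pi(class_counts):
    """Fraction of unordered pairs of distinct sequences that differ at one site.

    class_counts holds the number of sequences in each allele class;
    sequences in the same class are treated as identical.
    """
    n = sum(class_counts)
    total_pairs = comb(n, 2)                              # (n^2 - n) / 2
    same_pairs = sum(comb(c, 2) for c in class_counts)    # pairs within a class
    return (total_pairs - same_pairs) / total_pairs

# Assumed counts: the two T lineages are merged when comparing by state.
by_state = pairwise_pi([75, 25])          # 1875 / 4950
# Assumed counts: each mutational lineage is its own class by descent.
by_descent = pairwise_pi([50, 25, 25])    # 3125 / 4950

print(round(by_state, 2), round(by_descent, 2))  # 0.38 0.63
```

Counting pairs within each class and subtracting from the total avoids looping over all 4950 pairs, which is one way a per-site calculation like this might be vectorized.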
Hi folks. I'm planning to roll SLiM 4.2 in perhaps a month. It'd be nice to do something about this issue, but as noted above, I need help with it. Tagging @philippmesser @petrelharp @mnavascues again to see if somebody wants to help with it; or @singing-scientist maybe you can take it across the finish line yourself?
I've assigned this to @philippmesser at his request. He will handle it, since he understands the issues involved and I frankly just glaze over with this stuff. Of course I'll be happy to help with optimizing Eidos code and such, as needed – or writing a C++ implementation, if it proves difficult to implement in Eidos for some reason. But how to document it, what to say about these different methods and how they differ and what symbol they use and when you should use this or that, etc., just is not in my wheelhouse at all. :-> So now I'll consider this Somebody Else's Problem, in the immortal words of Douglas Adams. :->
Hey @bhaller apologies for the delay — bear of a week! My understanding has not really advanced beyond what I wrote above. To summarize:
So glad @philippmesser is willing to take this on! If there is a way you'd like me to be of use (writing a paragraph?) please don't hesitate to let me know!
This will not make SLiM 4.2; time has run out. Pushing to the future.
Understood completely!
This Issue follows from the SLiM Google Discussion on "Burn-in nucleotide diversity".
The idea is for a function, say `calcNucleotideDiversity()`, that calculates nucleotide diversity (π) as the mean number of differences per site between randomly chosen sequences without replacement, as described by Ralph. This would stand in contradistinction to `calcHeterozygosity()`, which if I understand correctly:
I have drafted the following code to calculate π. It's meant to take the same input as `calcHeterozygosity()`, e.g., p1.genomes is passed as an argument, and could probably just be plugged into the "do the calculation" portion of that function (I did not understand the "windowing" portion so omitted that here). It's clearly much less efficient than h, but perhaps possible to improve, e.g., if loops can be vectorized?
For a quick toy example with some multiallelism, this resulted in a value (0.0505) similar to calcHeterozygosity (0.0510), but I haven't benchmarked further. Just an idea! Even if not incorporated, would love feedback on this code.
Thanks a bunch!
Chase
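For readers following along, the two definitions discussed in this issue can be contrasted in a minimal Python sketch. It is not the Eidos code drafted above, and it makes no claim about how `calcHeterozygosity()` is implemented in SLiM. The function names and the toy sequences are hypothetical, and alleles are compared by state only.

```python
from collections import Counter
from itertools import combinations

def nucleotide_diversity(seqs):
    """Mean differences per site over all unordered pairs of distinct
    sequences (sampling without replacement)."""
    n, length = len(seqs), len(seqs[0])
    diffs = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    return diffs / (n * (n - 1) / 2) / length

def expected_heterozygosity(seqs):
    """Per-site mean of 1 - sum(p_i^2), i.e. the chance that two draws
    with replacement differ at a site."""
    n, length = len(seqs), len(seqs[0])
    total = 0.0
    for column in zip(*seqs):                 # iterate over sites
        freqs = [c / n for c in Counter(column).values()]
        total += 1.0 - sum(f * f for f in freqs)
    return total / length

# Hypothetical toy data: four aligned sequences of four sites each.
toy = ["ACGT", "ACGA", "ATGA", "ACGT"]
pi = nucleotide_diversity(toy)
h = expected_heterozygosity(toy)
n = len(toy)
print(pi, h, h * n / (n - 1))
```

With these by-state definitions the two quantities differ only by the factor n / (n - 1), so the per-site version can be computed from allele counts instead of looping over every pair of sequences.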