Feature idea: calcNucleotideDiversity() #409

singing-scientist · 2023-11-17T07:29:29Z

This Issue follows from the SLiM Google Discussion on "Burn-in nucleotide diversity".

The idea is for a function, say calcNucleotideDiversity(), that calculates nucleotide diversity (π) as the mean number of differences per site between randomly chosen sequences without replacement, as described by Ralph. This would stand in contradistinction to calcHeterozygosity(), which if I understand correctly:

calculates the mean number of differences per site between randomly chosen sequences with replacement, i.e., expected heterozygosity (h), which can diverge from expectation for low N settings
does not provide the expected value for multiallelic (i.e., >1 mutation) sites, which can diverge from expectation for high diversity settings

I have drafted the following code to calculate π. It's meant to take the same input as calcHeterozygosity(), e.g., p1.genomes is passed as an argument, and could probably just be plugged into the "do the calculation" portion of that function (I did not understand the "windowing" portion so omitted that here). It's clearly much less efficient than h, but perhaps possible to improve, e.g., if loops can be vectorized?

// input p1.genomes
genomes = p1.genomes;

// obtain genome length
length = species.chromosome.lastPosition + 1;

// obtain number of non-NULL genomes in the population
num_genomes = 0;
for (this_genome in genomes)
{
	if (!isNULL(this_genome))
	{
		num_genomes = num_genomes + 1;
	}
}

// obtain positions with at least one mutation
muts = species.mutations;
mpos = muts.position;

// initialize pi numerator for whole genome
pi_numerator = 0;

// loop unique mutant positions
for (this_pos in sort(unique(mpos)))
{
	this_pos_muts = muts[mpos == this_pos];
	this_pos_muts_counts = genomes.mutationCountsInGenomes(this_pos_muts);
	this_pos_muts_counts_sum = sum(this_pos_muts_counts);
	this_pos_muts_counts_rem = num_genomes - this_pos_muts_counts_sum;
	
	// count all unique alleles (including non-mutant if it exists) at this position
	this_pos_allele_counts = c(this_pos_muts_counts, this_pos_muts_counts_rem);
	
	// count pairwise diffs at this position
	diffs = 0;
	
	for (i in seq(0, length(this_pos_allele_counts) - 1))
	{
		j = i + 1;
		while (j < length(this_pos_allele_counts))
		{
			this_diffs = this_pos_allele_counts[i] * this_pos_allele_counts[j];
			diffs = diffs + this_diffs;
			j = j + 1;
		}
	}
	
	// calculate pi value at this position
	comps = (num_genomes^2 - num_genomes) / 2;
	this_pos_pi = diffs / comps;
	
	// add this position's pi to numerator
	pi_numerator = pi_numerator + this_pos_pi;
}

// calculate pi for whole genome length, i.e., mean pairwise differences per site
nucleotideDiversity = pi_numerator / length;

For a quick toy example with some multiallelism, this resulted in a value (0.0505) similar to calcHeterozygosity (0.0510), but I haven't benchmarked further. Just an idea! Even if not incorporated, would love feedback on this code.

Thanks a bunch!
Chase

The text was updated successfully, but these errors were encountered:

bhaller · 2023-11-17T14:52:26Z

Hi Chase! As I wrote on slim-discuss, I find all the different definitions of pi/heterozygosity quite confusing, and I have to admit I glaze over a bit when people try to explain them to me – I'm just not very mathy. So before I even attempt to dig in to this, I'm going to do two things. (1) Ask you whether you've also looked at the doc for calcPairHeterozygosity(), and whether it perhaps does what you want? (2) Tag @petrelharp and @philippmesser, who actually understand this stuff, to get their input on this. Specifically: is there a need for the new API that Chase proposes (in addition to the two functions SLiM already provides)? What should it be named? Does Chase's code to compute it look reasonable, or is there a better approach? (And windowing ought to be added back in to its implementation also.) What should the documentation for that new method say, exactly, to make it clear how it differs from the two existing functions?

I am basically asking for somebody else to do the work to implement this new API as a complete, polished user-defined Eidos function, with doc; and I will add it to SLiM and to the manual once that design work is finished. :-> Maybe I'll also tag @mnavascues, since I suspect he might also have input on this. This is an area where I really prefer to lean on the community for help; the design of calcHeterozygosity() and calcFST() both went around and around for many iterations, because I'm clueless about this kind of thing, and it was a bit frustrating. So – any takers? :->

singing-scientist · 2023-11-19T06:12:51Z

Hey Ben and all! Thanks so much for the feedback and ideas. I too find the mathy popgen head-spinning — I only have a special interest in π, partly because I want to model it for some shall we say atypical (e.g. low-N or high-diversity) parameter space. So thanks for entertaining these questions and ideas — even if nothing is changed in SLiM, this discussion is incredibly valuable to me!

My understanding of Eidos is very weak, but here's my stab at (1): if I understand it correctly, calcPairHeterozygosity() with infiniteSites = F would yield the 'actual' (with replacement) value of π, IF you ran it for all (n² – n) / 2 pairs of genomes in the population and took the average. This would work because two genomes can have at most two distinct alleles, i.e., multiallelism at the site wouldn't cause problems like it does using calcHeterozygosity(). My gut says all those function calls would be much less efficient than what was proposed above, but not sure. However, maybe it does obviate the utility of a new function.

Related, I wonder if expected heterozygosity might be called something other than π, e.g., h? In my experience π usually refers specifically to nucleotide diversity (without replacement version) while h (or some version of the letter) refers specifically to expected heterozygosity (with replacement version). Even though they're estimating the same population parameter, maybe it's worth distinguishing the two to avoid confusion? (This view might just be artifact of my academic tradition, but wanted to suggest in case it's useful!)

A final thought, probably most relevant to nucleotide-based models: I wonder if it's worth considering a version of these functions that only considers differences by state? For example, suppose N = 100 and the ancestral genome has T at a site. Suppose the site mutates as T>C and later back-mutates as C>T, such that at a particular time there are:

50 T_anc
25 C_der
25 T_der

The allele frequencies by state are 75% T and 25% C, and π = (75 * 25) / ([100² – 100] / 2) = 1875 / 4950 = 0.38. However, treating the alleles as distinct by descent (as current calculations do) would instead yield π = (50 * 25 + 50 * 25 + 25 * 25) / ([100² – 100] / 2) = 3125 / 4950 = 0.63. Just a thought!

Thanks so much for this discussion!
Chase

bhaller · 2024-02-20T17:43:09Z

Hi folks. I'm planning to roll SLiM 4.2 in perhaps a month. It'd be nice to do something about this issue, but as noted above, I need help with it. Tagging @philippmesser @petrelharp @mnavascues again to see if somebody wants to help with it; or @singing-scientist maybe you can take it across the finish line yourself?

bhaller · 2024-02-22T18:02:20Z

I've assigned this to @philippmesser at his request. He will handle it, since he understands the issues involved and I frankly just glaze over with this stuff. Of course I'll be happy to help with optimizing Eidos code and such, as needed – or writing a C++ implementation, if it proves difficult to implement in Eidos for some reason. But how to document it, what to say about these different methods and how they differ and what symbol they use and when you should use this or that, etc., just is not in my wheelhouse at all. :-> So now I'll consider this Somebody Else's Problem, in the immortal words of Douglas Adams. :->

singing-scientist · 2024-02-22T20:17:37Z

Hey @bhaller apologies for the delay — bear of a week! My understanding has not really advanced beyond what I wrote above. To summarize:

I feel it would be valuable to distinguish h = heterozygosity (with replacement calculation) vs. π = nucleotide diversity (without replacement calculation); if I understand correctly, at present the term 'nucleotide diversity' is actually used to refer to h in the manual and this may be a source of confusion
I feel fairly confident in the accuracy of my code for a proposed calcNucleotideDiversity() for calculating π, but I'm not sure how computationally in/efficient it would be in practice. A possible alternative might be to write a version that effectively calls calc**Pair**Heterozygosity for all pairs of genomes
least important but probably quite useful, especially in nucleotide-based models, I think it would be valuable to be able to calculate π where alleles are considered by state rather than by descent, e.g., all the T nucleotides at a site are considered the same allele in the calculation, even if some of them are ancestral and others were derived via a reversion-like T>N>T sequence of events

So glad @philippmesser is willing to take this on! If there is a way you'd like me be of use (writing a paragraph?) please don't hesitate to let me know!

bhaller · 2024-03-21T00:53:48Z

This will not make SLiM 4.2; time has run out. Pushing to the future.

singing-scientist · 2024-03-21T04:56:09Z

Understood completely!

bhaller added enhancement help wanted labels Nov 17, 2023

bhaller assigned philippmesser Feb 22, 2024

bhaller added the long-term label Mar 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature idea: calcNucleotideDiversity() #409

Feature idea: calcNucleotideDiversity() #409

singing-scientist commented Nov 17, 2023

bhaller commented Nov 17, 2023

singing-scientist commented Nov 19, 2023

bhaller commented Feb 20, 2024

bhaller commented Feb 22, 2024

singing-scientist commented Feb 22, 2024

bhaller commented Mar 21, 2024

singing-scientist commented Mar 21, 2024

Feature idea: calcNucleotideDiversity() #409

Feature idea: calcNucleotideDiversity() #409

Comments

singing-scientist commented Nov 17, 2023

bhaller commented Nov 17, 2023

singing-scientist commented Nov 19, 2023

bhaller commented Feb 20, 2024

bhaller commented Feb 22, 2024

singing-scientist commented Feb 22, 2024

bhaller commented Mar 21, 2024

singing-scientist commented Mar 21, 2024