Need unit tests for calc...() popgen functions #474

Open
bhaller opened this issue Sep 25, 2024 · 7 comments

Comments

@bhaller
Contributor

bhaller commented Sep 25, 2024

The need for unit tests for SLiM's popgen functions has been underlined by another discovery of a bug with them (https://groups.google.com/g/slim-discuss/c/Yacfk9EIYeU/m/bc72wVUzBAAJ). I'm not sure how to test them, though. I suppose a test could construct a population with known mutations, placed into the genomes at known positions/frequencies, and then test that the value calculated by the function matches the expected value calculated independently from first principles or by other software. If someone can supply me with a test scenario and an expected value, I can construct a corresponding SLiM test, but I don't have the knowledge necessary to come up with appropriate scenarios and expected values. These test scenarios wouldn't need to be large/complex; even a test with a genome of, say, ten base positions, with, say, five mutations present and four diploid individuals (eight genomes), would be quite sufficient to test that the math and logic are correct, I would think. It would be good to have such tests for all of the calc...() functions. Perhaps @npb596 or @petrelharp or @philippmesser could help me with this?
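
Just to make the shape of such a test concrete, something like this is what I imagine (a rough sketch only; the positions/frequencies are invented, I'm assuming calcPi() reports per-site diversity, and the comparison here is against a naive in-script recomputation rather than an externally supplied expected value):

// Sketch only: 4 diploids (8 genomes), 10 base positions, 5 mutations placed by hand.
// Assumes calcPi() reports per-site pi; if it reports a total instead, drop the "/ L" below.
initialize() {
    initializeMutationRate(0.0);                 // no random mutations; we place them manually
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 9);          // 10 base positions
    initializeRecombinationRate(0.0);
}
1 early() {
    sim.addSubpop("p1", 4);                      // 4 diploid individuals
}
1 late() {
    g = p1.genomes;                              // 8 genomes

    // five mutations at known positions/frequencies (values invented for illustration)
    mutA = g[0].addNewDrawnMutation(m1, 1); g[c(1,2,3)].addMutations(mutA);   // frequency 4/8
    g[0].addNewDrawnMutation(m1, 3);                                          // frequency 1/8
    mutC = g[2].addNewDrawnMutation(m1, 5); g[c(3,4)].addMutations(mutC);     // frequency 3/8
    mutD = g[5].addNewDrawnMutation(m1, 7); g[6].addMutations(mutD);          // frequency 2/8
    g[7].addNewDrawnMutation(m1, 9);                                          // frequency 1/8

    // naive pi from the definition: mean pairwise differences between genomes, per site
    n = size(g);
    L = 10;
    total = 0;
    for (i in 0:(n-2))
        for (j in (i+1):(n-1))
            total = total + size(setSymmetricDifference(g[i].mutations, g[j].mutations));
    pi_naive = total / (n * (n - 1) / 2) / L;

    pi_slim = calcPi(g);
    if (abs(pi_naive - pi_slim) > 1e-12)
        stop("calcPi() mismatch: " + pi_naive + " vs " + pi_slim);
    catn("calcPi() OK: " + pi_slim);
}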

@bhaller
Contributor Author

bhaller commented Sep 25, 2024

This could take the form of a VCF file and an expected value. My SLiM test could simply load the VCF and check for a match (within reasonable numerical tolerance) to the expected value.

@petrelharp
Collaborator

My recommendation is to not do expected values from theory (if that's what you meant); instead compare to the value calculated independently - either by other software or by a separate, first-principles implementation.

I don't want to take this on right now though - maybe a good student project?

@bhaller
Contributor Author

bhaller commented Sep 25, 2024

OK. Why not expected values from theory?

@petrelharp
Collaborator

Because that is so much more complicated - you have to worry about statistical power; how close is "close enough"; etcetera. That sort of thing is good for validation, but not so good for unit tests (for one thing, you end up having to run a lot of simulations to make sure). What we do in tskit, for instance, is usually just pull up the definition of the thing, then code up some real simple implementation that doesn't worry about efficiency, and compare to that. msprime does have a whole validation.py script that does statistical comparisons to other simulation software, but that's a much messier thing.
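
For instance (a sketch only, not tested; it assumes calcWattersonsTheta() is per-site and that the Eidos set/containment functions behave as named), a naive Watterson's theta written straight from the definition could live alongside the model and be compared to the built-in:

// Sketch: naive per-site Watterson's theta for a vector of genomes g over a chromosome of
// length L, written directly from the definition theta_W = S / a_n (assumes n >= 2).
function (float$)naiveWattersonsTheta(o<Genome> g, integer$ L)
{
    n = size(g);

    // S = number of mutations segregating within the sample (0 < sample count < n);
    // with multiple mutations at one position you would count unique positions instead
    S = 0;
    for (mut in unique(g.mutations))
    {
        k = 0;
        for (gen in g)
            if (gen.containsMutations(mut))
                k = k + 1;
        if (k > 0 & k < n)
            S = S + 1;
    }

    a_n = sum(1 / (1:(n - 1)));    // harmonic number: sum of 1/i for i in 1..n-1
    return S / a_n / L;
}

// usage, e.g. in a late() event:
//    g = p1.genomes;
//    if (abs(naiveWattersonsTheta(g, 100000) - calcWattersonsTheta(g)) > 1e-12)
//        stop("calcWattersonsTheta() mismatch");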

@bhaller
Contributor Author

bhaller commented Sep 25, 2024

> Because that is so much more complicated - you have to worry about statistical power; how close is "close enough"; etcetera. That sort of thing is good for validation, but not so good for unit tests (for one thing, you end up having to run a lot of simulations to make sure). What we do in tskit, for instance, is usually just pull up the definition of the thing, then code up some real simple implementation that doesn't worry about efficiency, and compare to that. msprime does have a whole validation.py script that does statistical comparisons to other simulation software, but that's a much messier thing.

Aha, I see. Yes, there are certainly problems with doing statistical tests for validation. SLiM already does tons of them, though. But if a precise comparison to the "right answer" is possible, that's certainly better!

@npb596

npb596 commented Jan 14, 2025

Sorry for not responding to this until now. I saw it at the time but knew I'd be rather busy for the rest of that term, and I was, so I left it alone. I agree that comparison to other calculated values makes sense and sounds relatively doable. I personally don't know what computing expected values from theory would entail. I'm responding now because I did write a script with a simple sort of test. I did tests like it when writing the functions but never had a clean version to show until now. I'm attaching the script as a text file so GitHub will accept it, but it should really have a .slim suffix.

Test_PopGen_Functions.txt

It will run a simple neutral simulation with either a population contraction or expansion (one should be commented out at a time) and plot Tajima's D against time (in ticks). The basic qualitative expectation is that a contraction should increase Tajima's D and an expansion should decrease it. The change in pop size occurs halfway along the x-axis in the plot, and it does seem to exhibit this pattern. Obviously there's also random variation in Tajima's D, so sometimes the change halfway isn't obvious. I assume increasing pop size would reduce the variation, but the simulation takes about a minute as is, so I didn't want to push parameters too much. There are comments with the specific parameters in the script. Also, I ran it on the current release build, so I define calcTajimasD2 and calcPi2 explicitly in there, but I took them straight from slim_functions.cpp, so they should match what's supposed to be built in, barring the current bug you noted.

The above could also be used to plot calcPi and calcWattersonsTheta. The changes in Tajima's D should correspond to changes in these. Also, under neutrality with constant pop size, both of those should fluctuate around $4N\mu$, where $N$ is the pop size and $\mu$ is the mutation rate. Tajima's D should fluctuate around 0. Checking those was something else I did in early testing, where it seemed fine, but it should be checked again. I guess you could incorporate selection to see if Tajima's D matches expectations there, but that gets a bit more complicated.
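
For reference, the standard definitions and neutral expectations being leaned on here (per-site forms; SLiM's exact normalization should be checked against the manual), with $n$ sampled genomes, $L$ sites, $S$ segregating sites, and derived-allele frequency $p_i$ at site $i$:

$$\hat\theta_\pi = \frac{1}{L}\cdot\frac{n}{n-1}\sum_i 2\,p_i(1-p_i), \qquad \hat\theta_W = \frac{S}{a_n L}, \quad a_n = \sum_{k=1}^{n-1}\frac{1}{k}, \qquad D = \frac{\hat\theta_\pi - \hat\theta_W}{\sqrt{\widehat{\mathrm{Var}}\left(\hat\theta_\pi - \hat\theta_W\right)}}$$

Under neutrality at mutation-drift equilibrium in a diploid Wright-Fisher population, $E[\hat\theta_\pi] \approx E[\hat\theta_W] \approx 4N\mu$ per site, and $D$ should fluctuate around 0.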

Regarding comparisons to inference programs: SLiM can output VCFs, so I assume those can be analyzed directly with something like VCFtools, which computes pi and Tajima's D. So someone could just compute the stats within SLiM at the time a VCF is output, and run VCFtools (or other programs) on that same VCF and see if they are about the same, right? For a different project I did some comparisons between software and used a dummy VCF I made matching a textbook example. I believe it was an alignment given in figure 3.1 of Matt Hahn's Molecular Population Genetics. I don't have the book on hand. The textbook does the pencil-and-paper pi and Watterson's theta calculations on that alignment, so I assume that would work as something with known mutations at known positions/frequencies. I could dig around for the dummy VCF.
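
The VCFtools cross-check could be as simple as computing the statistics on exactly the genomes that get written out, something like this (a sketch; the tick, sample size, path, and outputVCF()/calc...() argument details are assumptions to check against the manual):

// Sketch: write a VCF for a sample of genomes and print the same statistics from SLiM,
// so the identical file can be fed to VCFtools (or other software) and the numbers compared.
10000 late() {
    g = sample(p1.genomes, 20);
    g.outputVCF(filePath="/tmp/slim_popgen_check.vcf");
    catn("calcPi:              " + calcPi(g));
    catn("calcWattersonsTheta: " + calcWattersonsTheta(g));
    catn("calcTajimasD:        " + calcTajimasD(g));
}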

Some of that I could probably do soon if it'd be of interest. I'd be happy to let someone else do it too but I'm assuming no one's biting if the post has been up this long.

@bhaller
Contributor Author

bhaller commented Jan 14, 2025

This sounds great, @npb596. I'm rather buried in work right now – I'm extending SLiM to support multiple chromosomes, which has turned out to be rather more work than I expected :-O. It would be wonderful if you could run with this issue! :-> I'd suggest two things:

(1) I ran your test script, and it does seem good. For a test like this, I think it is fine for it to take a fairly long time to run, so I would suggest making it bigger to cut down on the noise and make the pattern more clearly visible. If it takes, say, an hour or even eight hours to run, I think that's fine; it won't be run often, but when it is run, it should provide a high-quality, rigorous test where a problem is clearly visible. I will run it manually before each release, along with other manual tests that I use from time to time. Maybe the scale (pop size, if that is the right place to scale the test up) should be a defined constant at the top, with a comment about what scale is suggested to produce a good plot without taking weeks/months. :-> It'd be great if you added the other calc functions to it, as you say.

I would suggest that a good final resting place for this would be a new folder in the SLiM-Extras repository called tests. I've got a bunch of (totally unrelated) manual test scripts that I could contribute to that folder as well; it could be a place for non-trivial test scripts to go, which would be useful to everybody for running their own local tests, debugging problems, deriving related tests of things they want to try, etc. That would be really great; it's about time I got more organized on this front myself!

Besides putting a final, polished test script in that new tests folder, a _README.txt file that discusses the test and says what the correct results ought to look like – similar to what you wrote above – would be great. Maybe the plot can have horizontal lines added, at the positions where the statistics are expected to equilibrate for the model? And a legend? (You'll get some practice with SLiM's custom plotting capabilities! :->) And a defineConstant() call at the top that defines a logical constant that chooses between expansion/contraction for the run (a rough sketch of what I mean is below)? And remove the code for the calc functions, of course, to let it use the built-in functions; with the 4.3 release that will be wrong due to my screw-up, but with the GitHub head it should be correct; that commit went in September 24th of last year.

All of that wrapped up in a PR on SLiM-Extras would be wonderful. Please be sure to credit yourself for the work, in the script(s) and the README. :->
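
To be concrete about the constants I mean, something like this (names, tick, and sizes are placeholders, not taken from your script):

// Sketch of the suggested configuration constants (all names/values invented for illustration):
initialize() {
    defineConstant("SCALE", 1);            // population-size multiplier; larger gives a cleaner plot
    defineConstant("CONTRACTION", T);      // T = contraction scenario, F = expansion scenario
    // ... the rest of the existing initialize() callback ...
}

// at the tick where the size change happens in the existing script:
80000 early() {
    if (CONTRACTION)
        p1.setSubpopulationSize(asInteger(2000 * SCALE / 4));
    else
        p1.setSubpopulationSize(asInteger(2000 * SCALE * 4));
}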

(By the way, the graph I got looks like this:

[attached plot: Tajima's D over time for the run]

From the way the line is going upward, I'm guessing the model had not yet reached equilibrium when the bottleneck started? But the drop caused by the expansion is certainly very visible! I guess it would be good for the model to run to equilibrium first, so that the height can actually be compared to an expected value? I'm a bit surprised it hadn't equilibrated, though, since the pop size is 2000 and the expansion happens at 80000, 40*N. This is the kind of thing where I'm just out of my depth and not sure what to think. :-O)

(2) Then, on the other hand, it'd be great to have actual "unit tests" of these functions, meaning very small, fast, deterministic tests that run every time SLiM does a self-test. So these should take just a small fraction of a second. What you wrote about VCFtools and Matt Hahn's book seems helpful. In the end, what we'd want is an input VCF file that is quite small (a couple of pages max, ideally smaller?), so that I can embed it right into SLiM's code as a string, and then the unit test can read the VCF in, use the calc functions to calculate values, and compare them to the expected values from Hahn or VCFtools or both. (Probably some tolerance for the comparison would be needed due to numerical error – if (abs(OBSERVED - EXPECTED) > EPSILON) stop("test failed!");, that sort of thing.) If you write that as a standalone test model, I will do the work of putting it into SLiM's code to run as part of its automatic unit tests.
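
Something like the following is the overall shape I have in mind (a sketch only: the VCF lines and EXPECTED_PI are placeholders rather than real data, and the writeTempFile()/readFromVCF() usage should be double-checked against the manual):

// Sketch: embed a tiny VCF as a string, load it, and compare calcPi() to a known value.
// (VCF columns are tab-delimited, hence the \t escapes; samples i0..i3 give 8 genomes.)
initialize() {
    initializeMutationRate(0.0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 9);
    initializeRecombinationRate(0.0);

    // placeholder VCF; replace with the Hahn-based / VCFtools-checked file
    defineConstant("VCF_LINES", c(
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ti0\ti1\ti2\ti3",
        "1\t3\t.\tA\tT\t.\tPASS\t.\tGT\t0|1\t0|0\t1|1\t0|0",
        "1\t7\t.\tC\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t0|0\t0|0"));
    defineConstant("EXPECTED_PI", 0.0);    // placeholder; fill in the independently computed value
    defineConstant("EPSILON", 1e-9);
}
1 early() {
    sim.addSubpop("p1", 4);                // 4 diploids to receive the 4 VCF samples
}
1 late() {
    path = writeTempFile("popgen_unit_", ".vcf", VCF_LINES);
    p1.genomes.readFromVCF(path, m1);

    observed = calcPi(p1.genomes);
    if (abs(observed - EXPECTED_PI) > EPSILON)
        stop("calcPi() test failed: " + observed + " != " + EXPECTED_PI);
    catn("calcPi() test passed: " + observed);
}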

Anyhow, those are my ideas; please do whatever you think is good and have time to do. :-> I greatly appreciate the help, since this is an area where I don't really know what I'm doing, what constitutes a good test for these functions, how to calculate the expected values, etc. :-> Thanks again!
