tree_sequence.genotype_matrix() contains non-segregating sites #1685
-
Hello! I would have thought that
running the above on my MacBook gives that I get this result running msprime==1.0.1 or msprime==1.0.0 on either my MacBook or on a linux server. Another (minor) issue is that on my MacBook, |
Replies: 7 comments 2 replies
-
Hi @jeffspence, thanks for posting your question. It is expected that the genotypes contain non-segregating sites, as the simulation can add both back-mutations and silent mutations (as mentioned at https://tskit.dev/msprime/docs/latest/api.html#msprime.sim_mutations). The number of non-segregating sites should be consistent with the model and parameters. As for differing results on different OS installations, this is something that has come up recently, and we think it is due to differing versions of the GSL library that provides the randomness. How did you install msprime on each machine: with pip, conda, or a local build?
-
Thanks for this. For For GSL: both were installed using pip. One uses GSL version 2.6 and the other uses GSL version 2.1.
-
Said another way - |
-
Perhaps there should be an option to return only segregating sites?
-
@petrelharp that would be nice! It would also be nice to get out just diallelic sites, since that's a fairly standard pre-processing step for real pop-gen data (i.e., throw away sites with anything other than 2 alleles). Both are fairly easy to do post hoc now, though, so it's no issue for me. I was just surprised by the non-segregating sites and wanted to make sure that it wasn't some sort of bug.
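For anyone landing here later, the post hoc filtering mentioned above can be sketched roughly as follows. This is a minimal sketch, not an msprime or tskit feature: `site_masks` is a hypothetical helper, the toy array stands in for the output of `genotype_matrix()` (one row per site, one column per sample), and it assumes there is no missing data in the matrix.

```python
import numpy as np


def site_masks(G):
    """Return boolean masks (segregating, biallelic) over the rows (sites) of G.

    G has shape (num_sites, num_samples) and holds allele indices.
    Hypothetical helper; assumes no missing data (no -1 entries).
    """
    G = np.asarray(G)
    # A site segregates if the samples do not all carry the same allele.
    segregating = (G != G[:, [0]]).any(axis=1)
    # Count the distinct alleles observed among the samples at each site.
    n_alleles = np.array([len(np.unique(row)) for row in G])
    biallelic = n_alleles == 2
    return segregating, biallelic


# Toy matrix standing in for a real genotype matrix: 4 sites x 4 samples.
G = np.array([
    [0, 1, 0, 1],  # segregating, two alleles
    [0, 0, 0, 0],  # monomorphic, e.g. a mutation undone by a back-mutation
    [0, 1, 2, 0],  # segregating, three alleles
    [1, 1, 1, 1],  # monomorphic for a derived allele
])

segregating, biallelic = site_masks(G)
G_seg = G[segregating]  # keeps rows 0 and 2
G_bi = G[biallelic]     # keeps row 0 only
```

The same masks can be used to subset an array of site positions, so the filtered matrix and its coordinates stay aligned.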
-
I agree, @petrelharp, it would be nice to have an option to return only segregating sites.
-
I agree that this is something we'd like to make easy for people. Maybe just a So - is it really a common use case to want only the genotype matrix without the associated positions? We have also floated the idea of adding a method that removes all monomorphic sites (although, |