Clarify the counts in Beacon response #237
Here is our interpretation of what these counts mean.
We would like to know how to interpret these counts and validate our interpretation. Adding an example dataset (VCF) to the specification, and how to compute the counts on that dataset, would clarify this. |
This is great! We should all comb through it and then also move this to the Beacon website. To be clarified, e.g.:
|
@teemukataja The statement
is misleading. A dataset in the "variant dataset" sense is a combination of 1 -> n "callsets", i.e. outputs from experiments on the material of a biosample. So the correct term here would be callsets (as said above, even "biosamples" would be tricky, since several replicates could have been included...) Individuals is definitely wrong. "samples" kind of skirts the problem, since the word is weakly defined; but assuming a VCF-like structure, the "sample"s would correspond to "callset"s (at least in standard use). |
Hi guys! Look at this example extracted from the 1000 Genomes Project:
We have 2504 samples (NS) and AC=5,11 (there are 2 alternates). The VCF spec says:
This means you count the occurrences of each alternate allele index in the genotypes (e.g. the 1 in 0/1; if there is more than one alternate, genotypes such as 0/2 or 3/0 also appear). So, for the first alternate AC=5 (there are two 1| and three |1) and for the second alternate AC=11 (there are five |2 and six 2|). I remember discussing the mapping between the Beacon fields and the VCF fields, and somebody said (I couldn't find the conversation): frequency = AF, variantCount = AC, callCount = AN and sampleCount = NS. This may be true for AF and AC, but not for AN and NS. These are their definitions:
You can calculate it: for the first alternate, AF=0.000998403=5/5008 (=AC/AN). For the second alternate, AF=0.00219649=11/5008
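As a quick check of the arithmetic above, here is a minimal Python sketch. The values AC=5,11 and AN=5008 are taken from the record discussed in this comment; the variable names are only illustrative.

```python
# Values taken from the 1000 Genomes record discussed above.
AN = 5008       # total number of called alleles at this site
AC = [5, 11]    # allele count for each of the two alternates

# AF for each alternate is simply AC / AN.
AF = [ac / AN for ac in AC]

for ac, af in zip(AC, AF):
    print(f"AC={ac} AN={AN} AF={af:.6g}")
```

This reproduces the AF values quoted above (0.000998403 and 0.00219649) when rounded to six significant digits.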
Reading the documentation and checking real VCFs, you can deduce that AN > NS, and usually AN = 2*NS for diploid calls, but you can also calculate it: it's the number of called alleles across the genotypes (e.g. if we only have 3 samples and the genotypes are 0|0 .|. 1|0, AN would be 4 because we don't have information for the 2nd sample).
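The three-sample example above (0|0, .|., 1|0) can be sketched in Python as follows. This is only an illustration working on plain genotype strings, not on a real VCF parser; the function names `allele_number` and `allele_count` are hypothetical.

```python
import re

def split_alleles(genotype):
    """Split a GT string such as '0|1' or './.' into its allele tokens."""
    return re.split(r"[|/]", genotype)

def allele_number(genotypes):
    """AN: number of called alleles, i.e. every allele token that is not '.'."""
    return sum(a != "." for gt in genotypes for a in split_alleles(gt))

def allele_count(genotypes, alt_index):
    """AC for one alternate: occurrences of its index (1, 2, ...) in the genotypes."""
    target = str(alt_index)
    return sum(a == target for gt in genotypes for a in split_alleles(gt))

gts = ["0|0", ".|.", "1|0"]       # the three samples from the example above
print(allele_number(gts))         # 4: the second sample contributes nothing
print(allele_count(gts, 1))       # 1: a single copy of the first alternate
```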
So the NS definition cannot be used, because in the Beacon we need the number of samples with the matching allele (that is, samples which have something other than 0 in the genotype, e.g. 1|0 or 0|1; in our example we have 5 samples which show the first alternate and 11 samples which show the second alternate). I think the main clarification needed is what callCount means in BeaconDatasetAlleleResponse, as it does not match the VCF definition of AN and nobody seems to know what it really means. Also, I think that sampleCount does not mean the number of individuals, as you can have multiple samples from the same person (if it did, I think the definition would be "Total number of individuals in the dataset", but maybe this is what we really meant and we just need to fix the definition). |
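To make the difference between "samples with the matching allele" and AC concrete, here is a small Python sketch. The genotype list is made up for illustration and the function names are hypothetical. The two numbers coincide only when no sample is homozygous for the alternate, as in the record above.

```python
import re

def split_alleles(genotype):
    """Split a GT string such as '0|1' or './.' into its allele tokens."""
    return re.split(r"[|/]", genotype)

def samples_with_allele(genotypes, alt_index):
    """Number of samples whose genotype contains the given alternate at least once."""
    target = str(alt_index)
    return sum(target in split_alleles(gt) for gt in genotypes)

def allele_count(genotypes, alt_index):
    """AC: total copies of the given alternate across all genotypes."""
    target = str(alt_index)
    return sum(a == target for gt in genotypes for a in split_alleles(gt))

# Hypothetical genotypes: one homozygous carrier of alternate 1 (1|1).
gts = ["0|0", ".|.", "1|0", "1|1", "0|2"]

print(samples_with_allele(gts, 1), allele_count(gts, 1))  # 2 samples, 3 copies
print(samples_with_allele(gts, 2), allele_count(gts, 2))  # 1 sample, 1 copy
```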
what do we do in those cases? - also the case where there is no Most of these (edge) cases relate to VCF format quirks rather than the Beacon specification. I think we need to handle the frequency calculation clarification in a different issue, as we assumed that
also, thanks for the pointers and clarifications :) - much appreciated. |
Jordi and I have been discussing this. We think that some of the field names in We also think that the most useful meaning of |
Clarify what different counts actually mean in Beacon response.
Related to #105