Understanding Genotype output in the VCF file generated using Pandora map and compare #317
Comments
Hi there.
Thanks for the explanation regarding the presence of '.' in the GT. It is a bit clearer to me now. If I understand correctly, I can simply count those cases as SNPs not present in the sample, since my downstream analysis needs to know whether each SNP is present or absent in a sample. Further, I tried to use
I would recommend the default; it is the default for a good reason. Local was our first attempt, but it is hard to ensure that local genotyping results in globally consistent genotypes.
If you're happy to share data, @leoisl and I would be interested to see examples where one approach worked and the other did not!
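To collect such examples, one option is to list the sites where the two VCFs disagree. The following is only a minimal sketch, not part of pandora: it assumes both VCFs are single-sample, tab-separated, and describe the same sites with identical CHROM, POS, REF and ALT values (which may not hold if the two runs use different reference paths). The function names and the toy records are hypothetical.

```python
import io


def read_genotypes(handle, sample_index=0):
    """Map (CHROM, POS, REF, ALT) -> GT string for one sample column."""
    calls = {}
    for line in handle:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # FORMAT (column 9) tells us where GT sits inside the sample column.
        format_keys = fields[8].split(":")
        sample_values = fields[9 + sample_index].split(":")
        calls[(chrom, pos, ref, alt)] = sample_values[format_keys.index("GT")]
    return calls


def discordant_sites(global_calls, local_calls):
    """Return (site, global GT, local GT) for shared sites whose GT differs."""
    shared = global_calls.keys() & local_calls.keys()
    return sorted(
        (site, global_calls[site], local_calls[site])
        for site in shared
        if global_calls[site] != local_calls[site]
    )


# Made-up records for illustration only (not real pandora output).
GLOBAL_VCF = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n"
    "geneA\t10\t.\tA\tG\t.\t.\t.\tGT:GT_CONF\t.:4.0\n"
    "geneA\t25\t.\tC\tT\t.\t.\t.\tGT:GT_CONF\t0:12.5\n"
)
LOCAL_VCF = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n"
    "geneA\t10\t.\tA\tG\t.\t.\t.\tGT:GT_CONF\t1:4.0\n"
    "geneA\t25\t.\tC\tT\t.\t.\t.\tGT:GT_CONF\t0:12.5\n"
)

if __name__ == "__main__":
    global_calls = read_genotypes(io.StringIO(GLOBAL_VCF))
    local_calls = read_genotypes(io.StringIO(LOCAL_VCF))
    for site, global_gt, local_gt in discordant_sites(global_calls, local_calls):
        print(site, "global:", global_gt, "local:", local_gt)
```

With real files, the `io.StringIO(...)` handles would be replaced by `open(path)` on the two VCFs; the resulting list gives concrete sites to check against the alignments.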
Hello, sorry, I was on holiday and couldn't reply earlier. For point 2, I have nothing to add beyond what @iqbal-lab already said. For point 1, I can elaborate a bit:
This is indeed the intended behaviour. Complementing what @iqbal-lab said in previous answers, by using the default global approach, We set global genotyping as default because that is what worked better in our data for the |
Hi @leoisl, thanks for the nice and detailed explanation. The difference between the two approaches is a bit clearer to me now. It would be great if there were documentation on both approaches where I could read about them in more detail. Anyway, in your last comment you suggested that, for the E. coli dataset used in the paper, the global approach worked better than the local approach. I am also using E. coli data in my work, and I have more than 1800 samples. So, I was trying to use
I am using pandora map to align reads against the pangenome reference and call variants with the --genotype option. I ran pandora map with the default ML path-oriented (global) approach and also with the --local option (coverage-oriented (local) genotyping), then compared the VCF files produced by the two approaches. I am facing two issues:

1. With the default global approach, I never get a '1' in the genotype (GT) column (there is always either '.' or '0' for each SNP), even though the genotype confidence was around 4. For some of the SNPs that have '.' in the GT column under the global approach, local genotyping gives '1'. I went back and checked the alignment for that gene, and there is in fact a SNP at the location where the local approach reports one, while the global approach reports '.'. So I am not sure which approach to use for calling variants: global (the default) or local?

2. The second question concerns the '.' in the GT column of the VCF file. I assumed that the genotype confidence (GT_CONF) was too low for the software to call whether the SNP is present or absent in the sample, since by default the minimum GT_CONF required to make a call is 1. I therefore lowered the minimum to 0 by setting the --gt-conf parameter to 0, but there are still many SNPs with '.' in the GT column. Is there any way to get only '0' or '1' in the GT column, indicating that the SNP is absent or present in the sample, respectively? My ultimate aim is to use pandora compare to call variants across multiple samples, but when I ran it with the default global approach there were many '.' entries in the GT column (GT_CONF was 0). For some of these that I checked at random, the SNP is clearly either present or absent in the sample, yet the software still outputs '.' instead of '0' or '1'.

It would be great if someone could clear up these doubts for me.
P.S.: I am still using pandora version 0.9.1 (although 0.9.2 is now available) because it is the latest version available through conda install.
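For the downstream presence/absence analysis described above, a small script can tally how often each GT value occurs and map GT to a presence flag. This is only a sketch under stated assumptions: the VCF is tab-separated with haploid GT values ('0', a non-zero allele index, or '.'), and any non-zero allele counts as "present". Whether '.' should be treated as absent or kept as unknown is an analysis choice, so it is left as a switch. The function names and toy records are hypothetical, not pandora output.

```python
import io
from collections import Counter


def genotype_rows(handle):
    """Yield ((CHROM, POS), [GT per sample]) for each VCF record."""
    for line in handle:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")
        gts = [sample.split(":")[gt_index] for sample in fields[9:]]
        yield (fields[0], int(fields[1])), gts


def to_presence(gt, missing_as_absent=False):
    """Map a haploid GT to 1 (alt allele), 0 (ref allele) or None (no call)."""
    if gt == ".":
        # Assumption: counting '.' as absent is optional; default keeps it unknown.
        return 0 if missing_as_absent else None
    return 0 if gt == "0" else 1


def tally_genotypes(handle):
    """Count how often each GT value occurs across all sites and samples."""
    counts = Counter()
    for _site, gts in genotype_rows(handle):
        counts.update(gts)
    return counts


# Made-up two-sample records for illustration only.
TOY_VCF = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n"
    "geneA\t10\t.\tA\tG\t.\t.\t.\tGT:GT_CONF\t0:8.1\t.:0.0\n"
    "geneB\t42\t.\tC\tT\t.\t.\t.\tGT:GT_CONF\t1:15.3\t0:9.7\n"
)

if __name__ == "__main__":
    print(tally_genotypes(io.StringIO(TOY_VCF)))
    for site, gts in genotype_rows(io.StringIO(TOY_VCF)):
        print(site, [to_presence(gt, missing_as_absent=True) for gt in gts])
```

Keeping '.' as None by default makes it easy to report how many calls were missing before deciding to fold them into "absent".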