Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancies between Aldy results from BAM vs VCF files (hg19 reference genome) #74

Open
paraveeown opened this issue Apr 6, 2024 · 3 comments

Comments

@paraveeown
Copy link

Dear Aldy Team,

I analyzed targeted sequencing data of 100 PGx genes aligned to the hg19 reference genome using both BAM and VCF files in Aldy v4.4 (Python 3.9.7 on Linux 3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17). Discrepancies in star allele calls between BAM and VCF of the same sample were often understandable, attributed to the fact that alleles with CNVs could not be detected from VCF.

However, some differences could not be explained by this limitation.

For instance, in CYP2C19, sample that was called *1/*1 from the BAM file got *38/*38 from the VCF file.

BAM analysis: *1/*1
image
VCF analysis: *38/*38
image

Based on PharmVar:
*38 = wild-type (no variant)
*1 = rs3758581 (A>G)

In the genome browser, the BAM file of this sample showed homozygous G at this position. Since G is the reference allele in hg19, it wasn't shown as a variant in the VCF file.
image

The BAM file showed homozygous G, so it's logical that Aldy called *1/*1 for this sample. However, I'm confused about why the VCF file, which also represented homozygous G, didn't produce the same result?

By the way, I noticed an interesting difference between reference genomes. In GRCh38, the G allele at this position is considered the alternate allele, making it a variant in VCF file, unlike in hg19.
image

Could it be that the program expected the G allele of CYP2C19*1 to be recognized as a variant in the VCF file (like in GRCh38)? However, because it is a reference allele in hg19 and doesn't appear in the VCF, the program skipped it?

The VCF log file also indicates that the *1 allele was removed because rs3758581 wasn't present, resulting in it being called as *38/*38 instead.
image


Another gene with a similar discrepancy is CYP3A7.
Sample that was called *2/*2 when using the BAM file got *1/*1 from the VCF file.

BAM analysis: *2/*2
image
VCF analysis: *1/*1
image

However, sample that was called *1/*2 from the BAM file also got *1/*2 from the VCF file.
BAM analysis: *1/*2
image
VCF analysis: *1/*2
image

This suggested that *2 might not be detected in VCF files only when it is homozygous?

Based on PharmVar:
*1 = wild-type (no variant)
*2 = rs2257401 (G>C on positive strand)

In the genome browser, sample 392 with *2/*2 showed homozygous C at this position. Since C is the reference allele in hg19, it did not show up in the VCF.
Sample 394 with *1/*2 had both G and C at this position. Since G is the alternate allele in hg19, it was included in the VCF.
image

Similar to the previously mentioned CYP2C19 SNP, the C allele at this position is the reference allele in hg19, but is an alternate allele in GRCh38.
image

So, again, it seems that the program might anticipate the C allele of CYP3A7*2 to be in the VCF file (like in GRCh38), but because it is a reference allele in hg19 and was not present in the VCF, the program did not considered it and called *1 instead.

Can anyone please check if I've understood this correctly? Or are there other possible reasons for these differences in Aldy results from the BAM and VCF files?

Feel free to ask for more information, and thank you in advance.

Paravee

@inumanag
Copy link
Collaborator

Hi @paraveeown : thank you for this issue!

Would it be possible to email me these samples (you can subsample them to the problematic regions) so that I can take a closer look at what's happening?

@paraveeown
Copy link
Author

Hello @inumanag. Thank you for responding! I'll prepare the data and send them to you soon.

@paraveeown
Copy link
Author

paraveeown commented May 20, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants