You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I analyzed targeted sequencing data of 100 PGx genes aligned to the hg19 reference genome using both BAM and VCF files in Aldy v4.4 (Python 3.9.7 on Linux 3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17). Discrepancies in star allele calls between BAM and VCF of the same sample were often understandable, attributed to the fact that alleles with CNVs could not be detected from VCF.
However, some differences could not be explained by this limitation.
For instance, in CYP2C19, sample that was called *1/*1 from the BAM file got *38/*38 from the VCF file.
BAM analysis: *1/*1
VCF analysis: *38/*38
Based on PharmVar:
*38 = wild-type (no variant)
*1 = rs3758581 (A>G)
In the genome browser, the BAM file of this sample showed homozygous G at this position. Since G is the reference allele in hg19, it wasn't shown as a variant in the VCF file.
The BAM file showed homozygous G, so it's logical that Aldy called *1/*1 for this sample. However, I'm confused about why the VCF file, which also represented homozygous G, didn't produce the same result?
By the way, I noticed an interesting difference between reference genomes. In GRCh38, the G allele at this position is considered the alternate allele, making it a variant in VCF file, unlike in hg19.
Could it be that the program expected the G allele of CYP2C19*1 to be recognized as a variant in the VCF file (like in GRCh38)? However, because it is a reference allele in hg19 and doesn't appear in the VCF, the program skipped it?
The VCF log file also indicates that the *1 allele was removed because rs3758581 wasn't present, resulting in it being called as *38/*38 instead.
Another gene with a similar discrepancy is CYP3A7.
Sample that was called *2/*2 when using the BAM file got *1/*1 from the VCF file.
BAM analysis: *2/*2
VCF analysis: *1/*1
However, sample that was called *1/*2 from the BAM file also got *1/*2 from the VCF file.
BAM analysis: *1/*2
VCF analysis: *1/*2
This suggested that *2 might not be detected in VCF files only when it is homozygous?
Based on PharmVar:
*1 = wild-type (no variant)
*2 = rs2257401 (G>C on positive strand)
In the genome browser, sample 392 with *2/*2 showed homozygous C at this position. Since C is the reference allele in hg19, it did not show up in the VCF.
Sample 394 with *1/*2 had both G and C at this position. Since G is the alternate allele in hg19, it was included in the VCF.
Similar to the previously mentioned CYP2C19 SNP, the C allele at this position is the reference allele in hg19, but is an alternate allele in GRCh38.
So, again, it seems that the program might anticipate the C allele of CYP3A7*2 to be in the VCF file (like in GRCh38), but because it is a reference allele in hg19 and was not present in the VCF, the program did not considered it and called *1 instead.
Can anyone please check if I've understood this correctly? Or are there other possible reasons for these differences in Aldy results from the BAM and VCF files?
Feel free to ask for more information, and thank you in advance.
Paravee
The text was updated successfully, but these errors were encountered:
Would it be possible to email me these samples (you can subsample them to the problematic regions) so that I can take a closer look at what's happening?
Dear Aldy Team,
I analyzed targeted sequencing data of 100 PGx genes aligned to the hg19 reference genome using both BAM and VCF files in Aldy v4.4 (Python 3.9.7 on Linux 3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17). Discrepancies in star allele calls between BAM and VCF of the same sample were often understandable, attributed to the fact that alleles with CNVs could not be detected from VCF.
However, some differences could not be explained by this limitation.
For instance, in CYP2C19, sample that was called *1/*1 from the BAM file got *38/*38 from the VCF file.
BAM analysis: *1/*1
VCF analysis: *38/*38
Based on PharmVar:
*38 = wild-type (no variant)
*1 = rs3758581 (A>G)
In the genome browser, the BAM file of this sample showed homozygous G at this position. Since G is the reference allele in hg19, it wasn't shown as a variant in the VCF file.
The BAM file showed homozygous G, so it's logical that Aldy called *1/*1 for this sample. However, I'm confused about why the VCF file, which also represented homozygous G, didn't produce the same result?
By the way, I noticed an interesting difference between reference genomes. In GRCh38, the G allele at this position is considered the alternate allele, making it a variant in VCF file, unlike in hg19.
Could it be that the program expected the G allele of CYP2C19*1 to be recognized as a variant in the VCF file (like in GRCh38)? However, because it is a reference allele in hg19 and doesn't appear in the VCF, the program skipped it?
The VCF log file also indicates that the *1 allele was removed because rs3758581 wasn't present, resulting in it being called as *38/*38 instead.
Another gene with a similar discrepancy is CYP3A7.
Sample that was called *2/*2 when using the BAM file got *1/*1 from the VCF file.
BAM analysis: *2/*2
VCF analysis: *1/*1
However, sample that was called *1/*2 from the BAM file also got *1/*2 from the VCF file.
BAM analysis: *1/*2
VCF analysis: *1/*2
This suggested that *2 might not be detected in VCF files only when it is homozygous?
Based on PharmVar:
*1 = wild-type (no variant)
*2 = rs2257401 (G>C on positive strand)
In the genome browser, sample 392 with *2/*2 showed homozygous C at this position. Since C is the reference allele in hg19, it did not show up in the VCF.
Sample 394 with *1/*2 had both G and C at this position. Since G is the alternate allele in hg19, it was included in the VCF.
Similar to the previously mentioned CYP2C19 SNP, the C allele at this position is the reference allele in hg19, but is an alternate allele in GRCh38.
So, again, it seems that the program might anticipate the C allele of CYP3A7*2 to be in the VCF file (like in GRCh38), but because it is a reference allele in hg19 and was not present in the VCF, the program did not considered it and called *1 instead.
Can anyone please check if I've understood this correctly? Or are there other possible reasons for these differences in Aldy results from the BAM and VCF files?
Feel free to ask for more information, and thank you in advance.
Paravee
The text was updated successfully, but these errors were encountered: