Invalid VCF according to VCF specification #261

Open
aligogon opened this issue Oct 9, 2024 · 2 comments

aligogon commented Oct 9, 2024

Hi @lgmgeo !
After generating a VCF with AnnotSV, I ran the EBI VCF validator (https://github.com/EBIvariation/vcf-validator) on it and got the following errors:

According to the VCF specification, the input file is not valid
Error: Metadata ID contains a character different from alphanumeric, dot, underscore and dash. This occurs 7 time(s), first time in line 19.
Error: INFO ACMG_class does not match the meta specification Number=1 (expected 1 value(s)). This occurs 6 time(s), first time in line 349.

I fixed the first error by changing the following IDs in the header:

##INFO=<ID='SampleID'
##INFO=<ID=Compound_htz(sample)
##INFO=<ID=Count_hom(sample)
##INFO=<ID=Count_htz(sample)
##INFO=<ID=Count_htz/allHom(sample)
##INFO=<ID=Count_htz/total(cohort)
##INFO=<ID=Count_total(cohort)

Into these:

##INFO=<ID=SampleID
##INFO=<ID=Compound_htz_sample
##INFO=<ID=Count_hom_sample
##INFO=<ID=Count_htz_sample
##INFO=<ID=Count_htz-allHom_sample
##INFO=<ID=Count_htz-total_cohort
##INFO=<ID=Count_total_cohort
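
The renaming above follows a simple pattern: surrounding quotes are dropped, "/" becomes "-", "(" becomes "_", and ")" is removed. Here is a minimal Python sketch of that mapping. The helper names `sanitize_info_id` and `fix_header_line` are made up for illustration and are not part of AnnotSV or variantconvert. The allowed character set is taken from the validator message (alphanumeric, dot, underscore and dash). The matching INFO keys in the data lines would presumably need the same renaming, which is not shown here.

```python
import re

# Captures the ID value of an INFO header line (everything up to "," or ">").
INFO_ID = re.compile(r"^(##INFO=<ID=)([^,>]+)")


def sanitize_info_id(raw_id: str) -> str:
    """Hypothetical helper: map an INFO ID onto [A-Za-z0-9._-]."""
    cleaned = raw_id.strip("'\"")  # 'SampleID' -> SampleID
    cleaned = cleaned.replace("/", "-").replace("(", "_").replace(")", "")
    # Assumption: any other disallowed character is replaced by "_".
    return re.sub(r"[^A-Za-z0-9._-]", "_", cleaned)


def fix_header_line(line: str) -> str:
    """Rewrite the ID of an INFO header line; other lines pass through."""
    return INFO_ID.sub(lambda m: m.group(1) + sanitize_info_id(m.group(2)), line)


if __name__ == "__main__":
    print(fix_header_line("##INFO=<ID=Count_htz/allHom(sample),Number=1,Type=String"))
```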

Regarding the second error, several header lines are affected. Most INFO entries are declared with "Number=1", but they should use "Number=." because they can carry more than one value (I was using a combined conversion mode).
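
One way to find such fields is to collect every INFO ID declared with Number=1 and then look for records where that key holds a comma-separated list. The following is only a rough sketch: `find_number_mismatches` is a hypothetical helper, and it assumes a tab-separated VCF with INFO in the eighth column and commas used solely as value separators.

```python
import re

# Captures the ID and Number of an INFO header declaration.
INFO_DECL = re.compile(r"^##INFO=<ID=([^,>]+),Number=([^,>]+)")


def find_number_mismatches(vcf_lines):
    """Count, per INFO ID declared Number=1, the records holding several values."""
    declared_single = set()
    offenders = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            match = INFO_DECL.match(line)
            if match and match.group(2) == "1":
                declared_single.add(match.group(1))
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and the #CHROM column header
        columns = line.split("\t")
        if len(columns) < 8:
            continue  # not a complete record
        for entry in columns[7].split(";"):
            key, sep, value = entry.partition("=")
            # A comma inside the value means more than one value for this key.
            if sep and key in declared_single and "," in value:
                offenders[key] = offenders.get(key, 0) + 1
    return offenders


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for info_id, count in sorted(find_number_mismatches(handle).items()):
            print(f"{info_id}\t{count}")
```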

The following header lines appear to be affected:

##INFO=<ID=Annotation_mode,Number=1,Type=String,Description="Indicate the type of annotation lines generated: annotation on the SV full length ('full'), annotation on each gene overlapped by the SV ('split')">
##INFO=<ID=Tx,Number=1,Type=String,Description="Transcript symbol">
##INFO=<ID=Tx_start,Number=1,Type=Integer,Description="Starting position of the transcript">
##INFO=<ID=Tx_end,Number=1,Type=Integer,Description="Ending position of the transcript">
##INFO=<ID=Overlapped_tx_length,Number=1,Type=Integer,Description="Length of the transcript (bp) overlapping with the SV">
##INFO=<ID=Overlapped_CDS_length,Number=1,Type=Integer,Description="Length of the CoDing Sequence (CDS) (bp) overlapped with the SV">
##INFO=<ID=Overlapped_CDS_percent,Number=1,Type=Integer,Description="Percent of the CoDing Sequence (CDS) (bp) overlapped with the SV">
##INFO=<ID=Frameshift,Number=1,Type=String,Description="Indicates if the CDS length is not divisible by three (yes or no)">
##INFO=<ID=Exon_count,Number=1,Type=Integer,Description="Number of exons of the transcript">
##INFO=<ID=Location,Number=1,Type=String,Description="SV location in the gene's. Values: txStart, txEnd, exon'i', intron'i' e.g. txStart-exon3">
##INFO=<ID=Location2,Number=1,Type=String,Description="SV location in the gene's coding regions. Values: UTR (no CDS in the gene), 5'UTR (before the CDS start), 3'UTR (after the CDS end), CDS (between the CDS start and the CDS end, can be in an exon or an intron). e.g. 3'UTR-CDS">
##INFO=<ID=Dist_nearest_SS,Number=1,Type=Integer,Description="Absolute distance to nearest splice site after considering exonic and intronic SV breakpoints">
##INFO=<ID=Nearest_SS_type,Number=1,Type=String,Description="Nearest splice site type: 5' (donor) or 3' (acceptor)">
##INFO=<ID=Intersect_start,Number=1,Type=Integer,Description="Start position of the intersection between the SV and a transcript">
##INFO=<ID=Intersect_end,Number=1,Type=Integer,Description="End position of the intersection between the SV and a transcript">
##INFO=<ID=RE_gene,Number=1,Type=String,Description="Name of the genes regulated by a regulatory element overlapped with the SV to annotate. When available, the regulated gene name is detailed with associated haploinsufficiency (HI), triplosensitivity (TS), Exomiser (EX) scores, OMIM and candidate genes. (For the filtering output, see the -REselect1 and -REselect2 options)">
##INFO=<ID=TAD_coordinate,Number=1,Type=String,Description="Coordinates of the TAD whose boundaries overlapped with the annotated SV (boundaries included in the coordinates)">
##INFO=<ID=ENCODE_experiment,Number=1,Type=String,Description="ENCODE experiments used to define the TAD">
##INFO=<ID=Cosmic_ID,Number=1,Type=String,Description="COSMIC identifier">
##INFO=<ID=Repeat_coord_left,Number=1,Type=String,Description="Repeats coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=Repeat_type_left,Number=1,Type=String,Description="Repeats type around the left SV breakpoint (+/- 100bp) e.g. AluSp, L2b, L1PA2, LTR12C, SVA_D, ...">
##INFO=<ID=Repeat_coord_right,Number=1,Type=String,Description="Repeats coordinates around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=Repeat_type_right,Number=1,Type=String,Description="Repeats type around the right SV breakpoint (+/- 100bp) e.g. AluSp, L2b, L1PA2, LTR12C, SVA_D, ...">
##INFO=<ID=Gap_left,Number=1,Type=String,Description="Gap regions coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=Gap_right,Number=1,Type=String,Description="Gap regions coordinates around the right SV breakpoint (+/-100bp)">
##INFO=<ID=SegDup_left,Number=1,Type=String,Description="Segmental Duplication regions coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=SegDup_right,Number=1,Type=String,Description="Segmental Duplication regions coordinates around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_left,Number=1,Type=String,Description="ENCODE blacklist regions coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_characteristics_left,Number=1,Type=String,Description="ENCODE blacklist regions characteristics around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_right,Number=1,Type=String,Description="ENCODE blacklist regions coordinates around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_characteristics_right,Number=1,Type=String,Description="ENCODE blacklist regions characteristics around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=ACMG,Number=1,Type=String,Description="ACMG genes">
##INFO=<ID=HI,Number=1,Type=Integer,Description="ClinGen Haploinsufficiency Score">
##INFO=<ID=TS,Number=1,Type=Float,Description="ClinGen Triplosensitivity Score">
##INFO=<ID=GenCC_disease,Number=1,Type=String,Description="GenCC disease name: e.g. Nizon-Isidor syndrome">
##INFO=<ID=GenCC_moi,Number=1,Type=String,Description="GenCC mode of inheritance">
##INFO=<ID=GenCC_classification,Number=1,Type=String,Description="GenCC classification (Definitive, Strong, Moderate, Limited, Disputed, Animal Model Only, Refuted or No known disease relationship)">
##INFO=<ID=GenCC_pmid,Number=1,Type=String,Description="GenCC Pubmed Id">
##INFO=<ID=OMIM_phenotype,Number=1,Type=String,Description="e.g. Charcot-Marie-Tooth disease">
##INFO=<ID=OMIM_inheritance,Number=1,Type=String,Description="e.g. AD (= 'Autosomal dominant'). Detailed in AnnotSV's FAQ.">
##INFO=<ID=AnnotSV_ranking_criteria,Number=1,Type=String,Description="Decision criteria explaining the AnnotSV ranking score">
##INFO=<ID=ACMG_class,Number=1,Type=String,Description="SV ranking class into 1 of 5: class 1 (benign), class 2 (likely benign), class 3 (variant of unknown significance), class 4 (likely pathogenic), class 5 (pathogenic)">
##INFO=<ID=P_snvindel_phen,Number=1,Type=String,Description="Phenotypes of pathogenic snv/indel from public databases completely overlapped with the SV to annotate">

Changing "Number=1" to "Number=." in all of them gave me a VCF that passes validation.
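
For reference, the header edit can be scripted along these lines. This is a sketch only: `relax_info_number` is a hypothetical helper, and it changes only the declaration, relying on "Number=." meaning an unknown or variable number of values in the VCF specification.

```python
import re

# Matches an INFO declaration whose Number is exactly 1 (not 10, 12, ...).
NUMBER_ONE = re.compile(r"^(##INFO=<ID=([^,>]+),Number=)1(?=[,>])")


def relax_info_number(header_line, info_ids):
    """Return the line with Number=1 changed to Number=. for the given IDs."""
    match = NUMBER_ONE.match(header_line)
    if match and match.group(2) in info_ids:
        # Keep everything before and after the "1", swap in ".".
        return match.group(1) + "." + header_line[match.end():]
    return header_line


if __name__ == "__main__":
    line = '##INFO=<ID=ACMG_class,Number=1,Type=String,Description="SV ranking class">'
    print(relax_info_number(line, {"ACMG_class"}))
```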
I just wanted to let you know in case others experience similar issues and to see if it is possible to amend this in the future.

Thanks!

lgmgeo (Owner) commented Oct 12, 2024

Hi @aligogon ,

The VCF output file is created by the variantconvert tool from the AnnotSV TSV output file.
May I ask you to post your bug report to https://github.com/SamuelNicaise/variantconvert/issues ?

Best regards,
Véronique

aligogon (Author) commented

Thanks for your reply @lgmgeo ! I will post the issue there!

Best Regards
