Invalid VCF according to VCF specification #38

aligogon · 2024-11-12T16:12:25Z

Hi! @SamuelNicaise
I am reposting an issue I opened at lgmgeo/AnnotSV#261

I am copying the details here:

After obtaining a VCF from AnnotSV, I ran it through the EBI VCF validator (https://github.com/EBIvariation/vcf-validator) and got the following errors:

According to the VCF specification, the input file is not valid
Error: Metadata ID contains a character different from alphanumeric, dot, underscore and dash. This occurs 7 time(s), first time in line 19.
Error: INFO ACMG_class does not match the meta specification Number=1 (expected 1 value(s)). This occurs 6 time(s), first time in line 349.

Regarding the first error, I corrected it by changing the following IDs in the header:

##INFO=<ID='SampleID'
##INFO=<ID=Compound_htz(sample)
##INFO=<ID=Count_hom(sample)
##INFO=<ID=Count_htz(sample)
##INFO=<ID=Count_htz/allHom(sample)
##INFO=<ID=Count_htz/total(cohort)
##INFO=<ID=Count_total(cohort)

Into these:

##INFO=<ID=SampleID
##INFO=<ID=Compound_htz_sample
##INFO=<ID=Count_hom_sample
##INFO=<ID=Count_htz_sample
##INFO=<ID=Count_htz-allHom_sample
##INFO=<ID=Count_htz-total_cohort
##INFO=<ID=Count_total_cohort
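The renaming above can be automated. Below is a minimal sketch, not part of AnnotSV or the converter: `sanitize_id` and `fix_header_line` are hypothetical helper names, and the replacement rules (drop quotes, `/` becomes `-`, `(` becomes `_`, `)` is dropped) are simply inferred from the manual edits listed above. The allowed character set comes from the validator message (alphanumeric, dot, underscore and dash).

```python
import re

# Matches the ID of an INFO header line, e.g. "##INFO=<ID=Count_hom(sample),Number=..."
HEADER_ID = re.compile(r"^(##INFO=<ID=)([^,>]+)")


def sanitize_id(raw_id: str) -> str:
    """Rewrite an INFO ID so it only uses alphanumerics, dot, underscore and dash."""
    cleaned = raw_id.replace("'", "")                      # 'SampleID' -> SampleID
    cleaned = cleaned.replace("/", "-")                    # htz/allHom -> htz-allHom
    cleaned = cleaned.replace("(", "_").replace(")", "")   # (sample) -> _sample
    # Assumption: any other disallowed character is replaced by an underscore.
    return re.sub(r"[^A-Za-z0-9._-]", "_", cleaned)


def fix_header_line(line: str) -> str:
    """Sanitize the ID of an INFO header line; other lines are returned unchanged."""
    return HEADER_ID.sub(lambda m: m.group(1) + sanitize_id(m.group(2)), line, count=1)


if __name__ == "__main__":
    for header in ("##INFO=<ID='SampleID'", "##INFO=<ID=Count_htz/allHom(sample)"):
        print(fix_header_line(header))
```

This sketch only touches header lines. The same keys presumably also appear in the INFO column of the data lines, so those would need the same renaming to stay consistent with the header.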

Regarding the second error, I found that several header lines are affected. Most INFO entries are defined with "Number=1"; however, they should specify "Number=." because they can hold more than one value (I was working in a combined conversion mode).

I have found the following lines apparently affected:

##INFO=<ID=Annotation_mode,Number=1,Type=String,Description="Indicate the type of annotation lines generated: annotation on the SV full length ('full'), annotation on each gene overlapped by the SV ('split')">
##INFO=<ID=Tx,Number=1,Type=String,Description="Transcript symbol">
##INFO=<ID=Tx_start,Number=1,Type=Integer,Description="Starting position of the transcript">
##INFO=<ID=Tx_end,Number=1,Type=Integer,Description="Ending position of the transcript">
##INFO=<ID=Overlapped_tx_length,Number=1,Type=Integer,Description="Length of the transcript (bp) overlapping with the SV">
##INFO=<ID=Overlapped_CDS_length,Number=1,Type=Integer,Description="Length of the CoDing Sequence (CDS) (bp) overlapped with the SV">
##INFO=<ID=Overlapped_CDS_percent,Number=1,Type=Integer,Description="Percent of the CoDing Sequence (CDS) (bp) overlapped with the SV">
##INFO=<ID=Frameshift,Number=1,Type=String,Description="Indicates if the CDS length is not divisible by three (yes or no)">
##INFO=<ID=Exon_count,Number=1,Type=Integer,Description="Number of exons of the transcript">
##INFO=<ID=Location,Number=1,Type=String,Description="SV location in the gene's. Values: txStart, txEnd, exon'i', intron'i' e.g. txStart-exon3">
##INFO=<ID=Location2,Number=1,Type=String,Description="SV location in the gene's coding regions. Values: UTR (no CDS in the gene), 5'UTR (before the CDS start), 3'UTR (after the CDS end), CDS (between the CDS start and the CDS end, can be in an exon or an intron). e.g. 3'UTR-CDS">
##INFO=<ID=Dist_nearest_SS,Number=1,Type=Integer,Description="Absolute distance to nearest splice site after considering exonic and intronic SV breakpoints">
##INFO=<ID=Nearest_SS_type,Number=1,Type=String,Description="Nearest splice site type: 5' (donor) or 3' (acceptor)">
##INFO=<ID=Intersect_start,Number=1,Type=Integer,Description="Start position of the intersection between the SV and a transcript">
##INFO=<ID=Intersect_end,Number=1,Type=Integer,Description="End position of the intersection between the SV and a transcript">
##INFO=<ID=RE_gene,Number=1,Type=String,Description="Name of the genes regulated by a regulatory element overlapped with the SV to annotate. When available, the regulated gene name is detailed with associated haploinsufficiency (HI), triplosensitivity (TS), Exomiser (EX) scores, OMIM and candidate genes. (For the filtering output, see the -REselect1 and -REselect2 options)">
##INFO=<ID=TAD_coordinate,Number=1,Type=String,Description="Coordinates of the TAD whose boundaries overlapped with the annotated SV (boundaries included in the coordinates)">
##INFO=<ID=ENCODE_experiment,Number=1,Type=String,Description="ENCODE experiments used to define the TAD">
##INFO=<ID=Cosmic_ID,Number=1,Type=String,Description="COSMIC identifier">
##INFO=<ID=Repeat_coord_left,Number=1,Type=String,Description="Repeats coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=Repeat_type_left,Number=1,Type=String,Description="Repeats type around the left SV breakpoint (+/- 100bp) e.g. AluSp, L2b, L1PA2, LTR12C, SVA_D, ...">
##INFO=<ID=Repeat_coord_right,Number=1,Type=String,Description="Repeats coordinates around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=Repeat_type_right,Number=1,Type=String,Description="Repeats type around the right SV breakpoint (+/- 100bp) e.g. AluSp, L2b, L1PA2, LTR12C, SVA_D, ...">
##INFO=<ID=Gap_left,Number=1,Type=String,Description="Gap regions coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=Gap_right,Number=1,Type=String,Description="Gap regions coordinates around the right SV breakpoint (+/-100bp)">
##INFO=<ID=SegDup_left,Number=1,Type=String,Description="Segmental Duplication regions coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=SegDup_right,Number=1,Type=String,Description="Segmental Duplication regions coordinates around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_left,Number=1,Type=String,Description="ENCODE blacklist regions coordinates around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_characteristics_left,Number=1,Type=String,Description="ENCODE blacklist regions characteristics around the left SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_right,Number=1,Type=String,Description="ENCODE blacklist regions coordinates around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=ENCODE_blacklist_characteristics_right,Number=1,Type=String,Description="ENCODE blacklist regions characteristics around the right SV breakpoint (+/- 100bp)">
##INFO=<ID=ACMG,Number=1,Type=String,Description="ACMG genes">
##INFO=<ID=HI,Number=1,Type=Integer,Description="ClinGen Haploinsufficiency Score">
##INFO=<ID=TS,Number=1,Type=Float,Description="ClinGen Triplosensitivity Score">
##INFO=<ID=GenCC_disease,Number=1,Type=String,Description="GenCC disease name: e.g. Nizon-Isidor syndrome">
##INFO=<ID=GenCC_moi,Number=1,Type=String,Description="GenCC mode of inheritance">
##INFO=<ID=GenCC_classification,Number=1,Type=String,Description="GenCC classification (Definitive, Strong, Moderate, Limited, Disputed, Animal Model Only, Refuted or No known disease relationship)">
##INFO=<ID=GenCC_pmid,Number=1,Type=String,Description="GenCC Pubmed Id">
##INFO=<ID=OMIM_phenotype,Number=1,Type=String,Description="e.g. Charcot-Marie-Tooth disease">
##INFO=<ID=OMIM_inheritance,Number=1,Type=String,Description="e.g. AD (= 'Autosomal dominant'). Detailed in AnnotSV's FAQ.">
##INFO=<ID=AnnotSV_ranking_criteria,Number=1,Type=String,Description="Decision criteria explaining the AnnotSV ranking score">
##INFO=<ID=ACMG_class,Number=1,Type=String,Description="SV ranking class into 1 of 5: class 1 (benign), class 2 (likely benign), class 3 (variant of unknown significance), class 4 (likely pathogenic), class 5 (pathogenic)">
##INFO=<ID=P_snvindel_phen,Number=1,Type=String,Description="Phenotypes of pathogenic snv/indel from public databases completely overlapped with the SV to annotate">

Changing "Number=1" to "Number=." in all of them apparently gave me a valid VCF.
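For anyone who wants to apply the same workaround in bulk, here is a rough sketch of the header change. It is not the converter's own code: `relax_number` is a hypothetical helper, and the set of IDs passed to it is an assumption that would have to be filled with the affected fields listed above.

```python
import re

# Captures everything up to "Number=" plus the INFO ID, and matches only a literal "1".
NUMBER_FIELD = re.compile(r"^(##INFO=<ID=([^,>]+),Number=)1(?=,)")


def relax_number(line: str, ids: set) -> str:
    """Turn Number=1 into Number=. for INFO header lines whose ID is in `ids`."""
    match = NUMBER_FIELD.match(line)
    if match and match.group(2) in ids:
        # Keep the prefix and the rest of the line, swap only the "1" for "."
        return match.group(1) + "." + line[match.end():]
    return line


if __name__ == "__main__":
    # Assumption: a small subset of the affected IDs, for illustration only.
    affected = {"Tx", "ACMG_class"}
    header = '##INFO=<ID=Tx,Number=1,Type=String,Description="Transcript symbol">'
    print(relax_number(header, affected))
```

Restricting the change to an explicit set of IDs leaves fields that really are single-valued declared as "Number=1".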
I just wanted to let you know in case others experience similar issues and to see if it is possible to amend this in the future.

Thanks!
