Changes to parser.py and model.py to allow for nonstandard VCF file headers #326
This is in regard to issue "Custom section delimited by ";" and not ":" #325
As it turns out, the issue was not the delimiter. The VCF file had additional non-standard headers and columns that came after the Genotype fields. These columns held data formatted like the INFO column, including values with colons (such as `GENEINFO=PRDM2:7799`), which the Genotype parser attempted to add to the Genotype "samples" list. Since such a value didn't match the expected format of "GT:GL:GOF:GQ:NR:NV", the code crashed: it split `GENEINFO=PRDM2:7799` at the colon, creating a garbage row it couldn't parse.

I modified the code to accept VCF files with additional non-standard headers/columns. In `parser.py`, the `__next__()` method used a list, `row`, which was accessed by hardcoded indexes (`row[0]` to `row[9]`). I replaced this with a dictionary derived by parsing the header row (identified by its leading `#CHROM`). The code uses the headers as dict keys and, for each row, creates a `rowdict` variable containing "header1":"col_1_data", "header2":"col_2_data". This way, additional non-standard columns don't crash the software.

However, I did not go so far as to modify the `_parse_samples()` method, so (right now, in this code) the additional columns are still being added to the Genotype samples.

I just thought I would pass this code up as a pull request so you could decide if there's any value for it in the master PyVCF.
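
For illustration, here is a minimal sketch of the header-keyed row idea described above. It is not the code from this pull request: the helper names `parse_header` and `row_to_dict`, the sample column `SAMPLE1`, the extra column `EXTRA_ANNOT`, and the data values are all hypothetical, and it assumes plain tab-separated lines.

```python
# Illustrative sketch only; helper names and sample data are hypothetical.

def parse_header(line):
    """Return the column names from the '#CHROM ...' header line."""
    return line.rstrip("\n").lstrip("#").split("\t")


def row_to_dict(header, line):
    """Map each header name to the matching tab-separated field."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(header, fields))


# A header with one sample column plus one non-standard trailing column.
header_line = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tSAMPLE1\tEXTRA_ANNOT\n"
)
data_line = (
    "1\t14022\t.\tG\tA\t50\tPASS\tDP=10\tGT:GQ"
    "\t0/1:99\tGENEINFO=PRDM2:7799\n"
)

header = parse_header(header_line)
rowdict = row_to_dict(header, data_line)

# Fields are looked up by name, so a trailing extra column does not
# shift or break access to the standard ones.
print(rowdict["FORMAT"])
print(rowdict["EXTRA_ANNOT"])
```

Because every field is addressed by its header name, the colon-containing value stays intact in its own key instead of being handed to the sample parser by position. Deciding which keys are real samples would still be a separate step.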
Thanks!
Mike