Changes to parser.py and model.py to allow for nonstandard VCF file headers #326

BioComSoftware · 2020-08-13T13:57:15Z

This is in regards to issue "Custom section delimited by ";" and not ":" #325

As it turns out, the issue was not the delimiter. The VCF file had additional non-standard headers and columns that came after the Genotype fields. These columns contained data formatted like the INFO column, including values with colons (such as GENEINFO=PRDM2:7799), which the Genotype parser attempted to add to the Genotype "samples" list. Since that data didn't match the expected format of "GT:GL:GOF:GQ:NR:NV", the code crashed: it split the value at the colon in GENEINFO=PRDM2:7799, creating a garbage row it couldn't parse.

I modified the code to accept VCF files with additional non-standard headers/columns. In the __next__() method of parser.py, you used a list "row" that is accessed by hardcoded indexes (row[0] through row[9]). I replaced this with a dictionary derived by parsing the header row (identified by its leading #CHROM). The code uses the headers as dict keys and, for each row, creates a "rowdict" variable containing "header1":"col_1_data", "header2":"col_2_data", and so on. This way, additional non-standard columns don't crash the software.
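To illustrate the idea, here is a minimal sketch of building a per-row dictionary from the #CHROM header line. It is not the code from this pull request: the helper names `parse_header` and `make_rowdict`, the extra column name `ANNOTATION`, and the sample data are all hypothetical.

```python
# Hypothetical sketch: helper names and sample data are illustrative,
# not taken from PyVCF or from this pull request.

def parse_header(line):
    """Return the column names from the '#CHROM ...' header line."""
    if not line.startswith("#CHROM"):
        raise ValueError("not a column header line")
    # Drop the leading '#' so the first key is 'CHROM'.
    return line.lstrip("#").rstrip("\n").split("\t")


def make_rowdict(headers, line):
    """Map each header name to the matching column of one data line."""
    fields = line.rstrip("\n").split("\t")
    # zip() stops at the shorter sequence, so a short row simply
    # yields fewer keys instead of raising an IndexError.
    return dict(zip(headers, fields))


header_line = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
    "SAMPLE1\tANNOTATION\n"  # ANNOTATION: assumed non-standard column
)
data_line = (
    "1\t14107321\t.\tG\tA\t50\tPASS\tDP=10\tGT:GQ\t"
    "0/1:99\tGENEINFO=PRDM2:7799\n"
)

headers = parse_header(header_line)
rowdict = make_rowdict(headers, data_line)
print(rowdict["POS"], rowdict["ANNOTATION"])
```

Looking columns up by name rather than by position means an extra trailing column becomes one more key in the dictionary instead of shifting or overflowing fixed indexes.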

However, I did not go so far as to modify the _parse_samples() method, so (right now, in this code) the additional columns are still being added to the Genotype samples.

I just thought I would pass this code up as a pull request so you could decide if there's any value for it in the master PyVCF.

Thanks!
Mike


BioComSoftware · 2020-08-14T13:33:21Z

I changed the code to accommodate the additional non-standard columns as they relate to the Genotype data (self.samples and self._sample_indexes).

Now the code waits to set self.samples and self._sample_indexes until after the first row of data is read. It then uses a regular expression to determine which of the additional columns should be added to the samples. Any additional columns that do not match the FORMAT column are NOT added to self.samples; however, they are still retained in rowdict, so the data remains accessible.
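As a rough sketch of that kind of check (the actual regular expression used in the pull request is not shown here, so the pattern below is an assumption), a column can be treated as a sample when its first-row value has at most as many colon-separated fields as FORMAT and contains no "=" or ";". The names `sample_pattern`, `split_columns`, and `ANNOTATION` are hypothetical.

```python
import re

# The nine fixed VCF columns; anything after them is a candidate sample.
FIXED = {"CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"}


def sample_pattern(format_field):
    """Build a regex for values shaped like the FORMAT column (assumed heuristic)."""
    n = len(format_field.split(":"))
    # One to n colon-separated values, none containing ':', '=' or ';'.
    # INFO-style data such as 'GENEINFO=PRDM2:7799' fails on the '='.
    return re.compile(r"^[^:=;]+(?::[^:=;]+){0,%d}$" % (n - 1))


def split_columns(rowdict):
    """Return (sample_names, extra_names) based on the first data row."""
    pattern = sample_pattern(rowdict["FORMAT"])
    samples, extras = [], []
    for name, value in rowdict.items():
        if name in FIXED:
            continue
        (samples if pattern.match(value) else extras).append(name)
    return samples, extras


first_row = {
    "CHROM": "1", "POS": "14107321", "ID": ".", "REF": "G", "ALT": "A",
    "QUAL": "50", "FILTER": "PASS", "INFO": "DP=10",
    "FORMAT": "GT:GL:GOF:GQ:NR:NV",
    "SAMPLE1": "0/1:-10,0,-20:5:99:30:12",
    "ANNOTATION": "GENEINFO=PRDM2:7799",  # assumed non-standard column
}
samples, extras = split_columns(first_row)
print(samples, extras)
```

This heuristic is deliberately loose: an extra column whose value is a bare "." would still be classified as a sample, so a real implementation would likely need stricter rules.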

I'm considering having it parse all these additional columns of data and - if they are simply key:value pairs - have them automagically added to the INFO column.

I don't know if these changes will come with this existing pull request, or if I'd need to make another one to include them. Again, feel free to add this to the PyVCF code if you think it's useful. I just made these mods because it works with the non-standard VCF files being used by DKFZ.

Best!


DKFZ-UNITE-Administration added 3 commits August 13, 2020 15:39

Changes to parser.py and model.py to allow for nonstandard VCF file headers.

0a9e0af

Adjusted model.py _record.str() to respond with all columns, not just CHROM, POS, REF, and ALT.

600a429

Moved the enumeration of self.samples and self._samples_indexes to after the first row of data is read. This way, non-standard columns of data can be included or excluded as appropriate.

7a1c849

Corrected little bump I caused in _parse_samples when I modified things to use rowdict.

91f3bd2
