Changes to parser.py and model.py to allow for nonstandard VCF file headers #326

Open
wants to merge 4 commits into
base: master


@BioComSoftware BioComSoftware commented Aug 13, 2020

This is in regard to issue "Custom section delimited by ";" and not ":" #325

As it turns out, the issue was not the delimiter. The VCF file had additional non-standard headers and columns that came after the Genotype fields. These columns contained data formatted like the INFO column, including values with colons (such as GENEINFO=PRDM2:7799), which the Genotype parser attempted to add to the Genotype "samples" list. Since these values didn't match the expected format of "GT:GL:GOF:GQ:NR:NV", the code crashed (it split the entry at the colon in GENEINFO=PRDM2:7799, creating a garbage row it couldn't parse).

I modified the code to accept VCF files with additional non-standard headers/columns. In the __next__() method of parser.py, you used a list "row", which is accessed by hardcoded indexes (row[0] - row[9]). I replaced this with a dictionary derived by parsing the header row (identified by starting with #CHROM). The code uses the headers as dict keys and, for each row, creates a "rowdict" variable containing "header1":"col_1_data", "header2":"col_2_data". This way, additional non-standard columns don't crash the software.
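A minimal sketch of the header-keyed idea described above, not the actual diff in this pull request. The helper names (parse_header, row_to_dict) and the column names SAMPLE1 and EXTRA are hypothetical and chosen only for illustration; the real change lives inside the Reader's __next__() method.

```python
def parse_header(line):
    """Return the column names from the '#CHROM' header line (hypothetical helper)."""
    if not line.startswith("#CHROM"):
        raise ValueError("not a VCF column header line")
    # Drop the leading '#' and the line ending, then split on tabs.
    return line.lstrip("#").rstrip("\r\n").split("\t")


def row_to_dict(headers, line):
    """Map each header name to the value in the same column (hypothetical helper)."""
    values = line.rstrip("\r\n").split("\t")
    # zip() stops at the shorter sequence, so a short row does not raise here.
    return dict(zip(headers, values))


# Made-up example lines; EXTRA stands in for a non-standard trailing column.
header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tEXTRA\n"
data_line = "1\t14000\t.\tA\tG\t50\tPASS\tDP=10\tGT:GQ\t0/1:99\tGENEINFO=PRDM2:7799\n"

headers = parse_header(header_line)
rowdict = row_to_dict(headers, data_line)

# Columns are looked up by name instead of by a hardcoded index.
print(rowdict["CHROM"], rowdict["POS"], rowdict["EXTRA"])
```

Looking columns up by name means an unexpected trailing column simply becomes one more key, rather than shifting or overflowing positional indexes.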

However, I did not go so far as to modify the _parse_samples() method, so (right now, in this code) the additional columns are still being added to the Genotype samples.

I just thought I would pass this code up as a pull request so you could decide if there's any value for it in the master PyVCF.

Thanks!
Mike

…ter the first row of data is read. This way, non-standard columns of data can be included or excluded as appropriate.
@BioComSoftware
Author

I changed the code to accommodate the additional non-standard columns as they relate to the Genotype data (self.samples and self._sample_indexes).

Now the code waits to set self.samples and self._sample_indexes until after the first row of data is read. It then uses a regex to determine which of the additional columns should be added to the samples. Any additional columns that do not match the FORMAT column are NOT added to self.samples; however, they are still retained in rowdict, so the data remains accessible.
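The regex used in the commits is not shown in this conversation, so the following is only a sketch of one way such a check could work. It assumes a column counts as a sample when it has exactly as many colon-separated fields as the FORMAT column has keys and contains no "=" sign. The function names and the example row are hypothetical, and the sketch does not handle sample entries that drop trailing fields.

```python
import re


def sample_pattern(format_field):
    """Build a regex for values shaped like the FORMAT column (assumed heuristic)."""
    n_keys = len(format_field.split(":"))
    # One field: anything except ':', '=' and whitespace.
    field = r"[^:=\s]+"
    # First field, followed by exactly n_keys - 1 more ':'-prefixed fields.
    return re.compile(r"(?:%s)(?::%s){%d}" % (field, field, n_keys - 1))


def split_columns(rowdict, candidate_headers):
    """Separate sample-like columns from other trailing columns (hypothetical helper)."""
    pattern = sample_pattern(rowdict["FORMAT"])
    samples, others = [], []
    for name in candidate_headers:
        if pattern.fullmatch(rowdict[name]):
            samples.append(name)
        else:
            others.append(name)
    return samples, others


# Made-up first data row, already keyed by header name.
example_row = {
    "FORMAT": "GT:GL:GOF:GQ:NR:NV",
    "SAMPLE1": "0/1:-30.5,0,-12.1:3:99:20:8",
    "EXTRA": "GENEINFO=PRDM2:7799",
}

samples, others = split_columns(example_row, ["SAMPLE1", "EXTRA"])
print(samples, others)
```

Columns that land in the second list stay available through the row dictionary; they are only kept out of the sample list.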

I'm considering having it parse all these additional columns of data and - if they are simply key:value pairs - have them automagically added to the INFO column.

I don't know if these changes will come with this existing pull request, or if I'd need to make another one to include them. Again, feel free to add this to the PyVCF code if you think it's useful. I just made these mods because it works with the non-standard VCF files being used by DKFZ.

Best!
