A sample column with the value 0|0 is not parsed correctly when FORMAT is "GT" and the GT FieldInfo is not specified in the header.
Using a file with 11 columns, FORMAT="GT", "SAMPLE1" = "0|0", "SAMPLE2" = "1|0", the parser leaves list artifacts in the genotype string:
ValueError: invalid literal for int() with base 10: "['0"
What I Did
>>> import vcfpy
>>> path = '/path/to/file.vcf'
>>> reader = vcfpy.Reader.from_path(path)
>>> for record in reader:
...     pass  # do work
Stack Trace:
python3.12/site-packages/vcfpy/header.py:413: FieldInfoNotFound: FORMAT GT not found using String/'.' instead
warnings.warn(
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "python3.12/site-packages/vcfpy/reader.py", line 175, in __next__
result = self.parser.parse_next_record()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/vcfpy/parser.py", line 804, in parse_next_record
return self.parse_line(self._read_next_line())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/vcfpy/parser.py", line 795, in parse_line
return self._record_parser.parse_line(line)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/vcfpy/parser.py", line 467, in parse_line
calls = self._handle_calls(alts, format_, arr[8], arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/vcfpy/parser.py", line 481, in _handle_calls
call = record.Call(sample, data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/vcfpy/record.py", line 236, in __init__
self._genotype_updated()
File "python3.12/site-packages/vcfpy/record.py", line 259, in _genotype_updated
self.gt_alleles.append(int(allele))
^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: "['0"
The issue appears to be in the function parse_field_value in vcfpy/parser.py. Its default behavior is to split the value on ',' and return an array of the converted values. When, as in the test data set, there is no ',' character to split on, an array of length 1 is returned. That value is then used in _genotype_updated in vcfpy/record.py, where it is passed to the regex split for allele in ALLELE_DELIM.split(str(self.data["GT"])):. The split therefore runs not on the genotype string but on the string form of a one-element list, so the first piece keeps the list artifact [' (as in "['0"), which int() cannot parse.
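The mechanism described above can be reproduced without vcfpy. The following is a minimal sketch, not vcfpy's actual code: split_on_comma is a hypothetical stand-in for the comma-splitting step, and the ALLELE_DELIM pattern is an assumption about the delimiter regex.

```python
import re

# Assumption: genotype alleles are separated by "|" or "/".
# This pattern is a stand-in, not copied from vcfpy.
ALLELE_DELIM = re.compile(r"[|/]")


def split_on_comma(raw):
    """Hypothetical stand-in for the comma-splitting step: always returns a list."""
    return raw.split(",")


# No comma in the value, so the result is a list of length 1.
value = split_on_comma("0|0")

# str() turns the list into its repr, and the regex then splits that repr.
pieces = ALLELE_DELIM.split(str(value))
print(value)
print(pieces)

# The first piece still carries the list bracket and quote, so int() fails.
try:
    int(pieces[0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The first piece of the split still starts with the list bracket and opening quote, which matches the string shown in the reported ValueError.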
parser.parse_field_value could return a single string when the length is 1, rather than a list of length one, or (probably a safer change) record._genotype_updated could check whether the value of self.data["GT"] is an array, rather than assuming it is a string.
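The second option might look roughly like the sketch below. It is only an illustration of the idea under stated assumptions: gt_alleles_from is a hypothetical helper, not a vcfpy function, the ALLELE_DELIM pattern is assumed, and mapping "." to None is an assumption about how missing alleles should be represented.

```python
import re

# Assumption: genotype alleles are separated by "|" or "/".
ALLELE_DELIM = re.compile(r"[|/]")


def gt_alleles_from(gt):
    """Hypothetical helper: accept GT as a string or as a one-element list."""
    if isinstance(gt, (list, tuple)):
        if len(gt) != 1:
            raise ValueError(f"expected exactly one GT value, got {gt!r}")
        gt = gt[0]  # unwrap instead of calling str() on the list
    alleles = []
    for allele in ALLELE_DELIM.split(str(gt)):
        # Assumption: "." marks a missing allele and is stored as None.
        alleles.append(None if allele == "." else int(allele))
    return alleles


print(gt_alleles_from(["0|0"]))
print(gt_alleles_from("1|0"))
```

Unwrapping the list before the split leaves the comma-splitting behavior untouched for other fields, which is why this looks like the less invasive of the two changes.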
JarvisVon changed the title from "A sample column value of 0|0, is not being parsed correctly" to "A sample column value of 0|0 is not being parsed correctly" on Jun 24, 2024.