Pysam UnicodeDecodeError when loading with tabixed VCF #139

dgomezpere · 2020-09-14T16:50:10Z

vcfpy version: 0.13.2
Python version: 3.6.9 64bit [GCC 8.4.0]
Operating System: Linux 4.15.0 1093 oem x86_64 with Ubuntu 18.04 bionic

Description

When I fetch variants by contig ID, I get the following UnicodeDecodeError, which suggests a problem when parsing the tabix-indexed file. The issue may come from pysam, but I would like to know whether you have had previous reports of it.

What I Did

Tabix VCF file

$ tabix -p vcf <vcf_filepath>

reader = vcfpy.Reader.from_path(path=DATA['annot_vcf'], tabix_path=DATA['annot_vcf']+'.tbi')
for record in reader.fetch('chr1'):
    [...]

Traceback Error

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-046818a3e579> in <module>
      3 variant_records = []
      4 sample_records = []
----> 5 for record in reader.fetch('chr1'):
      6     if record.CHROM in wanted_chroms:
      7         ALT = record.ALT[0].value

/usr/local/lib/python3.6/dist-packages/vcfpy/reader.py in __next__(self)
    171         """
    172         if self.tabix_iter:
--> 173             return self.parser.parse_line(str(next(self.tabix_iter)))
    174         else:
    175             result = self.parser.parse_next_record()

pysam/libctabix.pyx in pysam.libctabix.TabixIterator.__next__()

pysam/libcutils.pyx in pysam.libcutils.charptr_to_str()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2821: ordinal not in range(128)

holtgrewe · 2020-09-14T20:04:27Z

Interesting, what is your locale setting? C? What happens if you set export LC_ALL=en_US.UTF-8 or similar?

dgomezpere · 2020-09-14T21:57:02Z

Hi @holtgrewe !!
My locale settings are already in en_US.UTF-8:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

dgomezpere · 2020-09-16T14:46:55Z

Any other idea about the issue @holtgrewe??
Thanks in advance!!

holtgrewe · 2020-09-16T15:24:23Z

It looks like you have non-ASCII Unicode characters in your VCF file and pysam is stumbling over them...
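
One way to check this diagnosis is to scan the compressed VCF for bytes outside the ASCII range. The byte 0xc3 in the traceback is a typical lead byte of a two-byte UTF-8 sequence (for example "é" is 0xc3 0xa9), so an accented character in an annotation field is a plausible culprit. The following is only a minimal sketch using the standard library; the helper name `find_non_ascii_lines` and its `limit` parameter are made up for illustration and are not part of vcfpy or pysam.

```python
import gzip


def find_non_ascii_lines(vcf_gz_path, limit=10):
    """Return up to `limit` tuples (line_number, column, byte), one per
    line that contains a non-ASCII byte. Hypothetical helper, not part
    of vcfpy or pysam."""
    hits = []
    # bgzip output is a series of concatenated gzip members, so the
    # standard gzip module can read it without pysam.
    with gzip.open(vcf_gz_path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            for column, byte in enumerate(raw):
                if byte > 0x7F:  # anything above 127 is not ASCII
                    hits.append((line_number, column, byte))
                    break  # the first offending byte per line is enough
            if len(hits) >= limit:
                break
    return hits
```

Calling `find_non_ascii_lines(DATA['annot_vcf'])` returns 1-based line numbers and 0-based byte columns. If it reports hits, cleaning or re-encoding those fields before running bgzip and tabix again may avoid the error.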

holtgrewe · 2020-09-16T15:49:20Z

Hm, I don't remember why I was using pysam in favour of pytabix. I don't know whether that is more robust... Hm, one could try to replace the tabix part of pysam with pytabix in vcfpy...
