PANDAS - EOF inside string starting at row #36

Open
GACGAMA opened this issue Jul 9, 2024 · 0 comments

GACGAMA commented Jul 9, 2024

Hello!
I'm using AnnotSV to annotate a CNVkit VCF.
When I try to convert the AnnotSV TSV file, I get this error for all of my VCFs, each time at a different row:


 variantconvert convert -i /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/BH14418_TUMOR_call_no_theta_.cnv.vcf.tsv -o /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/BH14418_TUMOR_call_no_theta_.cnv.vcf -c GRCh38/annotsv2.json
2024-07-09 14:24:20 [INFO] running variantconvert 2.0.1
Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/annotsv/bin/variantconvert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 222, in main
    main_convert(args)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 77, in main_convert
    converter.convert(args.inputFile, args.outputFile)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 398, in convert
    self.input_df = self._build_input_dataframe()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 44, in _build_input_dataframe
    df = pd.read_csv(
         ^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 239, in read
    data = self._reader.read(nrows)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 820, in pandas._libs.parsers.TextReader.read
  File "parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 8112
(annotsv) [ggama1@c003 GRCh38]$ variantconvert convert -i /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/annot_sample_tumor.vcf.tsv -o /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/annot_sample_tumor.vcf -c GRCh38/annotsv2.json
2024-07-09 14:25:08 [INFO] running variantconvert 2.0.1
Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/annotsv/bin/variantconvert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 222, in main
    main_convert(args)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 77, in main_convert
    converter.convert(args.inputFile, args.outputFile)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 398, in convert
    self.input_df = self._build_input_dataframe()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 44, in _build_input_dataframe
    df = pd.read_csv(
         ^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 239, in read
    data = self._reader.read(nrows)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 820, in pandas._libs.parsers.TextReader.read
  File "parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 1447

This error happens with variantconvert 2.0.1 installed with conda, and also when it is installed with pip inside a conda virtual environment.

Attached is the file used.

annot_sample_tumor.vcf.tsv.txt

I've also found this blog post:
https://www.shanelynn.ie/pandas-csv-error-error-tokenizing-data-c-error-eof-inside-string-starting-at-line/
It suggests the problem is related to pandas reading the file with its C engine instead of the Python engine.
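One way to check whether stray quote characters are involved is to list the lines of the TSV that contain an odd number of double quotes. The sketch below is only a diagnostic; the helper name `find_unbalanced_quote_lines` is hypothetical, and the row number pandas reports may not match the file line number exactly (skipped rows and the header shift it).

```python
def find_unbalanced_quote_lines(lines, quotechar='"'):
    """Return 1-based numbers of the lines holding an odd count of quotechar.

    `lines` can be any iterable of strings, for example an open file handle.
    """
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quotechar) % 2 == 1
    ]


def scan_file(path, encoding="utf-8"):
    """Convenience wrapper: run the check on a file on disk."""
    with open(path, encoding=encoding) as handle:
        return find_unbalanced_quote_lines(handle)
```

Calling `scan_file("annot_sample_tumor.vcf.tsv")` would print nothing by itself; it returns the list of suspicious line numbers, which can then be inspected by hand.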

I've tried to modify vcf_from_annotsv.py by adding:

# near the top of the file
import csv
import sys
csv.field_size_limit(sys.maxsize)

# in _build_input_dataframe
df = pd.read_csv(
    self.filepath,
    skiprows=self.config["GENERAL"]["skip_rows"],
    sep="\t",
    engine="python",
    encoding="utf-8",
    on_bad_lines="skip",
)

But this gives other errors:

Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/annotsv/bin/variantconvert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 222, in main
    main_convert(args)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 77, in main_convert
    converter.convert(args.inputFile, args.outputFile)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 404, in convert
    self.input_df = self._build_input_dataframe()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 48, in _build_input_dataframe
    df = pd.read_csv(
         ^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 252, in read
    content = self._get_lines(rows)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 1140, in _get_lines
    next_row = self._next_iter_line(row_num=self.pos + rows + 1)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 834, in _next_iter_line
    self._alert_malformed(msg, row_num)
  File "/home/ggama1/.conda/envs/annotsv/lib/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 781, in _alert_malformed
    raise ParserError(msg)
pandas.errors.ParserError: unexpected end of data
(annotsv) [ggama1@c003 GRCh38]$ variantconvert convert -i /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/annot_sample_tumor.vcf.tsv -o /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/annot_sample_tumor.vcf -c GRCh38/annotsv2.json
2024-07-09 14:42:42 [INFO] running variantconvert 2.0.1
Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/annotsv/bin/variantconvert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 222, in main
    main_convert(args)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/__main__.py", line 77, in main_convert
    converter.convert(args.inputFile, args.outputFile)
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 404, in convert
    self.input_df = self._build_input_dataframe()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch4/nsobrei2/programs/variantconvert/src/variantconvert/converters/vcf_from_annotsv.py", line 57, in _build_input_dataframe
    [self.config["VCF_COLUMNS"]["#CHROM"], self.config["VCF_COLUMNS"]["INFO"]["SV_start"]],
                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'SV_start'

Another proposed solution would be:

# near the top of the file
import csv

# in _build_input_dataframe
df = pd.read_csv(
    self.filepath,
    skiprows=self.config["GENERAL"]["skip_rows"],
    sep="\t",
    low_memory=False,
    quoting=csv.QUOTE_NONE,
)

This gives me the same KeyError: 'SV_start'.
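For reference, here is a minimal sketch of why `quoting=csv.QUOTE_NONE` changes the tokenizing behaviour. The TSV content is made up for illustration, and it is an assumption that the attached AnnotSV file fails for the same reason (an unmatched double quote inside a field). It says nothing about the `KeyError`, which is raised later from the config lookup shown in the traceback.

```python
import csv
import io

import pandas as pd

# Made-up TSV: the second data row has an opening quote that is never closed.
SAMPLE_TSV = 'A\tB\n1\tok\n2\t"unterminated\n3\tfine\n'


def parse_default(text):
    """Default quoting: the parser treats '"' as the start of a quoted field."""
    return pd.read_csv(io.StringIO(text), sep="\t")


def parse_no_quoting(text):
    """QUOTE_NONE: double quotes are kept as ordinary characters."""
    return pd.read_csv(io.StringIO(text), sep="\t", quoting=csv.QUOTE_NONE)
```

With this input, `parse_default(SAMPLE_TSV)` raises `pandas.errors.ParserError`, while `parse_no_quoting(SAMPLE_TSV)` returns all three rows and keeps the quote character as part of the value.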
