Conversion to Pandas in read_gtf has missing dependency #32

acrinklaw · 2023-02-07T23:18:51Z

Just installed gtfparse on an EC2 instance today and ran into an issue: pyarrow is required for the pandas conversion in read_gtf, but it isn't declared as a dependency, so it isn't pulled in during installation. Installing pyarrow manually fixes the issue.

In [5]: df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")
INFO:root:Extracted GTF attributes: ['gene_id', 'Dbxref', 'ID', 'Name', 'gbkey', 'gene', 'gene_biotype', 'transcript_id', 'Parent', 'model_evidence', 'original_biotype', 'product', 'description', 'partial', 'Note', 'exception', 'inference', 'end_range', 'start_range', 'gene_synonym', 'protein_id', 'tag', 'pseudo', 'The', 'transl_except', 'anticodon', 'standard_name', 'non-AUG', 'codons', '12S', '16S', 'transl_table', 'ATPase', 'isoform', 'similar', 'exon_number', 'number']
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")

File /mnt/data/miniconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:292, in read_gtf(filepath_or_buffer, expand_attribute_column, infer_biotype_column, column_converters, usecols, features, result_type)
    289     result_df = result_df.select(valid_columns)
    291 if result_type == "pandas":
--> 292     result = result_df.to_pandas()
    293 elif result_type == "polars":
    294     result = result_df

File /mnt/data/miniconda3/lib/python3.9/site-packages/polars/internals/dataframe/frame.py:1962, in DataFrame.to_pandas(self, date_as_object, *args, **kwargs)
   1925 def to_pandas(
   1926     self, *args: Any, date_as_object: bool = False, **kwargs: Any
   1927 ) -> pd.DataFrame:
   1928     """
   1929     Cast to a pandas DataFrame.
   1930
   (...)
   1960
   1961     """
-> 1962     record_batches = self._df.to_pandas()
   1963     tbl = pa.Table.from_batches(record_batches)
   1964     return tbl.to_pandas(*args, date_as_object=date_as_object, **kwargs)

ModuleNotFoundError: No module named 'pyarrow'
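
Until the dependency is declared, one way to fail early with a clearer message is to check for the modules before calling read_gtf with result_type="pandas". The following is only a minimal sketch, not part of gtfparse: the helper names missing_modules and check_pandas_conversion_deps are hypothetical, and the assumption that pandas and pyarrow are the two modules needed is taken from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_pandas_conversion_deps():
    """Raise ImportError with an install hint if the conversion dependencies are absent."""
    # Assumption: the pandas conversion path needs pandas and pyarrow,
    # as suggested by the traceback (polars' to_pandas imports pyarrow).
    missing = missing_modules(["pandas", "pyarrow"])
    if missing:
        raise ImportError(
            "Missing modules for result_type='pandas': "
            + ", ".join(missing)
            + ". Try: pip install "
            + " ".join(missing)
        )
```

Calling check_pandas_conversion_deps() before read_gtf(..., result_type="pandas") would surface the problem before the GTF file is parsed rather than after. Judging by the traceback, result_type="polars" returns the polars frame without calling to_pandas, so it may avoid the pyarrow import altogether.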

Fantastic package, by the way. I use it in quite a few different workflows. Please let me know if you have any outstanding ideas or issues you'd like help with; I'd be happy to contribute!

acrinklaw mentioned this issue Feb 10, 2023

Add pyarrow to dependencies list #33

Closed
