Conversion to Pandas in read_gtf has missing dependency #32

Open
acrinklaw opened this issue Feb 7, 2023 · 0 comments
acrinklaw commented Feb 7, 2023

I just installed gtfparse on an EC2 instance today and ran into an issue: pyarrow is required when calling read_gtf with result_type="pandas", but it is not declared as a dependency, so it is not pulled in during setup. Installing pyarrow manually fixes the issue.

In [5]: df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")
INFO:root:Extracted GTF attributes: ['gene_id', 'Dbxref', 'ID', 'Name', 'gbkey', 'gene', 'gene_biotype', 'transcript_id', 'Parent', 'model_evidence', 'original_biotype', 'product', 'description', 'partial', 'Note', 'exception', 'inference', 'end_range', 'start_range', 'gene_synonym', 'protein_id', 'tag', 'pseudo', 'The', 'transl_except', 'anticodon', 'standard_name', 'non-AUG', 'codons', '12S', '16S', 'transl_table', 'ATPase', 'isoform', 'similar', 'exon_number', 'number']
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")

File /mnt/data/miniconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:292, in read_gtf(filepath_or_buffer, expand_attribute_column, infer_biotype_column, column_converters, usecols, features, result_type)
    289     result_df = result_df.select(valid_columns)
    291 if result_type == "pandas":
--> 292     result = result_df.to_pandas()
    293 elif result_type == "polars":
    294     result = result_df

File /mnt/data/miniconda3/lib/python3.9/site-packages/polars/internals/dataframe/frame.py:1962, in DataFrame.to_pandas(self, date_as_object, *args, **kwargs)
   1925 def to_pandas(
   1926     self, *args: Any, date_as_object: bool = False, **kwargs: Any
   1927 ) -> pd.DataFrame:
   1928     """
   1929     Cast to a pandas DataFrame.
   1930
   (...)
   1960
   1961     """
-> 1962     record_batches = self._df.to_pandas()
   1963     tbl = pa.Table.from_batches(record_batches)
   1964     return tbl.to_pandas(*args, date_as_object=date_as_object, **kwargs)

ModuleNotFoundError: No module named 'pyarrow'
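
Until pyarrow is declared as a dependency, a pre-flight check along these lines could surface the problem before the GTF file is parsed rather than after. This is only a rough sketch: the helper names `missing_modules` and `require_modules` are hypothetical and not part of gtfparse, and the install hint assumes the pip package name matches the module name (true for pyarrow).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names):
    """Raise ModuleNotFoundError with an install hint if any module is missing."""
    missing = missing_modules(names)
    if missing:
        # Assumption: the pip package name is the same as the module name.
        raise ModuleNotFoundError(
            "Missing dependencies: " + ", ".join(missing)
            + ". Try: pip install " + " ".join(missing)
        )


# Intended use (not executed here, since it needs gtfparse and a GTF file):
#     require_modules(["pyarrow"])
#     df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")
```

Calling `require_modules(["pyarrow"])` first fails fast with a readable message, which matters for large files where the attribute extraction step runs before the conversion to pandas is attempted.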

Fantastic package, by the way. I use it in quite a few different workflows. Please let me know if you have any outstanding ideas or issues you want some help with; I'd be happy to contribute!
