Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update aggregate to use Dask dataframe #146

Open
gwaybio opened this issue May 31, 2021 · 2 comments
Open

Update aggregate to use Dask dataframe #146

gwaybio opened this issue May 31, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@gwaybio
Copy link
Member

gwaybio commented May 31, 2021

I'm experiencing what I believe (but am not 100% sure) are memory leaks when using pandas in aggregate.py. I think it is has to do with how I'm reusing the variable population_df several times. Plus, pandas has several documented memory leak issues and we've noticed at least one when using the recipe (see #142).

I am also running into long loading times in a separate project. Given that aggregate.py is essentially just taking the mean or median of all feature columns, it should be relatively straightforward to move to dask dataframe. This will also be a helpful switch in anticipation of pycytominer handling parquet files.

@gwaybio gwaybio added enhancement New feature or request high priority Needs immediate attention labels May 31, 2021
@gwaybio
Copy link
Member Author

gwaybio commented Jun 1, 2021

it should be relatively straightforward to move to dask dataframe.

This turned out not to be the case. I tested this in a toy example with two dataframes written to temp files. One of the dataframe had missing values, both had the same columns with mixed dtypes. When trying to compute the mean (simulating aggregate.py), I received this error:

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    665     def _set_axis(self, axis: int, labels: Index) -> None:
    666         labels = ensure_index(labels)
--> 667         self._mgr.set_axis(axis, labels)
    668         self._clear_item_cache()
    669 

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    218 
    219         if new_len != old_len:
--> 220             raise ValueError(
    221                 f"Length mismatch: Expected axis has {old_len} elements, new "
    222                 f"values have {new_len} elements"

This error results from the two files having different metadata columns.

I then did some digging and determined that for aggregating single cell output from CellProfiler, dask is not a straightfoward solution.

  • Dask requires that all CSV files have uniform structure.

It might still be worth adding an implementation to read files from multiple csv locations - worth investigating a bit further for the pooled cell painting project.

I also still need to figure out if aggregate is causing a memory leak, and to fix cyclical variable assignment

@gwaybio
Copy link
Member Author

gwaybio commented Jun 4, 2021

this appears to only be a problem for me 😂

some differences I can see between my implementation and Niranj or Beths is that I am using gzipped csv files with mtime=0. Perhaps it is reading these kinds of files specifically that is causing an issue 🤔

@gwaybio gwaybio removed the high priority Needs immediate attention label Jun 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant