Update aggregate to use Dask dataframe #146
Comments
This turned out not to be the case. I tested this in a toy example with two dataframes written to temp files. One of the dataframes had missing values; both had the same columns with mixed dtypes. When trying to compute the mean (simulating aggregate.py), I received this error:

```
~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    665     def _set_axis(self, axis: int, labels: Index) -> None:
    666         labels = ensure_index(labels)
--> 667         self._mgr.set_axis(axis, labels)
    668         self._clear_item_cache()
    669

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    218
    219         if new_len != old_len:
--> 220             raise ValueError(
    221                 f"Length mismatch: Expected axis has {old_len} elements, new "
    222                 f"values have {new_len} elements"
```

This error results from the two files having different metadata columns. I then did some digging and determined that, for aggregating single-cell output from CellProfiler, dask is not a straightforward solution.
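For reference, a rough reconstruction of the toy setup described above, with hypothetical file and column names. In the environment quoted above this kind of schema mismatch surfaced as the Length mismatch ValueError; the exact exception can vary across dask/pandas versions.

```python
import os
import tempfile

import dask.dataframe as dd
import pandas as pd

# Two CSVs in a temp directory whose metadata columns do not match
# (the situation identified as the cause of the error above).
# Column names and values are made up.
tmpdir = tempfile.mkdtemp()

pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02"],
        "Cells_Area": [10.0, None],  # one file contains missing values
        "Cells_Intensity": [1.5, 2.5],
    }
).to_csv(os.path.join(tmpdir, "plate_a.csv"), index=False)

pd.DataFrame(
    {
        "Metadata_Well": ["B01", "B02"],
        "Metadata_Site": [1, 2],  # extra metadata column -> schema mismatch
        "Cells_Area": [12.0, 14.0],
        "Cells_Intensity": [3.5, 4.5],
    }
).to_csv(os.path.join(tmpdir, "plate_b.csv"), index=False)

# dask infers the schema from a sample of the first file only, so the second
# file violates that schema; computing the mean (as aggregate.py would) fails
# when the partitions are actually read.
ddf = dd.read_csv(os.path.join(tmpdir, "plate_*.csv"))
ddf[["Cells_Area", "Cells_Intensity"]].mean().compute()
```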
It might still be worth adding an implementation to read files from multiple csv locations; that is worth investigating a bit further for the pooled cell painting project. I also still need to figure out whether aggregate is causing a memory leak, and to fix the cyclical variable assignment.
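As a hedged sketch of what a multi-location reader might look like (the paths and column names here are made up, not an existing pycytominer API), restricting the read to a shared set of columns would sidestep the per-file metadata differences described above:

```python
import dask.dataframe as dd

# Hypothetical locations of single-cell CSVs from different batches/plates
csv_paths = [
    "batch_1/plate_a/Cells.csv.gz",
    "batch_2/plate_b/Cells.csv.gz",
]

# Columns expected in every file; usecols keeps dask's inferred schema consistent
shared_columns = ["Metadata_Well", "Cells_Area", "Cells_Intensity"]

ddf = dd.read_csv(
    csv_paths,
    usecols=shared_columns,
    compression="gzip",
    blocksize=None,            # gzip files are not splittable: one partition per file
    include_path_column=True,  # record which input file each row came from
)

profiles = (
    ddf.groupby("Metadata_Well")[["Cells_Area", "Cells_Intensity"]]
    .mean()
    .compute()
)
```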
This appears to only be a problem for me 😂. Some differences I can see between my implementation and Niranj's or Beth's are that I am using gzipped csv files with …
I'm experiencing what I believe (but am not 100% sure) are memory leaks when using pandas in aggregate.py. I think it has to do with how I'm reusing the variable population_df several times. Plus, pandas has several documented memory leak issues, and we've noticed at least one when using the recipe (see #142). I am also running into long loading times in a separate project.

Given that aggregate.py is essentially just taking the mean or median of all feature columns, it should be relatively straightforward to move to a dask dataframe. This will also be a helpful switch in anticipation of pycytominer handling parquet files.
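As a rough sketch of the proposal (not pycytominer's actual implementation; the file pattern and column names are hypothetical), the core of the aggregation expressed with a dask dataframe could look like:

```python
import dask.dataframe as dd

def dask_aggregate(sc_csv_glob, strata="Metadata_Well"):
    """Aggregate single-cell features per stratum using a dask dataframe."""
    # assume_missing reads integer columns as floats so missing values are allowed
    population_df = dd.read_csv(sc_csv_glob, assume_missing=True)

    # Treat every non-metadata column as a feature to be aggregated
    feature_cols = [c for c in population_df.columns if not c.startswith("Metadata_")]

    # Mean is a simple distributed reduction; medians are harder in dask because
    # they cannot be built from per-partition partial results alone.
    aggregated = population_df.groupby(strata)[feature_cols].mean()

    return aggregated.compute().reset_index()

# Example usage (hypothetical path):
# profiles = dask_aggregate("plate_1/Cells_*.csv")
```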