Update aggregate to use Dask dataframe #146

gwaybio · 2021-05-31T22:41:19Z

I'm experiencing what I believe (but am not 100% sure) are memory leaks when using pandas in aggregate.py. I think it has to do with how I'm reusing the variable population_df several times. Plus, pandas has several documented memory leak issues, and we've noticed at least one when using the recipe (see #142).

I am also running into long loading times in a separate project. Given that aggregate.py is essentially just taking the mean or median of all feature columns, it should be relatively straightforward to move to dask dataframe. This will also be a helpful switch in anticipation of pycytominer handling parquet files.
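For reference, the operation being discussed can be sketched in plain pandas as below. This is a minimal illustration, not the actual aggregate.py code: the function name `aggregate_profiles`, the `Metadata_` prefix convention, and the toy column names are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical single-cell table: two metadata columns and two feature columns.
single_cells = pd.DataFrame(
    {
        "Metadata_Plate": ["P1", "P1", "P1", "P1"],
        "Metadata_Well": ["A01", "A01", "A02", "A02"],
        "Cells_AreaShape_Area": [100.0, 120.0, 80.0, 90.0],
        "Nuclei_Intensity_Mean": [0.2, 0.4, 0.5, 0.7],
    }
)


def aggregate_profiles(df, strata, operation="median"):
    """Collapse rows to one profile per stratum using the mean or median.

    Assumes (for this sketch only) that every column not starting with
    "Metadata_" is a numeric feature column.
    """
    if operation not in ("mean", "median"):
        raise ValueError(f"Unsupported operation: {operation}")
    features = [col for col in df.columns if not col.startswith("Metadata_")]
    grouped = df.groupby(strata)[features]
    result = grouped.mean() if operation == "mean" else grouped.median()
    # Return a new frame instead of rebinding the input variable.
    return result.reset_index()


well_profiles = aggregate_profiles(
    single_cells, ["Metadata_Plate", "Metadata_Well"], operation="mean"
)
```

Returning a new dataframe under a new name, rather than reassigning the input, is one way to avoid the repeated reuse of a single variable mentioned above. Whether that reuse is actually responsible for the suspected leak is not established here.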

gwaybio · 2021-06-01T15:06:17Z

> it should be relatively straightforward to move to dask dataframe.

This turned out not to be the case. I tested this in a toy example with two dataframes written to temp files. One of the dataframes had missing values; both had the same columns with mixed dtypes. When trying to compute the mean (simulating aggregate.py), I received this error:

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    665     def _set_axis(self, axis: int, labels: Index) -> None:
    666         labels = ensure_index(labels)
--> 667         self._mgr.set_axis(axis, labels)
    668         self._clear_item_cache()
    669 

~/miniconda3/envs/pycytominer-test/lib/python3.8/site-packages/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    218 
    219         if new_len != old_len:
--> 220             raise ValueError(
    221                 f"Length mismatch: Expected axis has {old_len} elements, new "
    222                 f"values have {new_len} elements"

This error results from the two files having different metadata columns.

I then did some digging and determined that, for aggregating single cell output from CellProfiler, dask is not a straightforward solution.

Dask requires that all CSV files have uniform structure.
- This is not guaranteed for CellProfiler output. Missing values in otherwise int columns, different column order across files, and metadata with different dtypes are all relatively common.
- See dask/dask#2752 ("ValueError: Length mismatch with added/missing columns"). It appears that d6tstack might be a useful intermediate step, but it might involve data redundancy after organization.
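One way to find out up front whether a set of CSV files has the uniform structure described above is to compare their header rows before handing them to dask. The sketch below uses only the standard library. The helper names `read_header` and `find_header_mismatches` and the toy file contents are hypothetical, and the check covers column names and order only, not dtypes or missing values.

```python
import csv
import tempfile
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def find_header_mismatches(paths):
    """Compare every file's header against the first file's header.

    Returns a dict keyed by file path for each file whose header differs.
    """
    paths = list(paths)
    reference = read_header(paths[0])
    mismatches = {}
    for path in paths[1:]:
        header = read_header(path)
        if header != reference:
            mismatches[str(path)] = {
                "missing": sorted(set(reference) - set(header)),
                "extra": sorted(set(header) - set(reference)),
                # True when the same columns appear in a different order.
                "reordered": set(header) == set(reference),
            }
    return mismatches


# Toy files: the second one lacks a metadata column present in the first.
tmp_dir = Path(tempfile.mkdtemp())
first = tmp_dir / "plate_a.csv"
second = tmp_dir / "plate_b.csv"
first.write_text("Metadata_Well,Metadata_Site,Cells_Area\nA01,1,100\n")
second.write_text("Metadata_Well,Cells_Area\nA02,90\n")

report = find_header_mismatches([first, second])
```

An empty report would suggest the files agree on column names and order. A non-empty one points at the files that would need harmonizing first.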

It might still be worth adding an implementation to read files from multiple CSV locations. This is worth investigating a bit further for the pooled cell painting project.

I also still need to figure out if aggregate is causing a memory leak, and to fix the cyclical variable assignment.

gwaybio · 2021-06-04T17:02:15Z

this appears to only be a problem for me 😂

One difference I can see between my implementation and Niranj's or Beth's is that I am using gzipped CSV files with mtime=0. Perhaps it is reading these kinds of files specifically that is causing an issue 🤔
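For context on what "gzipped with mtime=0" means: the gzip header carries a modification timestamp, and setting it to zero makes the compressed bytes reproducible across runs. The comment does not say how the files were written, so the snippet below is only an assumed illustration using the standard library. The helper `gzip_bytes` is hypothetical, and the snippet does not reproduce the suspected reading problem.

```python
import gzip
import io


def gzip_bytes(payload, mtime=None):
    """Compress bytes in memory, optionally fixing the header timestamp."""
    buffer = io.BytesIO()
    with gzip.GzipFile(filename="", mode="wb", fileobj=buffer, mtime=mtime) as handle:
        handle.write(payload)
    return buffer.getvalue()


csv_payload = b"Metadata_Well,Cells_Area\nA01,100\n"

# With mtime=0 the timestamp field is zeroed, so repeated runs give
# byte-identical archives.
first = gzip_bytes(csv_payload, mtime=0)
second = gzip_bytes(csv_payload, mtime=0)
```

The timestamp is stored in bytes 4 to 7 of the gzip header, so inspecting `first[4:8]` is a quick way to check whether a given file was written this way.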

gwaybio added the enhancement and high priority labels on May 31, 2021

gwaybio removed the high priority label on Jun 4, 2021
