Code refactor to limit memory usage? #152
Comments
This is wonderful - I am fully supportive of this direction and would welcome your contribution. A couple of thoughts: …
How does this sound? To me, I imagine this would be a big improvement. Although a SQL novice, I do have a sense that this would improve speed and reduce required resources quite a bit. What do you think @niranjchandrasekaran? Any additional guidance for Stephen?
I second this.
I feel like this distinction is important while @sjfleming is thinking about refactoring the code. Currently, some of the queries are made at the image table level while others are made at the whole-database level, and it will be important to get that right.
Thanks for your comments! I will try to work on this in my fork of the repo. I'll plan to leave …
And yes, I was not completely sure where the different parts of the database actually get loaded, so this is a helpful explanation. I won't be super speedy, but with luck I might have an update in a few weeks :)
Sounds great @sjfleming - I look forward to seeing your contribution, potentially in a couple of weeks.
Okay, so digging into the code a bit more, I see that moving aggregation into SQL queries would require a very substantial refactor. Since I wanted to touch as little code as possible, I introduced what I hope is a compromise: one which makes the minimal changes needed to reduce memory usage substantially. I am still in the process of testing to make sure that these changes do in fact limit memory usage; I'm just going to try it out on a big dataset and monitor memory usage. I am not sure how I would include a pytest test case for this... it was not obvious to me.
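For context, one way such a minimal-change compromise could work is to aggregate the object data one image at a time instead of reading whole tables into memory at once. The sketch below is only an illustration of that idea, not the actual change in my branch, and the table and column names (Image, Cells, ImageNumber, Metadata_Well) are assumptions about a typical CellProfiler-style schema:

```python
import sqlite3
import pandas as pd

# Illustrative only: aggregate per image instead of loading the full database.
conn = sqlite3.connect("/path/to/backend.sqlite")

# Small table mapping each image to its well
image_numbers = pd.read_sql("SELECT ImageNumber, Metadata_Well FROM Image", conn)

partial_profiles = []
for image_number, well in image_numbers.itertuples(index=False):
    # Pull only one image's worth of single-cell rows at a time
    cells = pd.read_sql(
        "SELECT * FROM Cells WHERE ImageNumber = ?", conn, params=(image_number,)
    )
    profile = cells.median(numeric_only=True)  # per-image aggregate
    profile["Metadata_Well"] = well
    partial_profiles.append(profile)

# A final per-well step would then combine images that share a well
profiles = pd.DataFrame(partial_profiles)
```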
Awesome, thanks @sjfleming - let me know when I should dig deeper into the code. Looks like you were able to add one test, which is great.
Alright @gwaygenomics, I have made a few more changes, and now I think the code is ready for review. Thanks!
Closed by #156
Hi all,
I've been working with Carmen @carmendv from the Precision Cardiology Lab on using pycytominer to aggregate and normalize data which has been saved in an SQLite database format by running … Our strategy once the database has been created is then to run the following code snippet using pycytominer:
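(Sketch only: the class and argument names below, SingleCells, sql_file, aggregate_profiles, and normalize, are my approximation of the pycytominer API in cyto_utils/cells.py and may not match exactly what we run.)

```python
from pycytominer.cyto_utils.cells import SingleCells
from pycytominer import normalize

# Approximation of our usage; names may differ from the real API at c0d3e86
sc = SingleCells(sql_file="sqlite:////path/to/backend.sqlite")

# Median-aggregate single-cell features into per-well profiles
profiles = sc.aggregate_profiles()

# Normalize the aggregated profiles, e.g. standardize each feature
normalized = normalize(profiles=profiles, method="standardize")
```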
However, as we've discussed with @niranjchandrasekaran and @gwaygenomics, the first step of this operation seems to be to load the entire SQLite database into memory. It seems like this happens either here (pycytominer/pycytominer/cyto_utils/cells.py, line 147 in c0d3e86) or here (pycytominer/pycytominer/cyto_utils/cells.py, line 430 in c0d3e86, if load_image_data=False upon object instantiation).
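To make the memory cost concrete, my understanding is that the loading step boils down to the pattern sketched below. This is not the actual cells.py code, and the table names are assumptions; it just shows how reading whole tables means memory scales with the size of the database file:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("/path/to/backend.sqlite")

# Each call materializes an entire table as an in-memory dataframe
image_df = pd.read_sql("SELECT * FROM Image", conn)
cells_df = pd.read_sql("SELECT * FROM Cells", conn)
cytoplasm_df = pd.read_sql("SELECT * FROM Cytoplasm", conn)
nuclei_df = pd.read_sql("SELECT * FROM Nuclei", conn)
```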
These database files can be quite large, many tens or even hundreds of GB. At some point it becomes untenable, and prohibitively expensive, to have a machine with enough memory to handle this. This seems to be related to issue #142.
As I understand it, the current strategy is:
1. Load the entire SQLite database (image and compartment tables) into memory as pandas dataframes.
2. Fetch and format those dataframes into a single merged single-cell dataframe.
3. Pass that dataframe to aggregate(), which performs the per-well aggregation in pandas.
An alternative strategy I am proposing would be:
1. Keep the single-cell data in the SQLite database.
2. Perform the aggregation inside the database itself, via SQL queries (e.g. GROUP BY per well).
3. Load only the aggregated per-well profiles into memory.
This would have the advantage of truly making use of the database, and could really minimize the memory footprint.
I was thinking of something like this (https://www.sqlitetutorial.net/sqlite-group-by/), though I am no expert at SQL.
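As a rough sketch of what I have in mind (table and column names are assumptions about a CellProfiler-style schema, and this is not working pycytominer code), the aggregation could be pushed into SQLite so that only the aggregated rows ever come back into Python:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("/path/to/backend.sqlite")

# GROUP BY inside the database; Python only ever sees one row per well.
# SQLite has no built-in MEDIAN, so AVG is shown here; median aggregation
# would need a user-defined function or another approach.
query = """
SELECT
    Image.Metadata_Plate,
    Image.Metadata_Well,
    AVG(Cells.Cells_AreaShape_Area) AS Cells_AreaShape_Area
FROM Cells
JOIN Image ON Cells.ImageNumber = Image.ImageNumber
GROUP BY Image.Metadata_Plate, Image.Metadata_Well
"""

per_well_profiles = pd.read_sql(query, conn)
```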
Is there interest in this? Does this seem like a good direction for pycytominer?
I do think it would probably require a decent amount of refactoring, since the aggregate() function in aggregate.py takes a pandas dataframe as an input, and a lot of the code currently involves fetching and formatting that dataframe. My current thinking is that aggregate() would take the database connection as an input argument, rather than an in-memory dataframe.
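Purely to illustrate the kind of interface change I mean (this is a hypothetical sketch, not a proposed signature; the feature, strata, and operation parameters and the single-feature restriction are simplifications I made up for the example):

```python
import sqlite3
import pandas as pd


def aggregate(connection, compartment="Cells", feature="Cells_AreaShape_Area",
              strata=("Metadata_Plate", "Metadata_Well"), operation="AVG"):
    """Hypothetical sketch: aggregate inside the database instead of taking
    an in-memory dataframe. Names and defaults are illustrative only."""
    group_cols = ", ".join(f"Image.{col}" for col in strata)
    query = f"""
        SELECT {group_cols}, {operation}({compartment}.{feature}) AS {feature}
        FROM {compartment}
        JOIN Image ON {compartment}.ImageNumber = Image.ImageNumber
        GROUP BY {group_cols}
    """
    return pd.read_sql(query, connection)


# Usage sketch
conn = sqlite3.connect("/path/to/backend.sqlite")
per_well = aggregate(conn)
```

The key point is that the heavy lifting stays inside SQLite, and pandas only ever holds the per-well result.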