Code refactor to limit memory usage? #152
Comments
This is wonderful - I am fully supportive of this direction and would welcome your contribution. A couple of thoughts: …
How does this sound? To me, I imagine this would be a big improvement. Although a SQL novice, I do have a sense that this would improve speed and reduce required resources quite a bit. What do you think @niranjchandrasekaran? Any additional guidance for Stephen?
I second this.
I feel like this distinction is important while @sjfleming is thinking about refactoring the code. Currently, some of the queries are made at the image table level while others are made at the whole-database level, and it will be important to get that right.
Thanks for your comments! I will try to work on this in my fork of the repo. I'll plan to leave …
And yes, I was not completely sure where the different parts of the database actually get loaded, so this is a helpful explanation. I won't be super speedy, but with luck I might have an update in a few weeks :)
Sounds great @sjfleming - I look forward to seeing your contribution, potentially in a couple of weeks.
Okay, so digging into the code a bit more, I see that moving aggregation into SQL queries would require a very substantial refactor. Since I wanted to touch as little code as possible, I introduced what I hope is a compromise: one which makes the minimal changes needed to reduce memory usage substantially. I am still in the process of testing to make sure that these changes do in fact limit memory usage; I'm just going to try it out on a big dataset and monitor memory usage. I am not sure how I would include a pytest test case for this... it was not obvious to me.
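For context, one way such a minimal-change compromise could work is to aggregate the object data one image at a time instead of reading whole tables into memory at once. The sketch below is only an illustration of that idea, not the actual change in my branch, and the table and column names (Image, Cells, ImageNumber, Metadata_Well) are assumptions about a typical CellProfiler-style schema:

```python
import sqlite3
import pandas as pd

# Illustrative only: aggregate per image instead of loading the full database.
conn = sqlite3.connect("/path/to/backend.sqlite")

# Small table mapping each image to its well
image_numbers = pd.read_sql("SELECT ImageNumber, Metadata_Well FROM Image", conn)

partial_profiles = []
for image_number, well in image_numbers.itertuples(index=False):
    # Pull only one image's worth of single-cell rows at a time
    cells = pd.read_sql(
        "SELECT * FROM Cells WHERE ImageNumber = ?", conn, params=(image_number,)
    )
    profile = cells.median(numeric_only=True)  # per-image aggregate
    profile["Metadata_Well"] = well
    partial_profiles.append(profile)

# A final per-well step would then combine images that share a well
profiles = pd.DataFrame(partial_profiles)
```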
Awesome, thanks @sjfleming - let me know when I should dig deeper into the code. Looks like you were able to add one test, which is great.
Alright @gwaygenomics, I have made a few more changes, and now I think the code is ready for review. Thanks!
Closed by #156
Hi all,
I've been working with Carmen @carmendv from the Precision Cardiology Lab on using pycytominer to aggregate and normalize data which has been saved in an SQLite database format by running … Our strategy once the database has been created is then to run the following code snippet using pycytominer:
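(Sketch only: the class and argument names below, SingleCells, sql_file, aggregate_profiles, and normalize, are my approximation of the pycytominer API in cyto_utils/cells.py and may not match exactly what we run.)

```python
from pycytominer.cyto_utils.cells import SingleCells
from pycytominer import normalize

# Approximation of our usage; names may differ from the real API at c0d3e86
sc = SingleCells(sql_file="sqlite:////path/to/backend.sqlite")

# Median-aggregate single-cell features into per-well profiles
profiles = sc.aggregate_profiles()

# Normalize the aggregated profiles, e.g. standardize each feature
normalized = normalize(profiles=profiles, method="standardize")
```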
However, as we've discussed with @niranjchandrasekaran and @gwaygenomics, the first step of this operation seems to be to load the entire SQLite database into memory. It seems like this happens either here (pycytominer/pycytominer/cyto_utils/cells.py, line 147 in c0d3e86) or here (pycytominer/pycytominer/cyto_utils/cells.py, line 430 in c0d3e86, if load_image_data=False upon object instantiation).
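To make the memory cost concrete, my understanding is that the loading step boils down to the pattern sketched below. This is not the actual cells.py code, and the table names are assumptions; it just shows how reading whole tables means memory scales with the size of the database file:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("/path/to/backend.sqlite")

# Each call materializes an entire table as an in-memory dataframe
image_df = pd.read_sql("SELECT * FROM Image", conn)
cells_df = pd.read_sql("SELECT * FROM Cells", conn)
cytoplasm_df = pd.read_sql("SELECT * FROM Cytoplasm", conn)
nuclei_df = pd.read_sql("SELECT * FROM Nuclei", conn)
```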
These database files can be quite large, many tens or even hundreds of GB. At some point it becomes untenable, and prohibitively expensive, to have a machine with enough memory to handle this. This seems to be related to issue #142.
As I understand it, the current strategy is:
1. Load the entire SQLite database (image and compartment tables) into memory as pandas dataframes.
2. Fetch and format those dataframes into a single merged single-cell dataframe.
3. Pass that dataframe to aggregate(), which performs the per-well aggregation in pandas.
An alternative strategy I am proposing would be:
1. Keep the single-cell data in the SQLite database.
2. Perform the aggregation inside the database itself, via SQL queries (e.g. GROUP BY per well).
3. Load only the aggregated per-well profiles into memory.
This would have the advantage of truly making use of the database, and could really minimize the memory footprint.
I was thinking of something like this (https://www.sqlitetutorial.net/sqlite-group-by/), though I am no expert at SQL.
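As a rough sketch of what I have in mind (table and column names are assumptions about a CellProfiler-style schema, and this is not working pycytominer code), the aggregation could be pushed into SQLite so that only the aggregated rows ever come back into Python:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("/path/to/backend.sqlite")

# GROUP BY inside the database; Python only ever sees one row per well.
# SQLite has no built-in MEDIAN, so AVG is shown here; median aggregation
# would need a user-defined function or another approach.
query = """
SELECT
    Image.Metadata_Plate,
    Image.Metadata_Well,
    AVG(Cells.Cells_AreaShape_Area) AS Cells_AreaShape_Area
FROM Cells
JOIN Image ON Cells.ImageNumber = Image.ImageNumber
GROUP BY Image.Metadata_Plate, Image.Metadata_Well
"""

per_well_profiles = pd.read_sql(query, conn)
```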
Is there interest in this? Does this seem like a good direction for pycytominer?
I do think it would probably require a decent amount of refactoring, since the aggregate() function in aggregate.py takes a pandas dataframe as an input, and a lot of the code currently involves fetching and formatting that dataframe. My current thinking is that aggregate() would take the database connection as an input argument, rather than an in-memory dataframe.
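Purely to illustrate the kind of interface change I mean (this is a hypothetical sketch, not a proposed signature; the feature, strata, and operation parameters and the single-feature restriction are simplifications I made up for the example):

```python
import sqlite3
import pandas as pd


def aggregate(connection, compartment="Cells", feature="Cells_AreaShape_Area",
              strata=("Metadata_Plate", "Metadata_Well"), operation="AVG"):
    """Hypothetical sketch: aggregate inside the database instead of taking
    an in-memory dataframe. Names and defaults are illustrative only."""
    group_cols = ", ".join(f"Image.{col}" for col in strata)
    query = f"""
        SELECT {group_cols}, {operation}({compartment}.{feature}) AS {feature}
        FROM {compartment}
        JOIN Image ON {compartment}.ImageNumber = Image.ImageNumber
        GROUP BY {group_cols}
    """
    return pd.read_sql(query, connection)


# Usage sketch
conn = sqlite3.connect("/path/to/backend.sqlite")
per_well = aggregate(conn)
```

The key point is that the heavy lifting stays inside SQLite, and pandas only ever holds the per-well result.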