Pycytominer fails when running aggregate from large SQLite tables #142
ouch, that's not great. Thanks for tracking this down and reporting! It looks like this error comes from
This is the total amount of tracing that I did, so I can't say for sure; my naive assumption is that, since load_compartment just returns its dataframe, it must be during that step, or otherwise adding the chunksize wouldn't have actually helped, but I don't have data to back that up.
I think this can be closed
@d33bs - any objections to closing this?
Thanks @gwaybio - really appreciated getting to read through these notes (thanks as well to @bethac07 for the great details here) in the context of some recent inspirations. I'm wondering if chunked DuckDB SQL queries to join metadata and compartment data would outperform pandas dataframe merges. I imagine recent updates have resolved some, but maybe not all, of the resource-based failures. Even if/when we don't experience failures/exceptions, I feel reducing the amount of time taken to perform these actions could be helpful for developers. That said, maybe we could close this issue and explore these aspects further within #198?
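For illustration only, here is a rough sketch of what that chunked DuckDB approach might look like (this is not pycytominer code; the `backend.sqlite` file name, the `cells`/`image` table names, and the `TableNumber`/`ImageNumber` join keys are assumptions based on a typical CellProfiler-style schema):

```python
import duckdb

# Attach the SQLite backend through DuckDB's sqlite extension so the join
# happens inside DuckDB rather than in pandas.
con = duckdb.connect()
con.execute("INSTALL sqlite; LOAD sqlite;")
con.execute("ATTACH 'backend.sqlite' AS backend (TYPE sqlite);")

query = """
    SELECT image.Metadata_Plate, image.Metadata_Well, cells.*
    FROM backend.cells AS cells
    JOIN backend.image AS image USING (TableNumber, ImageNumber)
"""
con.execute(query)

# Stream the joined result back in bounded pandas chunks instead of one
# giant merged dataframe; an empty chunk signals the result is exhausted.
while True:
    chunk = con.fetch_df_chunk()
    if len(chunk) == 0:
        break
    # ... aggregate / process each chunk here ...
```

Whether this actually beats the current pandas merge would need profiling on a realistically sized backend file.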
I was thinking this too (and possibly Beth was too) - I just wanted to check that everything is captured in that issue (which I think it is!)
Thanks @d33bs!
Reading in one table from an ~25-30GB SQLite file (so that table should have been 8-10GB total) ran completely through all 64 GB of memory on my machine, leading to the error below.

Happily, adding a chunksize parameter to cyto_utils/cells.py/load_compartment as below seems to avoid the worst of the issues (I did still see spikes into the 30s of GB on htop; not sure if that was during pandas.concat or during the actual aggregation itself). Reporting here as an issue rather than just making a PR because:
a) not sure if you think there is anything worth investigating - I'd say probably not, but your call - and
b) not sure if you want pycytominer to ALWAYS use chunk sizes (and if so, whether you want to use this particular chunk size); I can also imagine doing chunksize as a user-passed option, or having pycytominer try to assess database size in some way and then choose whether and how to chunk.
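As a reference for the change described above, a minimal sketch of a chunked read (assuming load_compartment issues a "select * from <compartment>" query through pandas against a SQLAlchemy engine; the table name and chunk size here are illustrative, not the values in cells.py):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///backend.sqlite")

# With chunksize set, read_sql returns an iterator of dataframes instead of
# parsing the whole table into memory in one shot.
chunks = pd.read_sql(
    sql="select * from cells",
    con=engine,
    chunksize=100_000,
)

# Concatenating still materializes the full table, but it avoids read_sql's
# peak memory usage from building the entire result at once.
cells_df = pd.concat(chunks, ignore_index=True)
```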
Working in the profiling conda env (python 3.7.1) on an Ubuntu 14 machine; pip freeze at the bottom.