cells.aggregate_compartment() queries the database in chunks, reducing memory #156
Conversation
I performed some tests using a 37GB SQLite database file. I realized that, when it comes to memory usage, more is not better. I had expected that using more of the available memory would lead to faster aggregation, but in fact the opposite happens. I suppose the whole task is I/O bound, and the actual aggregation in pandas is very, very fast, so putting more in memory at once does not help. Test cases were carried out by specifying different values of n_strata.

Results of tests (runtime in minutes:seconds):
Based on these results, I am going to remove the auto-scaling of n_strata.
I have checked the final outputs of aggregation on a 37GB test database, and they seem to be identical.
Really an important contribution @sjfleming - thank you! I have made several suggestions and added many points of discussion that we'll need to talk through before merging.
@niranjchandrasekaran is also planning on taking a look at the whole PR, and I've tagged him at two specific places where I think an extra careful eye is needed.
My main comment is that I'd really like to understand more about _sqlite_strata_conditions() and test it slightly more thoroughly. I'd also like to modify some variable names and docstrings, and I have a couple of one-off suggestions/comments.
I think we are close to merging!
@@ -62,7 +62,7 @@ def aggregate(
     # Subset the data to specified samples
     if isinstance(subset_data_df, pd.DataFrame):
         population_df = subset_data_df.merge(
-            population_df, how="left", on=subset_data_df.columns.tolist()
+            population_df, how="inner", on=subset_data_df.columns.tolist()
I don't believe this alters anything, but I am curious why you made this change?
Great question! Without this change, my chunked aggregation strategy fails for the following reason: for each chunk, I call pycytominer.aggregate(), and I create one aggregated data frame per chunk, which I then concatenate at the end into the final output. If the aggregation per chunk occurs with a "left" merge, then all the rows of subset_data_df that are not part of this chunk will get NaN entries. For example:
# using "left" merge for a chunk containing only well A01
well image value
A01 1 1.543
A01 2 2.442
A02 1 NaN
# using "inner" merge for a chunk containing only well A01
well image value
A01 1 1.543
A01 2 2.442
and then after the final concatenation step
# using "left" merge
well image value
A01 1 1.543
A01 2 2.442
A02 1 NaN
A02 1 0.234
# using "inner" merge
well image value
A01 1 1.543
A01 2 2.442
A01 2 2.442
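Here's the same thing as a small self-contained pandas sketch (the well/image/value columns and numbers are just illustrative, mirroring the toy example above; this is not the actual test data):

```python
import pandas as pd

# The strata requested for aggregation (spans wells A01 and A02)
subset_data_df = pd.DataFrame({"well": ["A01", "A01", "A02"], "image": [1, 2, 1]})

# A single chunk pulled from the database, containing only well A01
chunk_df = pd.DataFrame(
    {"well": ["A01", "A01"], "image": [1, 2], "value": [1.543, 2.442]}
)

# "left" keeps every row of subset_data_df, so well A02 appears with a NaN value
print(subset_data_df.merge(chunk_df, how="left", on=["well", "image"]))
#   well  image  value
# 0  A01      1  1.543
# 1  A01      2  2.442
# 2  A02      1    NaN

# "inner" keeps only the strata actually present in this chunk
print(subset_data_df.merge(chunk_df, how="inner", on=["well", "image"]))
#   well  image  value
# 0  A01      1  1.543
# 1  A01      2  2.442
```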
gotcha! This makes total sense. Thanks for the explanation.
pycytominer/cyto_utils/cells.py
Outdated
@@ -401,6 +401,7 @@ def aggregate_compartment(
         compartment,
         compute_subsample=False,
         compute_counts=False,
+        n_strata=1,
can you name this something more descriptive? How about n_aggregation_memory_strata or something similar? n_strata does not make it obvious what the argument actually does.
Yes, I'm happy to re-think this name. I also wasn't very happy with it.
    Whether or not to compute the number of objects in each compartment
    and the number of fields of view per well.
n_strata : int, default 1
    Number of unique strata to pull from the database into working memory
is strata the right word? Maybe chunk? I don't know.
What it really is, is the number of unique aggregation strata being pulled into one chunk. For example, if we are aggregating by "well", then n_strata=1 means that one "well" will be pulled from the SQLite database into memory at a time.
this is perfect. Can you add your example to the docstring? I think this makes the description super clear. I also love the new variable name!
Added the example
pycytominer/cyto_utils/cells.py
Outdated
    and the number of fields of view per well.
n_strata : int, default 1
    Number of unique strata to pull from the database into working memory
    at once. A larger number does not lead to faster compute times
very helpful hint! Do you know of any specific guidance one might use? If so, can you add it?
Also, should it be "a larger number does not necessarily lead to faster compute times"?
According to my (limited) experiments, a larger number actually leads to slower compute times (see my comment in the PR above). It seems the best thing to do is probably always to have n_strata=1
pycytominer/cyto_utils/cells.py
Outdated
Parameters
----------
df : pandas.core.frame.DataFrame
Is this a compartment data frame? Is this merged with image_df? Can you be more specific in the docstring about what columns to expect in this dataframe?
This is a very strange dataframe that is neither of the above. It is a new dataframe constructed with the sole purpose of figuring out how to divide the compartment table into chunks.
The rows represent unique self.strata in the data. The columns are self.merge_cols. The values tell which values of merge_cols are contained in each of the strata, i.e., how to chunk the data.
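As a hypothetical sketch (the strata and merge_cols names here are made up, and the groupby/agg recipe is an approximation rather than the exact code in this PR), such a dataframe could be built like this:

```python
import pandas as pd

# Toy image table; the strata and merge_cols values are made up for illustration
image_df = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P1"],
    "Metadata_Well": ["A01", "A01", "A02"],
    "TableNumber": [1, 1, 1],
    "ImageNumber": [1, 2, 3],
})
strata = ["Metadata_Plate", "Metadata_Well"]
merge_cols = ["TableNumber", "ImageNumber"]

# One row per unique stratum; each cell holds the merge_cols values found in that stratum
df_unique_mergecols = (
    image_df[strata + merge_cols]
    .groupby(strata)[merge_cols]
    .agg(lambda s: sorted(s.unique().tolist()))
    .reset_index(drop=True)
)
print(df_unique_mergecols)
#   TableNumber ImageNumber
# 0         [1]      [1, 2]
# 1         [1]         [3]
```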
pycytominer/cyto_utils/cells.py
Outdated
# Translate the first strata combination into a SQLite condition string
first_stratum_condition = _sqlite_strata_conditions(
    df=df_unique_mergecols.head(1),
it took me a while to figure out that df_unique_mergecols is defined out of scope of this function. Should you include it as a function argument in _sqlite_strata_conditions?
df_unique_mergecols is in scope of _compartment_df_generator(), but out of scope of _sqlite_strata_conditions(), and not used in _sqlite_strata_conditions().
Maybe I made a strange choice by nesting _sqlite_strata_conditions() inside _compartment_df_generator()? The only reason I did that was because I was pretty sure no other function would ever call it.
pycytominer/cyto_utils/cells.py
Outdated
    return grouped_conditions


# Translate the first strata combination into a SQLite condition string
first_stratum_condition = _sqlite_strata_conditions(
where are you using first_stratum_condition?
I am not. Good catch! This is an old thing I didn't entirely delete.
pycytominer/cyto_utils/cells.py
Outdated
    .reset_index(drop=True)
)


def _sqlite_strata_conditions(df, dtypes, n=1):
Based on my read, this function is the most critical contribution in this PR (cc @niranjchandrasekaran). It's probably because I'm not an expert in generator functions and how they're used, but I am having some trouble following the logic.
I make a couple comments in this function to try to help my understanding, but if you can see where my knowledge gap is, please let me know. Right now, I think this function will be difficult to maintain unless we provide a couple more tests.
I agree that there should be some dedicated tests to demonstrate the purpose of this function
    output_file=compress_file,
    compute_subsample=True,
    compression_options={"method": "gzip"},
    n_strata=1,  # this will force multiple queries from each compartment
can you run this test multiple times looping through different n_strata, e.g.:

for n_strata in [1, 2, 3]:
    ...
    aggregate_profiles(n_strata=n_strata)

and confirm that the expected result doesn't change?
can you also add a separate test where you set n_strata to an invalid argument?
Yes this is a good idea in terms of testing multiple values for n_strata.
We can test invalid n_strata values... but the only truly invalid values should be < 1, and there is an explicit assertion for that. But we can still include a test.
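A rough sketch of what those tests could look like (the single_cells fixture is a hypothetical stand-in for however the test suite builds its SingleCells object from the test SQLite file, and I'm assuming the n_strata check is a plain assert):

```python
import pandas as pd
import pytest


@pytest.fixture
def single_cells():
    # Placeholder: build the SingleCells object from the test SQLite file,
    # the same way the existing tests do; not shown here.
    raise NotImplementedError


@pytest.mark.parametrize("n_strata", [2, 3])
def test_aggregate_profiles_chunking_is_invariant(single_cells, n_strata):
    # The aggregated output should not depend on how many strata are pulled per chunk
    reference = single_cells.aggregate_profiles(n_strata=1)
    chunked = single_cells.aggregate_profiles(n_strata=n_strata)
    pd.testing.assert_frame_equal(reference, chunked)


def test_aggregate_profiles_invalid_n_strata(single_cells):
    # n_strata < 1 should trip the explicit assertion
    with pytest.raises(AssertionError):
        single_cells.aggregate_profiles(n_strata=0)
```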
I have made an attempt to address all your comments and make relevant changes, @gwaygenomics. I added a couple of tests to address a few comments as well.
really great to see this happening. I made a couple comments.
@niranjchandrasekaran - are you able to take a look? I think it's worth having your eyes on the full PR, but I did note particular focus areas.
pycytominer/cyto_utils/cells.py
Outdated
)
dtype_dict = dict(
    zip(
        [s[7:-1] for s in compartment_dtypes.columns],  # strip typeof( )
is it possible to create a different request without typeof() to get the actual column names in the proper order? I'd be in favor of two queries to improve readability/maintainability rather than hardcoding something that is also difficult to read (even if it is robust).
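For example, a two-query sketch along these lines (the connection, table name, and variable names are placeholders, not code from this PR):

```python
import sqlite3

con = sqlite3.connect("single_cells.sqlite")  # placeholder database path
compartment = "cells"                          # placeholder compartment table name

# Query 1: column names, in their proper order (PRAGMA table_info returns one
# row per column: cid, name, type, notnull, dflt_value, pk)
col_names = [row[1] for row in con.execute(f"PRAGMA table_info({compartment})")]

# Query 2: the storage class actually used, taken from one representative row
typeof_cols = ", ".join(f"typeof({col})" for col in col_names)
storage_classes = con.execute(
    f"select {typeof_cols} from {compartment} limit 1"
).fetchone()

# No slicing of 'typeof(...)' column labels required
dtype_dict = dict(zip(col_names, storage_classes))
```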
@@ -846,3 +775,67 @@ def aggregate_profiles(
         )
     else:
         return aggregated
+
+
+def _sqlite_strata_conditions(df, dtypes, n=1):
should this be a separately defined function? When one imports SingleCells, will it automatically know this function definition? I thought that you had it as a private method of the class before? Any reason you decided to make it separate?
When one imports SingleCells, yes, a SingleCells object will automatically be able to see that function. I did have it as a private method of the class before. I can move it back if we want. The only reason I moved it out is that I realized it doesn't actually rely on anything from SingleCells, and it was just a tiny bit easier to import it for testing as a separate function. Though it could still be imported for testing either way.
----------
df : pandas.core.frame.DataFrame
    A dataframe where columns are merge_cols and rows represent
    unique aggregation strata of the compartment table
Suggested change: "unique aggregation strata of the compartment table" → "the specific image IDs grouped by unique aggregation strata (e.g. Metadata_well) of the image table"
Is this more accurate?
Yes, that's right. Only I think it turns out to be whatever is in merge_cols, so "ImageNumber" and "TableNumber".
dtypes : Dict[str, str]
    Dictionary to look up SQLite datatype based on column name
n : int
    Number of rows of the input df to combine in each output
this is really great documentation - very appreciated
Returns
-------
grouped_conditions : List[str]
    A list of strings, each string being a valid SQLite conditional
can you add a couple words describing how this is meant to be used?
"being a valid SQLite conditional to facilitate chunked aggregation?" 🤷
Haha yeah, I can add a bit. Really what I mean is that it will return a list of strings (called conditions), where each of the strings can be used in an SQL statement like this:

f"select {cols} from {compartment} where {condition}"

Here condition is one element of the list of strings grouped_conditions, and the above SQL statement will grab a single chunk for aggregation. So by iterating over the conditions in grouped_conditions and making SQL queries, we get the whole table back in chunks:

for condition in grouped_conditions:
    do_sql_query(f"select {cols} from {compartment} where {condition}")
@@ -6,7 +6,7 @@
 from sqlalchemy import create_engine
 
 from pycytominer import aggregate, normalize
-from pycytominer.cyto_utils.cells import SingleCells
+from pycytominer.cyto_utils.cells import SingleCells, _sqlite_strata_conditions
ah, I see, you're needing to import this as a new function. I typically think of _single_leading_underscore as denoting private methods, and the leading underscore is defined as an "internal use" indicator in pep8. I'm wondering if I'm missing any critical insight that led you to separate out this method (this is kind of a duplicate of a previous comment).
No real insight, no. Yeah I just did it that way so I could easily import it to test it here. And the leading underscore should denote to a user that it's private and never to be used directly for anything.
]
out1 = _sqlite_strata_conditions(
    df=df,
    dtypes={"TableNumber": "integer", "ImageNumber": "integer"},
will the dtype ever be a string?
oh, "text" - I see below, gotcha!
Yeah, it seems like the dtypes for SQL are null, real, text, integer, and blob. But the TableNumber and ImageNumber values, I think, are in practice always of integer type, though I am not 100% sure.
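For illustration, a text storage class would only change how the literals get quoted in the generated condition (the values below are made up):

```python
# Made-up values showing how the two storage classes format into a condition
text_values = ["abc123", "def456"]   # a hypothetical text-typed TableNumber
int_values = [1, 2]                  # a hypothetical integer-typed ImageNumber

condition = (
    f"TableNumber in ({', '.join(repr(v) for v in text_values)})"
    f" and ImageNumber in ({', '.join(str(v) for v in int_values)})"
)
print(condition)
# TableNumber in ('abc123', 'def456') and ImageNumber in (1, 2)
```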
a quick note @sjfleming - can you also add your name to the software citation section in this PR? Thanks!
…tment Introduce a generator function to obtain data from a compartment table in chunks of whole strata. These chunks can be aggregated separately, and the results combined. Memory efficient.
(force-pushed from bc388a8 to 003812a)
I rebased on top of the current master branch to get the latest changes just now.
Thanks @gwaygenomics, I took a guess about where to put myself in that author list, please do correct me if it's the wrong place :)
Co-authored-by: Greg Way <[email protected]>
LGTM! I don't have any additional comments.
# Obtain all valid strata combinations, and their merge_cols values
df_unique_mergecols = (
    self.image_df[self.strata + self.merge_cols]
I think self.strata and self.merge_cols will always be in self.image_df and they will be unique. I'm trying to imagine when the two lists will not be unique. We could aggregate at the image-level, but for that we will use Metadata_Site in self.strata and not ImageNumber. Other than that, I can't think of an overlap between the two.
Looks great @sjfleming - I think we can merge this in.
One last comment (sorry I should have mentioned this sooner!) but can you make sure to run black on the files you've touched?
@gwaygenomics Great! Yes, I have already run black.
Running final tests now!
Codecov Report
@@            Coverage Diff             @@
##           master     #156      +/-   ##
==========================================
+ Coverage   98.04%   98.09%   +0.04%
==========================================
  Files          49       49
  Lines        2199     2253      +54
==========================================
+ Hits         2156     2210      +54
  Misses         43       43
thanks again Stephen!
Description
These changes were made directly in response to #152, and may impact #142.
The intent is to limit memory usage during pycytominer.cyto_utils.cells.aggregate_compartment() calls, and other functions which call this method, including pycytominer.cyto_utils.cells.aggregate_profiles(). One explicit stipulation was to touch pycytominer.aggregate.aggregate() as little as possible. (I did have to make one tiny change.)

What is the nature of your change?
Checklist