Reduce memory consumption of SingleCells.merge_single_cells #234

d33bs · 2022-10-06T21:38:11Z

Description

These changes seek to address #233 memory challenges when using SingleCells.merge_single_cells by segmenting Pandas merges using dynamic argument chunksize and related dataframe concatenation within the method. In brief testing using small datasets with these changes I saw around 12-18% reduction in peak memory use (using chunksize default).

Using this segmented merge approach we're making a tradeoff between memory and time; in large data scenarios we may avoid out-of-memory issues and the procedure may take longer due to more (but smaller) merges being required. chunksize is an optional argument which when left to it's default of None will automatically be set to 1/3rd of the first compartment dataframe rowlength. Performance may vary with chunksize contingent on the dataset and is the reasoning for this being an argument which may be optionally set by a developer.

Caveats:

Leveraging existing suffix column naming within the merges along with the increase in number of merges increases the number of warnings emitted during testing. (FutureWarning: Passing 'suffixes' which cause duplicate columns {'ObjectNumber_cytoplasm'} in the result is deprecated and will raise a MergeError in a future version.).
Testing via test_cells.py needed updates in order for the changes within cells.py to pass pytesting. With these changes I focused on ensuring the data itself was as expected. I'd ask for double-checks here especially to make sure everything looks good.
Performance will likely vary based on the resources of the system running pycytominer and the dataset involved. Using the chunksize default does not guarantee avoidance of out-of-memory issues and instead provides flexibility for developer implementation.

Thank you @axiomcura for the excellent writeup on the performance issue!

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings. (see caveats above)
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

gwaybio

Looks good! Please address my comments (focusing on the testing comments)

Also, would reducing to float32 here help a lot as well? This is within scope of the name of your PR

pycytominer/cyto_utils/cells.py

gwaybio · 2022-10-07T18:13:13Z

pycytominer/tests/test_cyto_utils/test_cells.py

+        "Metadata_ObjectNumber_cells",
+    ]
+    # check that we have the same data using same cols, sort and a reset index
+    pd.testing.assert_frame_equal(


this doesn't looked blacked to me, please confirm

I ran black on this, and --check doesn't seem to flag it (may be profile related). I'm seeing that there's a list end bumping up against the comment, I'll add in a newline there with new changes. Are you seeing anything else that looks stylistically out of place which I may change?

gwaybio · 2022-10-07T18:14:37Z

pycytominer/tests/test_cyto_utils/test_cells.py

@@ -493,8 +517,8 @@ def test_merge_single_cells_cytominer_database_test_file():
    # note: pd.DataFrame datatypes sometimes appear automatically changed on-read, so we cast
    # the result_file dataframe using the base dataframe's types.
    pd.testing.assert_frame_equal(
-        pd.read_csv(csv_path).astype(merged_sc.dtypes.to_dict()),
-        pd.read_csv(result_file).astype(merged_sc.dtypes.to_dict()),
+        pd.read_csv(csv_path).astype(merged_sc.dtypes.to_dict())[merged_sc.columns],


why is this necessary now?

Thank you for calling this out! For inner merges, Pandas "...preserve[s] the order of the left keys." (reference: pd.merge(..., how)). The column specification here is provided to account for the now differently ordered columns due to the "right" and "left" being swapped within the compartment merges. I also felt that accounting for exact column ordering over time may be unwieldy, so this may provide benefit towards development velocity. That said, we may need a change; do you think we should enforce strict column order within merge_single_cells to remove the need for specification here?

@gwaybio - just wanted to follow up here with some thoughts about merge_single_cells column order specification. If we make explicit column sorting a part of the method, we should be able to reduce the variability of data produced. Note: tests may still require column filtering (for example, if certain data do not exist in one dataframe vs another).

I wrote the following as an example of what we could do here:

import pandas as pd # create an example dataframe with mixed order columns df = pd.DataFrame( { "Image_Metadata_1": [0], "TableNumber": [0], "ImageNumber": [0], "Cytoplasm_Data_1": [0], "Nuclei_Data_1": [0], "Cells_Data_1": [0], "Mito_Data_1": [0], "Actin_Data_1": [0], "Image_Data_1": [0], } ) # print df columns as a python list, representing new order print("Initial order:") print(df.columns.tolist()) def custom_sort(value: str): """ A custom sort for Pycytominer merge_single_cells pd.dataframe columns """ # lowercase str which will be used for comparisons # to avoid any capitalization challenges value_lower = value.lower() # first sorted values (by list index) sort_first = ["tablenumber", "imagenumber"] # middle sort value sort_middle = "metadata" # sorted last (by list order enumeration) sort_later = [ "cells", "cytoplasm", "nuclei", "image", ] # if value is in the sort_first list # return the index from that list if value_lower in sort_first: return sort_first.index(value_lower) # if sort_middle is anywhere in value return # next index value after sort_first values elif sort_middle in value_lower: return len(sort_first) # if any sort_later are found as the first part of value # return enumerated index of sort_later value (starting from # relative len based on the above conditionals and lists) elif any(value_lower.startswith(val) for val in sort_later): for k, v in enumerate(sort_later, start=len(sort_first) + 1): if value_lower.startswith(v): return k # else we return the total length of all sort values return len(sort_first) + len(sort_later) + 1 # inner sorted alphabetizes any columns which may not be part of custom_sort # outer sort provides pycytominer-specific column sort order df = df[sorted(sorted(df.columns), key=custom_sort)] # print df columns as a python list, representing new order print("\nSorted order:") print(df.columns.tolist())

Which has printed output:

Initial order: ['Image_Metadata_1', 'TableNumber', 'ImageNumber', 'Cytoplasm_Data_1', 'Nuclei_Data_1', 'Cells_Data_1', 'Mito_Data_1', 'Actin_Data_1', 'Image_Data_1'] Sorted order: ['TableNumber', 'ImageNumber', 'Image_Metadata_1', 'Cells_Data_1', 'Cytoplasm_Data_1', 'Nuclei_Data_1', 'Image_Data_1', 'Actin_Data_1', 'Mito_Data_1']

gwaybio · 2022-10-07T18:16:20Z

pycytominer/tests/test_cyto_utils/test_cells.py

+    # check that we have the same data using same cols, sort and a reset index
+    pd.testing.assert_frame_equal(
+        left=manual_merge[assert_cols]
+        .sort_values(by=assert_cols, ascending=True)


I am a bit concerned that sorting is giving us undue confidence of equivalent results. Am I reading this wrong? How should I be thinking about this?

Can you add an explicit check that the object number is the same? Oh, but maybe this code block is doing exactly that (and more!)?

Great comment, thank you! The chunked merges made the data inconsistent with the existing tests. The check here now looks at columns "Metadata_ObjectNumber", "Metadata_ObjectNumber_cytoplasm", "Metadata_ObjectNumber_cells", sorted by the same columns in the same order for data equivalency and does include an Metadata_ObjectNumber check as a result.

Would users of merge_single_cells expect for sorted output by the same columns or similar? We could include this as an additional step within the method, which would likely avoid needing to do that manually within testing here.

pycytominer/cyto_utils/cells.py

Co-authored-by: Gregory Way <[email protected]>

d33bs · 2022-10-07T20:01:01Z

Thank you @gwaybio for the comments and feedback! I added comment + changes for individual thoughts and wanted to follow up here to your mention about float32. Converting from float64 (a default which I believe cascades from numpy to pandas) to float32 would very likely reduce the memory footprint of this method. Because this could mean data loss under some circumstances, would it be beneficial to make this an additional argument to allow development options?

gwaybio · 2022-10-07T20:13:34Z

Yes, good idea to make an optional/experimentally-untested argument

…

On Fri, Oct 7, 2022, 2:01 PM Dave Bunten ***@***.***> wrote: Thank you @gwaybio <https://github.com/gwaybio> for the comments and feedback! I added comment + changes for individual thoughts and wanted to follow up here to your mention about float32. Converting from float64 (a default which I believe cascades from numpy to pandas) to float32 would very likely reduce the memory footprint of this method. Because this could mean data loss under some circumstances, would it be beneficial to make this an additional argument to allow development options? — Reply to this email directly, view it on GitHub <#234 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABYADBI5XBB6TYVQG3PBVXLWCB6QRANCNFSM6AAAAAAQ7A5MYU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

d33bs · 2022-10-11T16:09:42Z

@gwaybio and @axiomcura - I made some changes to allow optional float_dtype specification for merge_single_cells. The argument, float_datatype, defaults to using np.float64 to ensure no accidental data loss (developers must specify a smaller or different datatype). When using np.float32 instead of the default np.float64 I'm seeing around 48% reduction in memory use with small tests. This may vary contingent on the compartments, features, and overall data size being used by the method.

gwaybio · 2022-12-15T17:01:07Z

@d33bs - sorry to have lost track of this PR. Can you remind me where we are with this? is it ready to merge?

d33bs · 2022-12-15T18:37:05Z

@d33bs - sorry to have lost track of this PR. Can you remind me where we are with this? is it ready to merge?

Thanks @gwaybio and no worries! I feel that this PR should be closed without a merge, with justifications and detailed thoughts below. Very open to other ideas and discussion here! 🙂

Justifications on Closing without Merge:

We've bound at least three major changes within this PR which could each otherwise be treated separately: 1. Pandas merge performance enhancements, 2. floating point datatype options and defaults, 3. column order within merged results (definition as well as implementation of order).
(pertaining to change 1.) As-is, this PR I think may add greater complexity surrounding how Pandas merge operations function within merge_single_cells. Adding greater complexity with Pandas merge operations may be risky (lowering understandability, somewhat unique to Pandas data handling, etc) and could prove unsustainable for development in pycytominer. There may be other more desirable chunking options (for example, as suggested here and below).

Forward momentum:

(pertaining to change 2.) The floating point datatype options and defaults feel like the most clear way to address performance issues, accepting the caveat that this is a temporary and maybe sometimes scientifically unverified path forward. This may also be the quickest way to enable progress for Converting cell-health sqlite files to parquet pipeline WayScience/cell-health-data#10 (>=20 GB dataset resource exceptions).
- Suggested action items:
  1. We open a new issue to outline the focus and need here
  2. We open a new PR for review with roughly the content of 298bf3a and b4634e2.
(pertaining to change 3.) Would it be valuable to programmatically address the column order of merge_single_cells results within pycytominer (for example, see comment)? As things stand, some of this is implied through Pandas merge and join operations. I'm uncertain about what the downstream effects might be to implementing changes here, but it may be a useful challenge to face in order to address general merge/join performance with reproducibility in mind.
- Suggested action items:
  1. If it's valuable, we open a new issue to outline the focus and need here.
(pertaining to change 1.) The items above would provide greater definition for compartment merge performance considerations. Combining thoughts from other conversations, I wonder what performance gains might be possible through the use of an in-memory DuckDB client for SQLite data extraction and chunking as PyArrow Tables. This may only make sense in scenarios where we export to a file (as otherwise we may face too many penalties in cross-conversions).
- Brainstorming: one somewhat in-place approach could involve adding an optional parameter for load_compartment to return Arrow (defaulting to Pandas). If Arrow is optionally used within merge_single_cells, we could define a chunked join to compare performance. It could also make sense to perform the join work without loading each compartment first. The optional nature of the in-memory data format selection could help address backwards compatibility (with Pandas) while acknowledging and striving for greater design scalability (with Arrow).

What are your thoughts? Should we move forward with any of the above as outlined? Please let me know if you have questions/suggestions/comments throughout.

gwaybio · 2022-12-16T16:18:18Z

Thanks @d33bs for catching me up!

What are your thoughts? Should we move forward with any of the above as outlined?

No. I feel that because of limited bandwidth, all of our efforts should be placed toward pycytominer-transform at the moment. I feel that once we have transform, we can start to deprecate SingleCells functionality, which will invariably make these short term patches a waste of time.

However, if you sense that any of these patches might be really quick, then let's do them! There is an unfortunate time trade-off here, and we need to be really thoughtful about were to allocate attention.

I'll close now and we'll use a short part of our meeting this morning to discuss any of these short-term pycytominer targets. Thanks again!

d33bs and others added 2 commits October 6, 2022 13:09

reduce memory use by merge_single_cells

422f3a4

add comments for chunksize

e0062a4

d33bs requested review from falquaddoomi and gwaybio October 6, 2022 21:38

gwaybio requested changes Oct 7, 2022

View reviewed changes

d33bs and others added 2 commits October 7, 2022 12:53

Apply renaming suggestions from code review

cc68ec5

Co-authored-by: Gregory Way <[email protected]>

rename chunksize and chunk vars; test spacing

65d2a42

axiomcura mentioned this pull request Oct 10, 2022

Converting cell-health sqlite files to parquet pipeline WayScience/cell-health-data#10

Closed

d33bs added 2 commits October 10, 2022 15:55

add optional float dtype specification

298bf3a

test merge_single_cells non-default float dtype

b4634e2

gwaybio self-requested a review October 19, 2022 20:44

gwaybio closed this Dec 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce memory consumption of SingleCells.merge_single_cells #234

Reduce memory consumption of SingleCells.merge_single_cells #234

d33bs commented Oct 6, 2022

gwaybio left a comment

gwaybio Oct 7, 2022

d33bs Oct 7, 2022

gwaybio Oct 7, 2022

d33bs Oct 7, 2022

d33bs Oct 14, 2022

gwaybio Oct 7, 2022

d33bs Oct 7, 2022

d33bs commented Oct 7, 2022

gwaybio commented Oct 7, 2022 via email

d33bs commented Oct 11, 2022

gwaybio commented Dec 15, 2022

d33bs commented Dec 15, 2022

gwaybio commented Dec 16, 2022

Reduce memory consumption of SingleCells.merge_single_cells #234

Reduce memory consumption of SingleCells.merge_single_cells #234

Conversation

d33bs commented Oct 6, 2022

Description

What is the nature of your change?

Checklist

gwaybio left a comment

Choose a reason for hiding this comment

gwaybio Oct 7, 2022

Choose a reason for hiding this comment

d33bs Oct 7, 2022

Choose a reason for hiding this comment

gwaybio Oct 7, 2022

Choose a reason for hiding this comment

d33bs Oct 7, 2022

Choose a reason for hiding this comment

d33bs Oct 14, 2022

Choose a reason for hiding this comment

gwaybio Oct 7, 2022

Choose a reason for hiding this comment

d33bs Oct 7, 2022

Choose a reason for hiding this comment

d33bs commented Oct 7, 2022

gwaybio commented Oct 7, 2022 via email

d33bs commented Oct 11, 2022

gwaybio commented Dec 15, 2022

d33bs commented Dec 15, 2022

gwaybio commented Dec 16, 2022