Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce memory consumption of SingleCells.merge_single_cells #234

Closed
wants to merge 6 commits into from
Closed

Reduce memory consumption of SingleCells.merge_single_cells #234

wants to merge 6 commits into from

Conversation

d33bs
Copy link
Member

@d33bs d33bs commented Oct 6, 2022

Description

These changes seek to address #233 memory challenges when using SingleCells.merge_single_cells by segmenting Pandas merges using dynamic argument chunksize and related dataframe concatenation within the method. In brief testing using small datasets with these changes I saw around 12-18% reduction in peak memory use (using chunksize default).

Using this segmented merge approach we're making a tradeoff between memory and time; in large data scenarios we may avoid out-of-memory issues and the procedure may take longer due to more (but smaller) merges being required. chunksize is an optional argument which when left to it's default of None will automatically be set to 1/3rd of the first compartment dataframe rowlength. Performance may vary with chunksize contingent on the dataset and is the reasoning for this being an argument which may be optionally set by a developer.

Caveats:

  • Leveraging existing suffix column naming within the merges along with the increase in number of merges increases the number of warnings emitted during testing. (FutureWarning: Passing 'suffixes' which cause duplicate columns {'ObjectNumber_cytoplasm'} in the result is deprecated and will raise a MergeError in a future version.).
  • Testing via test_cells.py needed updates in order for the changes within cells.py to pass pytesting. With these changes I focused on ensuring the data itself was as expected. I'd ask for double-checks here especially to make sure everything looks good.
  • Performance will likely vary based on the resources of the system running pycytominer and the dataset involved. Using the chunksize default does not guarantee avoidance of out-of-memory issues and instead provides flexibility for developer implementation.

Thank you @axiomcura for the excellent writeup on the performance issue!

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings. (see caveats above)
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs requested review from falquaddoomi and gwaybio October 6, 2022 21:38
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Please address my comments (focusing on the testing comments)

Also, would reducing to float32 here help a lot as well? This is within scope of the name of your PR

pycytominer/cyto_utils/cells.py Outdated Show resolved Hide resolved
pycytominer/cyto_utils/cells.py Outdated Show resolved Hide resolved
pycytominer/cyto_utils/cells.py Outdated Show resolved Hide resolved
pycytominer/cyto_utils/cells.py Outdated Show resolved Hide resolved
"Metadata_ObjectNumber_cells",
]
# check that we have the same data using same cols, sort and a reset index
pd.testing.assert_frame_equal(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't looked blacked to me, please confirm

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran black on this, and --check doesn't seem to flag it (may be profile related). I'm seeing that there's a list end bumping up against the comment, I'll add in a newline there with new changes. Are you seeing anything else that looks stylistically out of place which I may change?

@@ -493,8 +517,8 @@ def test_merge_single_cells_cytominer_database_test_file():
# note: pd.DataFrame datatypes sometimes appear automatically changed on-read, so we cast
# the result_file dataframe using the base dataframe's types.
pd.testing.assert_frame_equal(
pd.read_csv(csv_path).astype(merged_sc.dtypes.to_dict()),
pd.read_csv(result_file).astype(merged_sc.dtypes.to_dict()),
pd.read_csv(csv_path).astype(merged_sc.dtypes.to_dict())[merged_sc.columns],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for calling this out! For inner merges, Pandas "...preserve[s] the order of the left keys." (reference: pd.merge(..., how)). The column specification here is provided to account for the now differently ordered columns due to the "right" and "left" being swapped within the compartment merges. I also felt that accounting for exact column ordering over time may be unwieldy, so this may provide benefit towards development velocity. That said, we may need a change; do you think we should enforce strict column order within merge_single_cells to remove the need for specification here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gwaybio - just wanted to follow up here with some thoughts about merge_single_cells column order specification. If we make explicit column sorting a part of the method, we should be able to reduce the variability of data produced. Note: tests may still require column filtering (for example, if certain data do not exist in one dataframe vs another).

I wrote the following as an example of what we could do here:

import pandas as pd

# create an example dataframe with mixed order columns
df = pd.DataFrame(
    {
        "Image_Metadata_1": [0],
        "TableNumber": [0],
        "ImageNumber": [0],
        "Cytoplasm_Data_1": [0],
        "Nuclei_Data_1": [0],
        "Cells_Data_1": [0],
        "Mito_Data_1": [0],
        "Actin_Data_1": [0],
        "Image_Data_1": [0],
    }
)

# print df columns as a python list, representing new order
print("Initial order:")
print(df.columns.tolist())


def custom_sort(value: str):
    """
    A custom sort for Pycytominer merge_single_cells
    pd.dataframe columns
    """

    # lowercase str which will be used for comparisons
    # to avoid any capitalization challenges
    value_lower = value.lower()

    # first sorted values (by list index)
    sort_first = ["tablenumber", "imagenumber"]

    # middle sort value
    sort_middle = "metadata"

    # sorted last (by list order enumeration)
    sort_later = [
        "cells",
        "cytoplasm",
        "nuclei",
        "image",
    ]

    # if value is in the sort_first list
    # return the index from that list
    if value_lower in sort_first:
        return sort_first.index(value_lower)

    # if sort_middle is anywhere in value return
    # next index value after sort_first values
    elif sort_middle in value_lower:
        return len(sort_first)

    # if any sort_later are found as the first part of value
    # return enumerated index of sort_later value (starting from
    # relative len based on the above conditionals and lists)
    elif any(value_lower.startswith(val) for val in sort_later):
        for k, v in enumerate(sort_later, start=len(sort_first) + 1):
            if value_lower.startswith(v):
                return k

    # else we return the total length of all sort values
    return len(sort_first) + len(sort_later) + 1


# inner sorted alphabetizes any columns which may not be part of custom_sort
# outer sort provides pycytominer-specific column sort order
df = df[sorted(sorted(df.columns), key=custom_sort)]

# print df columns as a python list, representing new order
print("\nSorted order:")
print(df.columns.tolist())

Which has printed output:

Initial order:
['Image_Metadata_1', 'TableNumber', 'ImageNumber', 'Cytoplasm_Data_1', 'Nuclei_Data_1', 'Cells_Data_1', 'Mito_Data_1', 'Actin_Data_1', 'Image_Data_1']

Sorted order:
['TableNumber', 'ImageNumber', 'Image_Metadata_1', 'Cells_Data_1', 'Cytoplasm_Data_1', 'Nuclei_Data_1', 'Image_Data_1', 'Actin_Data_1', 'Mito_Data_1']

# check that we have the same data using same cols, sort and a reset index
pd.testing.assert_frame_equal(
left=manual_merge[assert_cols]
.sort_values(by=assert_cols, ascending=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned that sorting is giving us undue confidence of equivalent results. Am I reading this wrong? How should I be thinking about this?

Can you add an explicit check that the object number is the same? Oh, but maybe this code block is doing exactly that (and more!)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment, thank you! The chunked merges made the data inconsistent with the existing tests. The check here now looks at columns "Metadata_ObjectNumber", "Metadata_ObjectNumber_cytoplasm", "Metadata_ObjectNumber_cells", sorted by the same columns in the same order for data equivalency and does include an Metadata_ObjectNumber check as a result.

Would users of merge_single_cells expect for sorted output by the same columns or similar? We could include this as an additional step within the method, which would likely avoid needing to do that manually within testing here.

pycytominer/cyto_utils/cells.py Outdated Show resolved Hide resolved
@d33bs
Copy link
Member Author

d33bs commented Oct 7, 2022

Thank you @gwaybio for the comments and feedback! I added comment + changes for individual thoughts and wanted to follow up here to your mention about float32. Converting from float64 (a default which I believe cascades from numpy to pandas) to float32 would very likely reduce the memory footprint of this method. Because this could mean data loss under some circumstances, would it be beneficial to make this an additional argument to allow development options?

@gwaybio
Copy link
Member

gwaybio commented Oct 7, 2022 via email

@d33bs
Copy link
Member Author

d33bs commented Oct 11, 2022

@gwaybio and @axiomcura - I made some changes to allow optional float_dtype specification for merge_single_cells. The argument, float_datatype, defaults to using np.float64 to ensure no accidental data loss (developers must specify a smaller or different datatype). When using np.float32 instead of the default np.float64 I'm seeing around 48% reduction in memory use with small tests. This may vary contingent on the compartments, features, and overall data size being used by the method.

@gwaybio gwaybio self-requested a review October 19, 2022 20:44
@gwaybio
Copy link
Member

gwaybio commented Dec 15, 2022

@d33bs - sorry to have lost track of this PR. Can you remind me where we are with this? is it ready to merge?

@d33bs
Copy link
Member Author

d33bs commented Dec 15, 2022

@d33bs - sorry to have lost track of this PR. Can you remind me where we are with this? is it ready to merge?

Thanks @gwaybio and no worries! I feel that this PR should be closed without a merge, with justifications and detailed thoughts below. Very open to other ideas and discussion here! 🙂

Justifications on Closing without Merge:

  • We've bound at least three major changes within this PR which could each otherwise be treated separately: 1. Pandas merge performance enhancements, 2. floating point datatype options and defaults, 3. column order within merged results (definition as well as implementation of order).
  • (pertaining to change 1.) As-is, this PR I think may add greater complexity surrounding how Pandas merge operations function within merge_single_cells. Adding greater complexity with Pandas merge operations may be risky (lowering understandability, somewhat unique to Pandas data handling, etc) and could prove unsustainable for development in pycytominer. There may be other more desirable chunking options (for example, as suggested here and below).

Forward momentum:

  • (pertaining to change 2.) The floating point datatype options and defaults feel like the most clear way to address performance issues, accepting the caveat that this is a temporary and maybe sometimes scientifically unverified path forward. This may also be the quickest way to enable progress for Converting cell-health sqlite files to parquet pipeline WayScience/cell-health-data#10 (>=20 GB dataset resource exceptions).
    • Suggested action items:
      1. We open a new issue to outline the focus and need here
      2. We open a new PR for review with roughly the content of 298bf3a and b4634e2.
  • (pertaining to change 3.) Would it be valuable to programmatically address the column order of merge_single_cells results within pycytominer (for example, see comment)? As things stand, some of this is implied through Pandas merge and join operations. I'm uncertain about what the downstream effects might be to implementing changes here, but it may be a useful challenge to face in order to address general merge/join performance with reproducibility in mind.
    • Suggested action items:
      1. If it's valuable, we open a new issue to outline the focus and need here.
  • (pertaining to change 1.) The items above would provide greater definition for compartment merge performance considerations. Combining thoughts from other conversations, I wonder what performance gains might be possible through the use of an in-memory DuckDB client for SQLite data extraction and chunking as PyArrow Tables. This may only make sense in scenarios where we export to a file (as otherwise we may face too many penalties in cross-conversions).
    • Brainstorming: one somewhat in-place approach could involve adding an optional parameter for load_compartment to return Arrow (defaulting to Pandas). If Arrow is optionally used within merge_single_cells, we could define a chunked join to compare performance. It could also make sense to perform the join work without loading each compartment first. The optional nature of the in-memory data format selection could help address backwards compatibility (with Pandas) while acknowledging and striving for greater design scalability (with Arrow).

What are your thoughts? Should we move forward with any of the above as outlined? Please let me know if you have questions/suggestions/comments throughout.

@gwaybio
Copy link
Member

gwaybio commented Dec 16, 2022

Thanks @d33bs for catching me up!

What are your thoughts? Should we move forward with any of the above as outlined?

No. I feel that because of limited bandwidth, all of our efforts should be placed toward pycytominer-transform at the moment. I feel that once we have transform, we can start to deprecate SingleCells functionality, which will invariably make these short term patches a waste of time.

However, if you sense that any of these patches might be really quick, then let's do them! There is an unfortunate time trade-off here, and we need to be really thoughtful about were to allocate attention.

I'll close now and we'll use a short part of our meeting this morning to discuss any of these short-term pycytominer targets. Thanks again!

@gwaybio gwaybio closed this Dec 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants