annotate function kills kernel when trying to process a 13.1GB file #275

Open · jenna-tomkinson (Member) opened this issue Apr 26, 2023 · 2 comments

This issue is related to issue #233.

I created a group of Python files to (a condensed sketch of the pipeline follows this list):

  1. Convert the two CellProfiler SQLite outputs into parquet and merge single cells using CytoTable convert.
  2. Merge the two parquet files into one using pandas concat, since the two outputs should have been a single file but were split when a power outage interrupted the CellProfiler run.
  3. Annotate the new combined parquet file with Pycytominer annotate.
  4. Perform normalization with Pycytominer normalize.
  5. Perform feature selection with Pycytominer feature_select.
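
For context, here is a condensed sketch of that pipeline. The file paths, the CytoTable preset, and the normalize/feature_select parameters are illustrative assumptions, not the exact values from my scripts:

    import pandas as pd
    import cytotable
    from pycytominer import annotate, normalize, feature_select

    # 1. Convert each CellProfiler SQLite output to parquet with CytoTable
    for sqlite_path, parquet_path in [
        ("run_part1.sqlite", "part1.parquet"),
        ("run_part2.sqlite", "part2.parquet"),
    ]:
        cytotable.convert(
            source_path=sqlite_path,
            dest_path=parquet_path,
            dest_datatype="parquet",
            preset="cellprofiler_sqlite_pycytominer",
        )

    # 2. Concatenate the two halves of the interrupted run
    single_cell_df = pd.concat(
        [pd.read_parquet("part1.parquet"), pd.read_parquet("part2.parquet")],
        ignore_index=True,
    )

    # 3-5. Annotate, normalize, and feature-select with Pycytominer
    annotated_df = annotate(
        profiles=single_cell_df,
        platemap=pd.read_csv("platemap.csv"),
        join_on=["Metadata_well_id", "Image_Metadata_Well"],
    )
    normalized_df = normalize(profiles=annotated_df, method="standardize")
    selected_df = feature_select(profiles=normalized_df, operation="variance_threshold")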

After merging the two parquet files, the combined file is 13.1 GB:

[screenshot: file listing showing the merged parquet file at 13.1 GB]

When running the scripts as described above, the kernel was killed during the annotate function:

[screenshot: kernel killed while running annotate, showing ~102 GB of attempted memory use]

This means the function attempted to use about 102 GB, while I only have about 49 GB available.

After talking with @axiomcura, he believes the issue might arise in this part of the annotate function:

    if isinstance(external_metadata, pd.DataFrame):
        # prefix any unprefixed external metadata columns with "Metadata_"
        external_metadata.columns = [
            "Metadata_{}".format(x) if not x.startswith("Metadata_") else x
            for x in external_metadata.columns
        ]

        # left-merge the external metadata onto the annotated profiles,
        # then deduplicate the full wide result row by row
        annotated = (
            annotated.merge(
                external_metadata,
                left_on=external_join_left,
                right_on=external_join_right,
                how="left",
            )
            .reset_index(drop=True)
            .drop_duplicates()
        )
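
One thing worth noting: if the join keys on the metadata side are not unique, this merge multiplies rows before drop_duplicates ever runs, which alone could explain the memory blowup. A minimal sketch of how to check for that with pandas merge validation (the dataframe and column names are placeholders matching my join, not Pycytominer internals):

    import pandas as pd

    # validate="m:1" asks pandas to verify the merge is many-to-one:
    # it raises MergeError instead of silently duplicating rows when
    # the right-hand join keys are not unique.
    try:
        merged = single_cell_df.merge(
            platemap_df,
            left_on="Image_Metadata_Well",  # well column in the profiles
            right_on="Metadata_well_id",    # well column in the platemap
            how="left",
            validate="m:1",
        )
    except pd.errors.MergeError as err:
        print(f"platemap join keys are not unique: {err}")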

Machine info

    OS: Pop!_OS 22.04 LTS
    CPU: AMD Ryzen 7 3700X 8-Core Processor
    Memory: 64 GB of RAM

@gwaybio @d33bs

gwaybio (Member) commented Apr 27, 2023

Are you using the external_metadata argument? That argument is rarely used.

jenna-tomkinson (Member, Author) commented

My apologies; I forgot to add the code that I was using. Here it is:

    from pycytominer import annotate
    from pycytominer.cyto_utils import output

    # add metadata from platemap file to extracted single cell features
    annotated_df = annotate(
        profiles=single_cell_df,
        platemap=platemap_df,
        join_on=["Metadata_well_id", "Image_Metadata_Well"],
    )

    # move metadata well and single cell count to the front of the df
    # (for easy visualization in python)
    well_column = annotated_df.pop("Metadata_Well")
    singlecell_column = annotated_df.pop("Metadata_number_of_singlecells")
    # insert the columns as the second and third columns in the dataframe
    annotated_df.insert(1, "Metadata_Well", well_column)
    annotated_df.insert(2, "Metadata_number_of_singlecells", singlecell_column)

    # save annotated df as parquet file
    output(
        df=annotated_df,
        output_filename=output_file,
        output_type="parquet",
    )

I assumed that neither DataFrame.pop nor DataFrame.insert would cause the memory issue. output also doesn't seem like the culprit, but I am open to ideas.

When I have bandwidth, I can try the memory-profiler method from #233.
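
For reference, a minimal sketch of what that profiling could look like with memory-profiler (the wrapper function is illustrative; single_cell_df and platemap_df are the same dataframes as above):

    from memory_profiler import memory_usage
    from pycytominer import annotate

    def run_annotate():
        # wrapper so memory_usage can sample the call while it runs
        return annotate(
            profiles=single_cell_df,
            platemap=platemap_df,
            join_on=["Metadata_well_id", "Image_Metadata_Well"],
        )

    # sample memory every 0.1 s and report the peak usage (in MiB)
    peak = memory_usage((run_annotate, (), {}), interval=0.1, max_usage=True)
    print(f"peak memory during annotate: {peak} MiB")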
