annotate function kills kernel when trying to process a 13.1GB file #275

Open · jenna-tomkinson (Member) opened this issue Apr 26, 2023 · 2 comments

This issue is related to issue #233.

I created a group of Python files to (a condensed sketch of the pipeline follows this list):

  1. Convert the two CellProfiler SQLite outputs into parquet and merge single cells using CytoTable convert.
  2. Merge the two parquet files into one using pandas concat, since the two outputs should have been a single file but were split when a power outage interrupted the CellProfiler run.
  3. Annotate the new combined parquet file with Pycytominer annotate.
  4. Perform normalization with Pycytominer normalize.
  5. Perform feature selection with Pycytominer feature_select.
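
For context, here is a condensed sketch of that pipeline. The file paths, the CytoTable preset, and the normalize/feature_select parameters are illustrative assumptions, not the exact values from my scripts:

    import pandas as pd
    import cytotable
    from pycytominer import annotate, normalize, feature_select

    # 1. Convert each CellProfiler SQLite output to parquet with CytoTable
    for sqlite_path, parquet_path in [
        ("run_part1.sqlite", "part1.parquet"),
        ("run_part2.sqlite", "part2.parquet"),
    ]:
        cytotable.convert(
            source_path=sqlite_path,
            dest_path=parquet_path,
            dest_datatype="parquet",
            preset="cellprofiler_sqlite_pycytominer",
        )

    # 2. Concatenate the two halves of the interrupted run
    single_cell_df = pd.concat(
        [pd.read_parquet("part1.parquet"), pd.read_parquet("part2.parquet")],
        ignore_index=True,
    )

    # 3-5. Annotate, normalize, and feature-select with Pycytominer
    annotated_df = annotate(
        profiles=single_cell_df,
        platemap=pd.read_csv("platemap.csv"),
        join_on=["Metadata_well_id", "Image_Metadata_Well"],
    )
    normalized_df = normalize(profiles=annotated_df, method="standardize")
    selected_df = feature_select(profiles=normalized_df, operation="variance_threshold")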

After merging the two parquet files, the combined file is 13.1 GB:

[screenshot: file listing showing the merged parquet file at 13.1 GB]

When running the scripts as described above, the kernel was killed during the annotate function:

[screenshot: kernel killed while running annotate, showing ~102 GB of attempted memory use]

This means the function attempted to use about 102 GB, while I only have about 49 GB available.

After talking with @axiomcura, he believes the issue might arise in this part of the annotate function:

    if isinstance(external_metadata, pd.DataFrame):
        # prefix any unprefixed external metadata columns with "Metadata_"
        external_metadata.columns = [
            "Metadata_{}".format(x) if not x.startswith("Metadata_") else x
            for x in external_metadata.columns
        ]

        # left-merge the external metadata onto the annotated profiles,
        # then deduplicate the full wide result row by row
        annotated = (
            annotated.merge(
                external_metadata,
                left_on=external_join_left,
                right_on=external_join_right,
                how="left",
            )
            .reset_index(drop=True)
            .drop_duplicates()
        )
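
One thing worth noting: if the join keys on the metadata side are not unique, this merge multiplies rows before drop_duplicates ever runs, which alone could explain the memory blowup. A minimal sketch of how to check for that with pandas merge validation (the dataframe and column names are placeholders matching my join, not Pycytominer internals):

    import pandas as pd

    # validate="m:1" asks pandas to verify the merge is many-to-one:
    # it raises MergeError instead of silently duplicating rows when
    # the right-hand join keys are not unique.
    try:
        merged = single_cell_df.merge(
            platemap_df,
            left_on="Image_Metadata_Well",  # well column in the profiles
            right_on="Metadata_well_id",    # well column in the platemap
            how="left",
            validate="m:1",
        )
    except pd.errors.MergeError as err:
        print(f"platemap join keys are not unique: {err}")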

Machine info

    OS: Pop!_OS 22.04 LTS
    CPU: AMD Ryzen 7 3700X 8-Core Processor
    Memory: 64 GB of RAM

@gwaybio @d33bs

gwaybio (Member) commented Apr 27, 2023

Are you using the external_metadata argument? That argument is rarely used.

jenna-tomkinson (Member, Author) commented

My apologies; I forgot to add the code that I was using. Here it is:

    from pycytominer import annotate
    from pycytominer.cyto_utils import output

    # add metadata from platemap file to extracted single cell features
    annotated_df = annotate(
        profiles=single_cell_df,
        platemap=platemap_df,
        join_on=["Metadata_well_id", "Image_Metadata_Well"],
    )

    # move metadata well and single cell count to the front of the df
    # (for easy visualization in python)
    well_column = annotated_df.pop("Metadata_Well")
    singlecell_column = annotated_df.pop("Metadata_number_of_singlecells")
    # insert the columns as the second and third columns in the dataframe
    annotated_df.insert(1, "Metadata_Well", well_column)
    annotated_df.insert(2, "Metadata_number_of_singlecells", singlecell_column)

    # save annotated df as parquet file
    output(
        df=annotated_df,
        output_filename=output_file,
        output_type="parquet",
    )

I assumed that neither DataFrame.pop nor DataFrame.insert would cause the memory issue. output also doesn't seem like the culprit, but I am open to ideas.

When I have bandwidth, I can try the memory-profiler method from #233.
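
For reference, a minimal sketch of what that profiling could look like with memory-profiler (the wrapper function is illustrative; single_cell_df and platemap_df are the same dataframes as above):

    from memory_profiler import memory_usage
    from pycytominer import annotate

    def run_annotate():
        # wrapper so memory_usage can sample the call while it runs
        return annotate(
            profiles=single_cell_df,
            platemap=platemap_df,
            join_on=["Metadata_well_id", "Image_Metadata_Well"],
        )

    # sample memory every 0.1 s and report the peak usage (in MiB)
    peak = memory_usage((run_annotate, (), {}), interval=0.1, max_usage=True)
    print(f"peak memory during annotate: {peak} MiB")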
