Operating system kernel kills merge_single_cells() process due to out of memory error #233

Open
axiomcura opened this issue Sep 30, 2022 · 0 comments


Hello everyone,

I have been using the newly developed merge_single_cells() (introduced in #219) to convert my .sqlite files into parquet files.

I am currently running into an issue where, for large files (specifically over 20 GB), the process is killed by the operating system due to an out-of-memory (OOM) error.

Killed Message

The image shows that some .sqlite files were successfully converted into parquet files, but the OS kernel killed the process while it was working on ../data/SQ00014611.sqlite.

To figure out why the OS kernel killed the process, I ran:

sudo dmesg -T | grep -E -i -B100 'killed process'

which returned:

Out of memory: Killed process 137056 (python) total-vm:58479492kB

Based on the message, it seems that .merge_single_cells() at some point uses 58.5 GB of memory on a 20.78 GB file.

Here is an image of the Pop!_OS resource monitor while running .merge_single_cells() on the ../data/SQ00014611.sqlite dataset, a few seconds before the OS kernel killed the job and raised the OOM error:

 PopOS Monitor

Below is the source code I used, along with a download link for the dataset that triggers this issue.

SQ00014611 dataset download link

from pathlib import Path

from pycytominer.cyto_utils.cells import SingleCells

# creating single cell object
sqlite_path = "./SQ00014611.sqlite"
sql_url = f"sqlite:///{sqlite_path}"
sc_p = SingleCells(
    sql_url,
    strata=["Image_Metadata_Plate", "Image_Metadata_Well"],
    image_cols=["TableNumber", "ImageNumber"],
)

# setting up output paths and naming
f_name = Path(sqlite_path).stem
save_dir = Path("../cell-health-parquet-data")
save_dir.mkdir(exist_ok=True)
save_path = save_dir / f"{f_name}.parquet"

# converting to parquet files
sc_p.merge_single_cells(
    sc_output_file=save_path,
    output_type="parquet",
    join_on=["Metadata_well_position", "Image_Metadata_Well"],
)
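
For anyone reproducing this, the run's memory usage over time can also be recorded with memory-profiler's mprof tool (assuming the script above is saved as convert.py; the filename is just a placeholder):

mprof run python convert.py   # samples the process memory over time
mprof plot                    # plots the recorded samples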

Memory benchmarking

I conducted a memory profile using memory-profiler to locate where the high memory usage occurs.

The approach is similar to the one used in #195.

The profile was generated with a smaller dataset (SQ00014613.sqlite) so that the run could complete and produce a full report.
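
For reference, the report below came from memory-profiler's line-by-line mode, roughly like this (a sketch: to_parquet_memory_profile inlines the body of merge_single_cells, and FILE and the script name are stand-ins):

from memory_profiler import profile

FILE = "./SQ00014613.sqlite"  # stand-in path to the smaller test dataset

@profile
def to_parquet_memory_profile():
    # body of merge_single_cells inlined here; see the line-by-line
    # profile below for the actual statements
    ...

if __name__ == "__main__":
    # run with: python -m memory_profiler profile_script.py
    to_parquet_memory_profile()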

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    31    159.8 MiB    159.8 MiB           1   @profile
    32                                         def to_parquet_memory_profile():
    33    159.8 MiB      0.0 MiB           1       sql_url = f"sqlite:///{FILE}"
    34    306.7 MiB    146.9 MiB           2       sc_p = SingleCells(
    35    159.8 MiB      0.0 MiB           1           sql_url,
    36    159.8 MiB      0.0 MiB           1           strata=["Image_Metadata_Plate", "Image_Metadata_Well"],
    37    159.8 MiB      0.0 MiB           1           image_cols=["TableNumber", "ImageNumber"],
    38    159.8 MiB      0.0 MiB           1           fields_of_view_feature=[]
    39                                             )
    40                                             
    41                                             # function parameters
    42    306.7 MiB      0.0 MiB           1       sc_output_file: str = ".",
    43    306.7 MiB      0.0 MiB           1       compute_subsample: bool = False
    44    306.7 MiB      0.0 MiB           1       single_cell_normalize=False
    45    306.7 MiB      0.0 MiB           1       compression_options: Optional[str] = None
    46    306.7 MiB      0.0 MiB           1       float_format: Optional[str] = None
    47    306.7 MiB      0.0 MiB           1       single_cell_normalize: bool = False
    48    306.7 MiB      0.0 MiB           1       normalize_args: Optional[Dict] = None
    49    306.7 MiB      0.0 MiB           1       platemap: Optional[Union[str, pd.DataFrame]] = None
    50    306.7 MiB      0.0 MiB           1       kwargs = {"output_type":"parquet"}
    51                                         
    52    306.7 MiB      0.0 MiB           1       sc_df = ""
    53    306.7 MiB      0.0 MiB           1       linking_check_cols = []
    54    306.7 MiB      0.0 MiB           1       merge_suffix_rename = []
    55   9874.3 MiB      0.0 MiB           4       for left_compartment in sc_p.compartment_linking_cols:
    56   9874.3 MiB      0.0 MiB           7           for right_compartment in sc_p.compartment_linking_cols[left_compartment]:
    57                                                     # Make sure only one merge per combination occurs
    58   9874.3 MiB      0.0 MiB           4               linking_check = "-".join(sorted([left_compartment, right_compartment]))
    59   9874.3 MiB      0.0 MiB           4               if linking_check in linking_check_cols:
    60   9874.3 MiB      0.0 MiB           2                   continue
    61                                         
    62                                                     # Specify how to indicate merge suffixes
    63   6629.5 MiB      0.0 MiB           2               merge_suffix = [
    64   6629.5 MiB      0.0 MiB           2                   "_{comp_l}".format(comp_l=left_compartment),
    65   6629.5 MiB      0.0 MiB           2                   "_{comp_r}".format(comp_r=right_compartment),
    66                                                     ]
    67   6629.5 MiB      0.0 MiB           2               merge_suffix_rename += merge_suffix
    68   6629.5 MiB      0.0 MiB           4               left_link_col = sc_p.compartment_linking_cols[left_compartment][
    69   6629.5 MiB      0.0 MiB           2                   right_compartment
    70                                                     ]
    71   6629.5 MiB      0.0 MiB           4               right_link_col = sc_p.compartment_linking_cols[right_compartment][
    72   6629.5 MiB      0.0 MiB           2                   left_compartment
    73                                                     ]
    74                                         
    75   6629.5 MiB      0.0 MiB           2               if isinstance(sc_df, str):
    76   3359.6 MiB   3052.9 MiB           1                   sc_df = sc_p.load_compartment(compartment=left_compartment)
    77                                         
    78   3359.6 MiB      0.0 MiB           1                   if compute_subsample:
    79                                                             # Sample cells proportionally by sc_p.strata
    80                                                             sc_p.get_subsample(df=sc_df, rename_col=False)
    81                                         
    82                                                             subset_logic_df = sc_p.subset_data_df.drop(
    83                                                                 sc_p.image_df.columns, axis="columns"
    84                                                             )
    85                                         
    86                                                             sc_df = subset_logic_df.merge(
    87                                                                 sc_df, how="left", on=subset_logic_df.columns.tolist()
    88                                                             ).reindex(sc_df.columns, axis="columns")
    89                                         
    90   6629.5 MiB     54.1 MiB           2                   sc_df = sc_df.merge(
    91   6575.4 MiB   3215.7 MiB           1                       sc_p.load_compartment(compartment=right_compartment),
    92   6575.4 MiB      0.0 MiB           1                       left_on=sc_p.merge_cols + [left_link_col],
    93   6575.4 MiB      0.0 MiB           1                       right_on=sc_p.merge_cols + [right_link_col],
    94   6575.4 MiB      0.0 MiB           1                       suffixes=merge_suffix,
    95                                                         )
    96                                         
    97                                                     else:
    98   9874.3 MiB     79.9 MiB           2                   sc_df = sc_df.merge(
    99   9794.4 MiB   3164.9 MiB           1                       sc_p.load_compartment(compartment=right_compartment),
   100   9794.4 MiB      0.0 MiB           1                       left_on=sc_p.merge_cols + [left_link_col],
   101   9794.4 MiB      0.0 MiB           1                       right_on=sc_p.merge_cols + [right_link_col],
   102   9794.4 MiB      0.0 MiB           1                       suffixes=merge_suffix,
   103                                                         )
   104                                         
   105   9874.3 MiB      0.0 MiB           2               linking_check_cols.append(linking_check)
   106                                         
   107                                             # Add metadata prefix to merged suffixes
   108   9874.3 MiB      0.0 MiB           1       full_merge_suffix_rename = []
   109   9874.3 MiB      0.0 MiB           1       full_merge_suffix_original = []
   110   9874.3 MiB      0.0 MiB           6       for col_name in sc_p.merge_cols + list(sc_p.linking_col_rename.keys()):
   111   9874.3 MiB      0.0 MiB           5           full_merge_suffix_original.append(col_name)
   112   9874.3 MiB      0.0 MiB           5           full_merge_suffix_rename.append("Metadata_{x}".format(x=col_name))
   113                                         
   114   9874.3 MiB      0.0 MiB           6       for col_name in sc_p.merge_cols + list(sc_p.linking_col_rename.keys()):
   115   9874.3 MiB      0.0 MiB          20           for suffix in set(merge_suffix_rename):
   116   9874.3 MiB      0.0 MiB          15               full_merge_suffix_original.append("{x}{y}".format(x=col_name, y=suffix))
   117   9874.3 MiB      0.0 MiB          30               full_merge_suffix_rename.append(
   118   9874.3 MiB      0.0 MiB          15                   "Metadata_{x}{y}".format(x=col_name, y=suffix)
   119                                                     )
   120                                         
   121   9874.3 MiB      0.0 MiB           2       sc_p.full_merge_suffix_rename = dict(
   122   9874.3 MiB      0.0 MiB           1           zip(full_merge_suffix_original, full_merge_suffix_rename)
   123                                             )
   124                                         
   125                                             # Add image data to single cell dataframe
   126   9874.3 MiB      0.0 MiB           1       if not sc_p.load_image_data:
   127                                                 sc_p.load_image()
   128                                                 sc_p.load_image_data = True
   129                                         
   130   9826.8 MiB  -9314.6 MiB           1       sc_df = (
   131  19141.4 MiB   9267.1 MiB           3           sc_p.image_df.merge(sc_df, on=sc_p.merge_cols, how="right")
   132                                                 # pandas rename performance may be improved using copy=False, inplace=False
   133                                                 # reference: https://ryanlstevens.github.io/2022-05-06-pandasColumnRenaming/
   134  19141.4 MiB      0.0 MiB           1           .rename(
   135  19141.4 MiB      0.0 MiB           1               sc_p.linking_col_rename, axis="columns", copy=False, inplace=False
   136  19141.4 MiB      0.0 MiB           1           ).rename(
   137  19141.4 MiB      0.0 MiB           1               sc_p.full_merge_suffix_rename, axis="columns", copy=False, inplace=False
   138                                                 )
   139                                             )
   140   9826.8 MiB      0.0 MiB           1       if single_cell_normalize:
   141                                                 # Infering features is tricky with non-canonical data
   142                                                 if normalize_args is None:
   143                                                     normalize_args = {}
   144                                                     features = infer_cp_features(sc_df, compartments=sc_p.compartments)
   145                                                 elif "features" not in normalize_args:
   146                                                     features = infer_cp_features(sc_df, compartments=sc_p.compartments)
   147                                                 elif normalize_args["features"] == "infer":
   148                                                     features = infer_cp_features(sc_df, compartments=sc_p.compartments)
   149                                                 else:
   150                                                     features = normalize_args["features"]
   151                                         
   152                                                 normalize_args["features"] = features
   153                                         
   154                                                 sc_df = normalize(profiles=sc_df, **normalize_args)
   155                                         
   156                                             # In case platemap metadata is provided, use pycytominer.annotate for metadata
   157   9826.8 MiB      0.0 MiB           1       if platemap is not None:
   158                                                 sc_df = annotate(
   159                                                     profiles=sc_df, platemap=platemap, output_file="none", **kwargs
   160                                                 )
   161                                         
   162                                             return sc_df

Most of the memory consumption occurs within the merging operations (lines 130-137 in the profile above).

My guess is that the merging step still retains the non-merged input DataFrames alongside the merged result. This can become an issue because the merged DataFrame requires more and more memory over time, while memory is still reserved for the non-merged inputs; the sketch below illustrates the pattern.
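
If that guess is right, the behavior would look roughly like this standalone pandas sketch (illustrative only, not pycytominer code; the frame sizes are made up):

import numpy as np
import pandas as pd

# two ~400 MB frames standing in for compartment tables
left = pd.DataFrame(np.random.rand(1_000_000, 50)).add_prefix("left_")
left["key"] = np.arange(len(left))
right = pd.DataFrame(np.random.rand(1_000_000, 50)).add_prefix("right_")
right["key"] = np.arange(len(right))

# during the merge, `left`, `right`, and the result are all alive at
# once, so peak memory is roughly the sum of all three
merged = left.merge(right, on="key")

# dropping the inputs right after each merge would let Python reclaim
# that memory before the next compartment is loaded
del left, right

If that is what is happening, something like the del above, applied to each compartment DataFrame inside merge_single_cells once it has been merged in, might lower the peak; I have not tested this.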

Once the merging is complete, there is a large drop in memory usage at line 140.

This Stack Overflow post describes a similar issue with pandas' merge function.

I could be wrong; this is just my intuition from looking at the memory profiler report.

Machine info

  • OS: Pop!_OS 22.04 LTS
  • CPU: AMD Ryzen Threadripper 2970WX 24-Core Processor
  • Memory: 64 GB of DDR4 RAM
  • GPU: 2x NVIDIA RTX 3060

@gwaybio @d33bs