begin EDA of sc time-lapse profiles #5

Open
wants to merge 1 commit into
base: main
38 changes: 21 additions & 17 deletions .pre-commit-config.yaml
@@ -1,39 +1,43 @@
 ---
 repos:
+  - repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update
+    rev: v0.3.3post1 # Insert the latest tag here
+    hooks:
+      - id: pre-commit-update
+        args: [--exclude, black, --keep, isort]
   # Formats import order
   - repo: https://github.com/pycqa/isort
     rev: 5.13.2
     hooks:
       - id: isort
         name: isort (python)
         args: ["--profile", "black", "--filter-files"]

   # Code formatter for both python files and jupyter notebooks
   - repo: https://github.com/psf/black
     rev: 24.3.0
     hooks:
       - id: black-jupyter
       - id: black
         language_version: python3.10

   - repo: https://github.com/nbQA-dev/nbQA
     rev: 1.8.5
     hooks:
       - id: nbqa-isort
         additional_dependencies: [isort==5.6.4]
         args: [--profile=black]

   # remove unused imports
   - repo: https://github.com/hadialqattan/pycln.git
     rev: v2.4.0
     hooks:
       - id: pycln

   # additional hooks found with in the pre-commit lib
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v4.6.0
     hooks:
       - id: trailing-whitespace # removes trailing white spaces
       - id: mixed-line-ending # removes mixed end of line
         args:
           - --fix=lf
Binary file added 0.EDA/figures/CP/umap_centroid_plot.png
What is the NA label here? Consider replacing it with a more descriptive label.

Is it common to use the means of the UMAP space in this way?

What does plotting the means of the UMAP components tell us?
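
For context on the two questions above: a centroid plot of this kind presumably reduces each group of cells (for example, one time point) to the mean of its UMAP0 and UMAP1 coordinates, so it shows where the center of each group sits in the embedding rather than how individual cells are spread. A minimal sketch in plain Python follows; the `time_min` key and the sample values are invented for illustration and are not taken from the notebooks.

```python
from collections import defaultdict
from statistics import mean


def umap_centroids(rows, group_key="time_min"):
    """Return {group: (mean UMAP0, mean UMAP1)} for a list of row dicts."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append((row["UMAP0"], row["UMAP1"]))
    return {
        group: (mean(x for x, _ in points), mean(y for _, y in points))
        for group, points in grouped.items()
    }


# Hypothetical single-cell embedding rows (values invented for illustration).
rows = [
    {"time_min": 0, "UMAP0": 1.0, "UMAP1": 2.0},
    {"time_min": 0, "UMAP0": 3.0, "UMAP1": 4.0},
    {"time_min": 30, "UMAP0": -1.0, "UMAP1": 0.5},
]
centroids = umap_centroids(rows)
```

With a pandas dataframe, the same summary would be `umap_df.groupby(<time column>)[["UMAP0", "UMAP1"]].mean()`. Since distances in a UMAP embedding are not guaranteed to be globally meaningful, centroid positions are probably best read as a qualitative summary.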

Consider changing the x- and y-axis labels to reflect that they show the mean of the UMAP components (same for the other mean UMAP plots).

I'm not seeing all of the time points here. If they're not present, consider removing them from the legend (same for the other mean umap plots as well)

Binary file added 0.EDA/figures/CP/umap_plot_time.png
Binary file added 0.EDA/figures/CP/umap_plot_time_part_of_doses.png
Is this the same figure as the top row of the 0.EDA/figures/CP/umap_plot_time.png figure? If so, consider only keeping one copy to be more DRY.

Binary file added 0.EDA/figures/combined/umap_centroid_plot.png
It is difficult for me to distinguish between some of these time points. For example, between the 360 min and 330 min time points. Consider changing the color scheme to make the colors more distinguishable
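
One possible way to act on this suggestion is to spread the time points evenly around the hue circle instead of using neighboring shades. The sketch below uses only the standard library and returns hex strings that most plotting libraries accept; the function name and the choice of 13 time points (0 to 360 min in 30 min steps) are assumptions for illustration. Evenly spaced hues are not perceptually uniform, so a curated qualitative palette from the plotting library may still be the better choice.

```python
import colorsys


def distinct_hex_colors(n):
    """Return n hex colors with evenly spaced hues (full saturation and value)."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


# Assumed sampling scheme: 13 time points, 0 to 360 min in 30 min steps.
palette = distinct_hex_colors(13)
```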

Binary file added 0.EDA/figures/combined/umap_plot_time.png
Consider including a title describing details of the experiment, such as the compound used
(same comment for 0.EDA/figures/combined/umap_plot_time_part_of_doses.png)

Are these visualizations also present in 0.EDA/figures/combined/umap_plot_time.png? If so, consider keeping only one copy to be more DRY.

Binary file added 0.EDA/figures/scDINO/umap_centroid_plot.png
26,591 changes: 26,591 additions & 0 deletions 0.EDA/notebooks/0.generate_umap_embeddings.ipynb

Large diffs are not rendered by default.

513 changes: 513 additions & 0 deletions 0.EDA/notebooks/1.visualize_umaps.ipynb

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions 0.EDA/run_eda.sh
@@ -0,0 +1,31 @@
#!/bin/bash

# this script is used to run the EDA process

# activate the conda environment
mamba activate timelapse_env

jupyter nbconvert --to=script --FilesWriter.build_directory=scripts notebooks/*.ipynb

cd scripts || exit 1

# run the EDA scripts
python 0.generate_umap_embeddings.py --data_mode "CP"
python 0.generate_umap_embeddings.py --data_mode "scDINO"
python 0.generate_umap_embeddings.py --data_mode "combined"

# deactivate the conda environment
mamba deactivate

mamba activate R_timelapse_env

Rscript 1.visualize_umaps.r --data_mode "CP"
Rscript 1.visualize_umaps.r --data_mode "scDINO"
Rscript 1.visualize_umaps.r --data_mode "combined"

mamba deactivate

cd ../ || exit 1

# end of script
echo "EDA process completed"
121 changes: 121 additions & 0 deletions 0.EDA/scripts/0.generate_umap_embeddings.py
@@ -0,0 +1,121 @@
#!/usr/bin/env python
# coding: utf-8

# This notebook generates the umap embeddings of the images in the dataset. The embeddings are saved in a parquet file.

# In[1]:


import argparse
import pathlib

import numpy as np
import pandas as pd
import umap

# In[ ]:


# set the arg parser
parser = argparse.ArgumentParser(description="UMAP on a matrix")

parser.add_argument("--data_mode", type=str, default="CP", help="data mode to use")

# get the args
args = parser.parse_args()

# set data mode to either "CP" or "scDINO" or "combined"
data_mode = args.data_mode
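
The comment above lists three valid modes, but the parser as written accepts any string and relies on a later check. As a hedged alternative, argparse's `choices` parameter can reject unsupported values at parse time. The sketch below is a hypothetical variant of the parser, not the PR's code; it passes an explicit argument list so that it runs independently of the command line.

```python
import argparse

# Hypothetical variant of the parser above: `choices` makes argparse reject
# unsupported modes at parse time with a usage message.
parser = argparse.ArgumentParser(description="UMAP on a matrix")
parser.add_argument(
    "--data_mode",
    type=str,
    default="CP",
    choices=["CP", "scDINO", "combined"],
    help="data mode to use",
)

# Passing an explicit list keeps the example independent of sys.argv.
args = parser.parse_args(["--data_mode", "scDINO"])
data_mode = args.data_mode
```

An unsupported value such as `--data_mode foo` then makes `parse_args` exit with an error before any data is read.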


# In[2]:


# set the paths to the data
CP_fs_sc_profiles_path = pathlib.Path(
    "../../data/feature_selected_data/run_20230920ChromaLiveTL_24hr4ch_MaxIP_norm_fs.parquet"
).resolve(strict=True)
scDINO_sc_profiles_path = pathlib.Path(
    "../../data/outputdir/apoptosis_timelapse/CLS_features/CLS_features_annotated.parquet"
).resolve(strict=True)
combined_profiles_path = pathlib.Path(
    "../../data/20231017ChromaLive_6hr_4ch_MaxIP_normalized_combined_data_feature_selected.parquet"
).resolve(strict=True)

# output directory for the UMAP embeddings (created if it does not exist)
output_path = pathlib.Path("../../data/umap/").resolve()
output_path.mkdir(parents=True, exist_ok=True)
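
The `resolve(strict=True)` calls above make the script fail early when an input file is missing: with `strict=True`, `pathlib.Path.resolve` raises `FileNotFoundError` for a nonexistent path, whereas the default resolves the path anyway. A small sketch of that behavior, using a temporary directory and a hypothetical file name:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; nothing is ever written to this path.
    missing = pathlib.Path(tmp_dir) / "profiles.parquet"
    try:
        missing.resolve(strict=True)
        found = True
    except FileNotFoundError:
        found = False

    # Without strict=True the path is resolved even though nothing exists there.
    resolved_anyway = missing.resolve()
```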


# In[3]:


if data_mode == "CP":
    # read the data
    profiles_df = pd.read_parquet(CP_fs_sc_profiles_path)
elif data_mode == "scDINO":
    # read the data
    profiles_df = pd.read_parquet(scDINO_sc_profiles_path)
elif data_mode == "combined":
    # read the data
    profiles_df = pd.read_parquet(combined_profiles_path)
else:
    raise ValueError("data_mode must be either 'CP' or 'scDINO' or 'combined'")
print(profiles_df.shape)
# show all columns
pd.set_option("display.max_columns", None)
profiles_df.head()
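
The if/elif chain above maps each data mode to one input file. One alternative, sketched here under the assumption that nothing else differs between the branches, is a dictionary lookup that keeps the valid modes and their paths in one place. The shortened paths and the helper name `profile_path_for` are hypothetical.

```python
import pathlib

# Hypothetical, shortened paths standing in for the three parquet files above.
PROFILE_PATHS = {
    "CP": pathlib.Path("data/cp_profiles.parquet"),
    "scDINO": pathlib.Path("data/scdino_profiles.parquet"),
    "combined": pathlib.Path("data/combined_profiles.parquet"),
}


def profile_path_for(data_mode):
    """Return the profile path for a data mode, or raise ValueError."""
    try:
        return PROFILE_PATHS[data_mode]
    except KeyError:
        valid = ", ".join(repr(mode) for mode in PROFILE_PATHS)
        raise ValueError(f"data_mode must be one of {valid}") from None
```

The script would then call `pd.read_parquet(profile_path_for(data_mode))` once instead of repeating the read in every branch.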


# In[4]:


# filter the data and drop nan values
print(profiles_df.shape)
# drop nan values in non metadata columns
profiles_df = profiles_df.dropna(
    subset=profiles_df.columns[~profiles_df.columns.str.contains("Meta")]
)
print(profiles_df.shape)
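
Note that the NaN filter above excludes any column whose name contains `Meta`, while the metadata/feature split later in the script matches `Metadata_`. The two patterns select the same columns only if no feature name happens to contain `Meta`. The sketch below illustrates the difference on hypothetical column names using plain substring tests.

```python
# Hypothetical column names; only the substrings matter for this illustration.
columns = [
    "Metadata_Well",
    "Metadata_Time",
    "Cells_AreaShape_Area",
    "Nuclei_MetaFeature",
]

# Substring match on "Meta", as in the NaN filter above.
contains_meta = [c for c in columns if "Meta" in c]
# Substring match on "Metadata_", as in the later metadata/feature split.
contains_metadata = [c for c in columns if "Metadata_" in c]
```

Here `Nuclei_MetaFeature` is treated as metadata by the first pattern but as a feature by the second, so its NaN values would not be dropped before the embedding step.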


# In[5]:


# get the metadata columns
metadata_cols = profiles_df.columns.str.contains("Metadata_")
metadata_df = profiles_df.loc[:, metadata_cols]
features_df = profiles_df.loc[:, ~metadata_cols]

# set the umap parameters
# named umap_model so that the imported `umap` module is not shadowed
umap_model = umap.UMAP(
    n_neighbors=15,
    n_components=2,
    metric="euclidean",
    random_state=42,
    min_dist=0.1,
    n_epochs=500,
    learning_rate=1,
)

# fit the umap model
umap_model.fit(features_df)

# transform the data
umap_transformed = umap_model.transform(features_df)

# create a dataframe with the transformed data
umap_df = pd.DataFrame(
    umap_transformed, columns=["UMAP0", "UMAP1"], index=features_df.index
)

# combine the metadata and umap dataframes
umap_df = pd.concat([metadata_df, umap_df], axis=1)
print(umap_df.shape)
umap_df.head()


# In[ ]:


# save the umap dataframe
umap_df.to_parquet(output_path / f"{data_mode}_umap_transformed.parquet")