Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CellProfiler feature analysis #119

Open
gwaybio opened this issue Sep 29, 2021 · 29 comments
Open

CellProfiler feature analysis #119

gwaybio opened this issue Sep 29, 2021 · 29 comments
Labels
Discussion and Notes Documenting ideas/discussions Experiments Tracking experimental questions, results, or analysis

Comments

@gwaybio
Copy link
Member

gwaybio commented Sep 29, 2021

In a recent meeting, the team discussed a followup analysis for interrogating the signature features. We have already performed at least one feature analysis that defines all of the bz signature features (all features in the analysis) https://github.com/broadinstitute/profiling-resistance-mechanisms/blob/master/3.resistance-signature/5.visualize-signature-features.ipynb

Task 1 We'll also likely want to generate feature heatmaps (in a similar way that we define profile heatmaps) for only the bz signature features

Task 2 The final feature analysis ranks all features on their ability to separate the two classes (resistant vs. sensitive). We'll select the top discriminative features based on their ranks in the holdout set, but we should also perform this ranking procedure using all sets (training, test, validation, etc.).

The three above efforts nicely narrow down our search space from 1) all features, 2) bz features, 3) top bz features.

We can use this issue to elaborate on our approaches to form the heatmaps, to perform the ranking, and to determine effective visualizations.

@gwaybio gwaybio added Discussion and Notes Documenting ideas/discussions Experiments Tracking experimental questions, results, or analysis labels Sep 29, 2021
@AnneCarpenter
Copy link

Can you please elaborate on Task 1? What are the rows & columns of the heatmap?

For Task 2, I saw the task as just providing a rank ordered list where the rows are the ~40 bz singscore features and the values shown at the discriminative ability for the ~10 cell lines in the holdout set (not sure how to quantify - t test for WTs vs sensitives?) And repeat the list using the same features but quantifying ability to separate WTs vs sensitives for the training, test, validation (this would be a supplementary figure). I think you are saying the same thing but I don't know why you said to then "select the top discriminative features based on their ranks in the holdout set"; I think I was proposing just commenting on any particularly high or low scorers in the main text, not necessarily doing anything further with them.

@gwaybio
Copy link
Member Author

gwaybio commented Sep 29, 2021

Task 1: Feature Heatmaps

Heatmap of profiles with normalized feature values for only BZ signature features (can include all features in sup fig if we want). We should include a hierarchical clustering dendrogram for both rows and columns

Rows: BZ signature features
Columns: All profiles (can be only holdout set if we want too)
Entries: normalized feature value per profile

This is similar to the heatmaps we previously generated, except those are pairwise correlations of profiles.

@gwaybio
Copy link
Member Author

gwaybio commented Sep 29, 2021

I think you are saying the same thing but I don't know why you said to then "select the top discriminative features based on their ranks in the holdout set"; I think I was proposing just commenting on any particularly high or low scorers in the main text, not necessarily doing anything further with them.

Yep, we're saying the same thing. "select the top features..." so we can report them in the main text/discussion

not sure how to quantify - t test for WTs vs sensitives?

Sounds reasonable to me. There are a couple ways to do so - but I think we should keep this simple

@AnneCarpenter
Copy link

Ah, so columns here are cell lines?

Rows: BZ signature features
Columns: All profiles (can be only holdout set if we want too)
Entries: normalized feature value per profile

@yhan8
Copy link
Contributor

yhan8 commented Oct 7, 2021

@gwaygenomics To generate the heatmap, I am looking to load the files that contain the normalized values of BZ signatures, as well as the training, testing, validation, and the holdout sets. Do you know where I shall look for these files?

I found signature_summary_bortezomib_signature.tsv.gz that contains the feature names, and tukey_results_bortezomib_signature.tsv that contains several numerical values, but I am not sure which values are the actual signature values.

@yhan8 yhan8 closed this as completed Oct 7, 2021
@yhan8 yhan8 reopened this Oct 7, 2021
@gwaybio
Copy link
Member Author

gwaybio commented Oct 7, 2021

I believe there are some functions (called load_data()) or something like it in some of the early signature notebooks that should point you in the right direction.

Also, we are not looking to plot the "signature values" (or, maybe I don't know what you mean by that) - we are looking to plot the normalized feature values for ONLY the signature features across the cell lines.

@yhan8
Copy link
Contributor

yhan8 commented Oct 8, 2021

I generated heatmaps for training, testing, validation, holdout, new batch datasets. The signature value was loaded from file ./data/bortezomib_signature_analytical_set.tsvand ./data/bortezomib_signature_analytical_set_newbatch.tsv. I believe these are normalized values, let me know if they are not.

There are 45 selected BZ signatures in total which are displayed column wise (x axis), row wise are the profiles (y axis labels can be ignored for now, as they are just row wise index numbers from the metadata DataFrame).

bz_sig_feature_heatmap

@gwaygenomics Greg, will you be able to help confirm these values are normalized?

@AnneCarpenter
Copy link

Is there a way to know which rows are which type of cell line (sensitive/resistant?)

@yhan8
Copy link
Contributor

yhan8 commented Oct 8, 2021

Is there a way to know which rows are which type of cell line (sensitive/resistant?)

I think there are some interesting patterns, but the story will be more clear after I rank the features (Task 2 above).

For the last image, it appears to be 'repetitive patterns', but I think it is because of the narrower data points, if you zoom in on the image, it is actually not repetitive.

bz_sig_feature_heatmap

@yhan8
Copy link
Contributor

yhan8 commented Oct 12, 2021

In a recent meeting, the team discussed a followup analysis for interrogating the signature features. We have already performed at least one feature analysis that defines all of the bz signature features (all features in the analysis) https://github.com/broadinstitute/profiling-resistance-mechanisms/blob/master/3.resistance-signature/5.visualize-signature-features.ipynb

Task 1 We'll also likely want to generate feature heatmaps (in a similar way that we define profile heatmaps) for only the bz signature features

Task 2 The final feature analysis ranks all features on their ability to separate the two classes (resistant vs. sensitive). We'll select the top discriminative features based on their ranks in the holdout set, but we should also perform this ranking procedure using all sets (training, test, validation, etc.).

The three above efforts nicely narrow down our search space from 1) all features, 2) bz features, 3) top bz features.

For task 2, I am trying to figure out a way to assign "importance scores" for all of the BZ features, for example, like random forest algorithm would assign a importance score for each selected feature. @gwaygenomics it appears that the BZ features were selected using a linear model? If so, I would assume there should be some kinds of coefficient scores for each feature and we could just rank the coefficient scores? We could also consider using T test to analyze the resistant and sensitive clone type classification accuracy scores, but how can I get a null distribution for each specific feature to do so? I am thinking I could use each feature to do the classification, and then permute the clone types randomly for 100 times and get 100 acc scores for each BZ feature? Is that the best approach though? Any inputs would be greatly appreciated! @shntnu

@shntnu
Copy link
Collaborator

shntnu commented Oct 12, 2021

I think you're on the right track with the first idea i.e. use either a model that has already been fit (if that's true that must be true because that's how we got the signature) or fit a new model (e.g. Random Forest) and determine importance.

The question to ponder is – what do you want to do differently here that is not already done in https://github.com/broadinstitute/profiling-resistance-mechanisms/blob/master/3.resistance-signature/5.visualize-signature-features.ipynb?

@gwaybio
Copy link
Member Author

gwaybio commented Oct 13, 2021

Ah, that's great @shntnu, thanks for pointing to this. @yhan8 - check out cell 13 and the column estimate. You'll just need to track down where the file is saved.

As far as the heatmaps:

  1. Yes, bortezomib_signature_analytical_set.tsv.gz is normalized. See
  2. We would really like to see more descriptive heatmaps. Solving Efforts in installing R package of ComplexHeatmap #115 and then rerunning with appropriate column labels will do the trick. The column labels and the dendrogram are the two important pieces lacking from CellProfiler feature analysis #119 (comment)

@yhan8
Copy link
Contributor

yhan8 commented Nov 4, 2021

I am trying to use Greg's code to accomplish task 1.

I am able to reproduce a heatmap similar to the previously generated heatmaps that includes the new batch data (i.e., otherclone) using pairwise correlations.

heatmap_bz_vs.all_bortezomibsignature.pdf

However, the goal is to make the heatmap to have:

  • Rows: BZ signature features
  • Columns: All profiles (can be only holdout set if we want too)
  • Entries: normalized feature value per profile

The logistic is simple, all BZ features are stored in signature_df, all profiles are stored in features_df. I should be able to replace this R syntax ht <- Heatmap(data, Colv = NA, Rowv = NA, scale="column") with

    # Obtain correlation matrix
    correlation_matrix_df <- t(subset_data_df %>% dplyr::select(!!!selected_features)) %>% cor() 

    # Generate the heatmap
    ht <- Heatmap(
        correlation_matrix_df,
        name = "Correlation",
        column_dend_side = "top",
        # To generate heatmaps sorted by signature score
        # row_order = order(subset_data_df$TotalScore),
        # column_order = order(subset_data_df$TotalScore),
        clustering_method_columns = "average",
        clustering_method_rows = "average",

to accomplish this, but could someone help point to a clearer direction (I am not very familiar with R and I've been trying to solve this for quite sometime now)?

@gwaybio
Copy link
Member Author

gwaybio commented Nov 4, 2021

I don't think we want a correlation matrix - we want profiles similar to as you have posted in #119 (comment), except we want to use ComplexHeatmap to cluster rows and columns, and highlight resistance status of samples.

I have found the ComplexHeatmap documentation extremely helpful in the past.

@gwaybio
Copy link
Member Author

gwaybio commented Nov 4, 2021

so maybe it is just replacing:

# What you have above
correlation_matrix_df <- t(subset_data_df %>% dplyr::select(!!!selected_features)) %>% cor()

# Maybe we want this:
profile_df <- subset_data_df %>% dplyr::select(!!!selected_features)

Would inputing profile_df into the complex heatmap code work? (you'll likely need to modify the input dataframe according to the complex heatmap documentation and cross-referencing how I input the correlation_matrix_df data)

@shntnu
Copy link
Collaborator

shntnu commented Nov 12, 2021

@yhan8 – I peeked into this after we chatted about it the day before.

The docs are great as @gwaygenomics says. In specific, I found this example https://jokergoo.github.io/ComplexHeatmap-reference/book/more-examples.html#visualize-cell-heterogeneity-from-single-cell-rnaseq that has both, a clustering of the similarity matrix as well as a clustering of the profile matrix.

You can peek into that to see how they did it, but my suggestion would be to start with the small example that IIRC you said already works and build up from there, moving towards the figure you want (take your time to understand it and get comfortable – this looks like a very useful library!)

This is the simple example I tested right now

library("ComplexHeatmap")
library(tidyverse)
x <-
  read_csv(
    "https://media.githubusercontent.com/media/broadinstitute/profiling-resistance-mechanisms/master/0.generate-profiles/profiles/2019_11_11_Batch4/WTmut04hTh/WTmut04hTh_normalized_feature_selected.csv.gz"
  )
Heatmap(
  x %>% select(-matches("Metadata")) %>% select(1:10) %>% as.matrix() %>% t(),
  top_annotation = HeatmapAnnotation(
    clone_number = x %>% pull(Metadata_clone_number),
    treatment = x %>% pull(Metadata_treatment)
  )
)

image

@yhan8
Copy link
Contributor

yhan8 commented Nov 17, 2021

Figure below is for Task 1.
Rows: BZ signature features
Columns: All profiles
Entries: normalized feature value per profile

Added the new batch dataset as well as otherclone
Screen Shot 2021-11-17 at 2 50 42 PM

@AnneCarpenter
Copy link

Exciting! Can you add another metadata row to indicate which samples are sensitive and which are resistant? I think it could replace the clone number color scheme - the colors are reused so much one cannot really see what's what anyway.

Folks (@gwaygenomics , I guess) - was our goal here to keep each clone's replicates as separate or would this be easier to comprehend if we merge replicates into one column per clone? If we keep them separate woudl it make sense to NOT cluster the data but instead put it in the exact order of the list?

(IIRC, the goal here is to get a visual sense of which features are most robust across clones or see any patterns that might emerge?)

@yhan8
Copy link
Contributor

yhan8 commented Nov 17, 2021

As @AnneCarpenter requested :)
Screen Shot 2021-11-17 at 4 51 36 PM

@AnneCarpenter
Copy link

Neat! In a perfect world we'd have seen sensitive on one side and resistant on the other, so in light of reality, let's try not clustering based on the data but instead sort first be sens/resistant and then by model_split to see how it looks?

@gwaybio
Copy link
Member Author

gwaybio commented Nov 17, 2021

🤩 looks great @yhan8

To me, the goal was to observe the feature dynamics - to get a sense of how many different patterns, or modules, exist in the feature space. There definitely appear to be several! (which is cool to see, one might have expected only 1 or 2 big blocks of red and blue)

A secondary goal would be to visualize how those different feature modules differ across resistance status, so I think I agree that we want to see resistant and senstive on two sides. i.e. only use clustering for rows, not columns.

Yu, I think we also want to include TotalScore as a metadata column and not include it in the clustering.

@AnneCarpenter
Copy link

great catch noticing that feature in there! Yu, does it makes sense why it needs to be omitted?

@yhan8
Copy link
Contributor

yhan8 commented Nov 23, 2021

New figure below
Screen Shot 2021-11-23 at 12 28 04 PM
.

@yhan8
Copy link
Contributor

yhan8 commented Nov 23, 2021

Ah, that's great @shntnu, thanks for pointing to this. @yhan8 - check out cell 13 and the column estimate. You'll just need to track down where the file is saved.

For task 2, I printed out the top 10 BZ feature ranks among the 45 total BZ features.

Screen Shot 2021-11-23 at 12 58 08 PM

@gwaygenomics Could you explain a bit more on how the estimate score is calculated? What specific methods/algorithms were adopted? My understanding is that it does not involve any fancy ML algorithm etc., it is more of a simple pair-wise correlation metrics?

@AnneCarpenter
Copy link

I noticed something weird in the feature heatmap: we have both Cytoplasm_Correlation_K_AGP_DNA (top) and Cytoplasm_Correlation_K_DNA_AGP (at the bottom)… they appear to have exact opposite values but I'm confused why one of them wouldn't have been eliminated in the feature selection since they seem to be identical (well, exactly opposite, which is also confusing to me!) Once one of them was in the signature, the 2nd one could not have added any info unless I misunderstand how features were picked for the BZ score. Probably needs @gwaygenomics insight.

@AnneCarpenter
Copy link

Side note: I marked (roughly) those 10 features on the map and I upload it here. I've not digested this whole task/thread thoroughly but it's odd that the 10 features are so similar to each other (I'd thought the point was to find the 10 most-helpful features, again having removed redundant ones)
143074303-54c98467-b57a-4c94-a00b-b7e16fdc1e19

@bethac07
Copy link

So the K coefficients have a squared component in their denominator (see below), so if one image is very dim and one image is very bright (which I would expect would be what we'd typically have in the Cytoplasm for DNA and Actin), I think it's very feasible that the "inverse" features might not have a Pearson correlation >0.9, because the nonlinearities will be significant.

(https://github.com/CellProfiler/CellProfiler/blob/a90e17e4d258c6f3900238be0f828e0b4bd1b293/cellprofiler/modules/measurecolocalization.py#L512-L513 ; fi is "first image" and si is "second image)

K1 = (fi_thresh * si_thresh).sum() / (fi_thresh ** 2).sum()
K2 = (fi_thresh * si_thresh).sum() / (si_thresh ** 2).sum()

@bethac07
Copy link

This was bugging me, so I checked, and my intuition was right, they might be highly correlated but could definitely slip between 0.9

import numpy
import scipy.stats
import seaborn as sns

fi = numpy.random.normal(loc=0.5,scale=0.1,size=1000) #bright image
si = numpy.random.normal(loc=0.05,scale=0.01,size=1000) #dim image


k1 = (fi*si)/(fi**2)

k2 = (fi*si)/(si**2)

ax = sns.regplot(x=k1,y=k2)
ax.annotate("Pearson = %3f "%scipy.stats.pearsonr(k1,k2)[0], xy=(0.1, -8))

image

@yhan8
Copy link
Contributor

yhan8 commented Nov 24, 2021

New figure below
Screen Shot 2021-11-23 at 12 28 04 PM
.

In the total score metadata, the grey color scheme seems to be associated with the otherclone split. As far as now, this figure does not tell a sensible story for the manuscript, but if we do decide to include it in the manuscript later, this needs to be fixed. Possible reasons could be either that the total score did not get imported to the otherclone, or some R library color bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion and Notes Documenting ideas/discussions Experiments Tracking experimental questions, results, or analysis
Projects
None yet
Development

No branches or pull requests

5 participants