-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CellProfiler feature analysis #119
Comments
Can you please elaborate on Task 1? What are the rows & columns of the heatmap? For Task 2, I saw the task as just providing a rank ordered list where the rows are the ~40 bz singscore features and the values shown at the discriminative ability for the ~10 cell lines in the holdout set (not sure how to quantify - t test for WTs vs sensitives?) And repeat the list using the same features but quantifying ability to separate WTs vs sensitives for the training, test, validation (this would be a supplementary figure). I think you are saying the same thing but I don't know why you said to then "select the top discriminative features based on their ranks in the holdout set"; I think I was proposing just commenting on any particularly high or low scorers in the main text, not necessarily doing anything further with them. |
Task 1: Feature HeatmapsHeatmap of profiles with normalized feature values for only BZ signature features (can include all features in sup fig if we want). We should include a hierarchical clustering dendrogram for both rows and columns Rows: BZ signature features This is similar to the heatmaps we previously generated, except those are pairwise correlations of profiles. |
Yep, we're saying the same thing. "select the top features..." so we can report them in the main text/discussion
Sounds reasonable to me. There are a couple ways to do so - but I think we should keep this simple |
Ah, so columns here are cell lines?
|
@gwaygenomics To generate the heatmap, I am looking to load the files that contain the normalized values of BZ signatures, as well as the training, testing, validation, and the holdout sets. Do you know where I shall look for these files? I found |
I believe there are some functions (called Also, we are not looking to plot the "signature values" (or, maybe I don't know what you mean by that) - we are looking to plot the normalized feature values for ONLY the signature features across the cell lines. |
I generated heatmaps for training, testing, validation, holdout, new batch datasets. The signature value was loaded from file There are 45 selected BZ signatures in total which are displayed column wise (x axis), row wise are the profiles (y axis labels can be ignored for now, as they are just row wise index numbers from the metadata DataFrame). @gwaygenomics Greg, will you be able to help confirm these values are normalized? |
Is there a way to know which rows are which type of cell line (sensitive/resistant?) |
I think there are some interesting patterns, but the story will be more clear after I rank the features (Task 2 above). For the last image, it appears to be 'repetitive patterns', but I think it is because of the narrower data points, if you zoom in on the image, it is actually not repetitive. |
For task 2, I am trying to figure out a way to assign "importance scores" for all of the BZ features, for example, like random forest algorithm would assign a importance score for each selected feature. @gwaygenomics it appears that the BZ features were selected using a linear model? If so, I would assume there should be some kinds of coefficient scores for each feature and we could just rank the coefficient scores? We could also consider using T test to analyze the resistant and sensitive clone type classification accuracy scores, but how can I get a null distribution for each specific feature to do so? I am thinking I could use each feature to do the classification, and then permute the clone types randomly for 100 times and get 100 acc scores for each BZ feature? Is that the best approach though? Any inputs would be greatly appreciated! @shntnu |
I think you're on the right track with the first idea i.e. use either a model that has already been fit ( The question to ponder is – what do you want to do differently here that is not already done in https://github.com/broadinstitute/profiling-resistance-mechanisms/blob/master/3.resistance-signature/5.visualize-signature-features.ipynb? |
Ah, that's great @shntnu, thanks for pointing to this. @yhan8 - check out cell 13 and the column As far as the heatmaps:
|
I am trying to use Greg's code to accomplish task 1. I am able to reproduce a heatmap similar to the previously generated heatmaps that includes the new batch data (i.e., otherclone) using pairwise correlations. heatmap_bz_vs.all_bortezomibsignature.pdf However, the goal is to make the heatmap to have:
The logistic is simple, all BZ features are stored in
to accomplish this, but could someone help point to a clearer direction (I am not very familiar with R and I've been trying to solve this for quite sometime now)? |
I don't think we want a correlation matrix - we want profiles similar to as you have posted in #119 (comment), except we want to use I have found the |
so maybe it is just replacing: # What you have above
correlation_matrix_df <- t(subset_data_df %>% dplyr::select(!!!selected_features)) %>% cor()
# Maybe we want this:
profile_df <- subset_data_df %>% dplyr::select(!!!selected_features) Would inputing |
@yhan8 – I peeked into this after we chatted about it the day before. The docs are great as @gwaygenomics says. In specific, I found this example https://jokergoo.github.io/ComplexHeatmap-reference/book/more-examples.html#visualize-cell-heterogeneity-from-single-cell-rnaseq that has both, a clustering of the similarity matrix as well as a clustering of the profile matrix. You can peek into that to see how they did it, but my suggestion would be to start with the small example that IIRC you said already works and build up from there, moving towards the figure you want (take your time to understand it and get comfortable – this looks like a very useful library!) This is the simple example I tested right now library("ComplexHeatmap")
library(tidyverse)
x <-
read_csv(
"https://media.githubusercontent.com/media/broadinstitute/profiling-resistance-mechanisms/master/0.generate-profiles/profiles/2019_11_11_Batch4/WTmut04hTh/WTmut04hTh_normalized_feature_selected.csv.gz"
)
Heatmap(
x %>% select(-matches("Metadata")) %>% select(1:10) %>% as.matrix() %>% t(),
top_annotation = HeatmapAnnotation(
clone_number = x %>% pull(Metadata_clone_number),
treatment = x %>% pull(Metadata_treatment)
)
) |
Exciting! Can you add another metadata row to indicate which samples are sensitive and which are resistant? I think it could replace the clone number color scheme - the colors are reused so much one cannot really see what's what anyway. Folks (@gwaygenomics , I guess) - was our goal here to keep each clone's replicates as separate or would this be easier to comprehend if we merge replicates into one column per clone? If we keep them separate woudl it make sense to NOT cluster the data but instead put it in the exact order of the list? (IIRC, the goal here is to get a visual sense of which features are most robust across clones or see any patterns that might emerge?) |
As @AnneCarpenter requested :) |
Neat! In a perfect world we'd have seen sensitive on one side and resistant on the other, so in light of reality, let's try not clustering based on the data but instead sort first be sens/resistant and then by model_split to see how it looks? |
🤩 looks great @yhan8 To me, the goal was to observe the feature dynamics - to get a sense of how many different patterns, or modules, exist in the feature space. There definitely appear to be several! (which is cool to see, one might have expected only 1 or 2 big blocks of red and blue) A secondary goal would be to visualize how those different feature modules differ across resistance status, so I think I agree that we want to see resistant and senstive on two sides. i.e. only use clustering for rows, not columns. Yu, I think we also want to include |
great catch noticing that feature in there! Yu, does it makes sense why it needs to be omitted? |
For task 2, I printed out the top 10 BZ feature ranks among the 45 total BZ features. @gwaygenomics Could you explain a bit more on how the |
I noticed something weird in the feature heatmap: we have both Cytoplasm_Correlation_K_AGP_DNA (top) and Cytoplasm_Correlation_K_DNA_AGP (at the bottom)… they appear to have exact opposite values but I'm confused why one of them wouldn't have been eliminated in the feature selection since they seem to be identical (well, exactly opposite, which is also confusing to me!) Once one of them was in the signature, the 2nd one could not have added any info unless I misunderstand how features were picked for the BZ score. Probably needs @gwaygenomics insight. |
So the K coefficients have a squared component in their denominator (see below), so if one image is very dim and one image is very bright (which I would expect would be what we'd typically have in the Cytoplasm for DNA and Actin), I think it's very feasible that the "inverse" features might not have a Pearson correlation >0.9, because the nonlinearities will be significant. (https://github.com/CellProfiler/CellProfiler/blob/a90e17e4d258c6f3900238be0f828e0b4bd1b293/cellprofiler/modules/measurecolocalization.py#L512-L513 ; fi is "first image" and si is "second image)
|
This was bugging me, so I checked, and my intuition was right, they might be highly correlated but could definitely slip between 0.9
|
In the |
In a recent meeting, the team discussed a followup analysis for interrogating the signature features. We have already performed at least one feature analysis that defines all of the bz signature features (all features in the analysis) https://github.com/broadinstitute/profiling-resistance-mechanisms/blob/master/3.resistance-signature/5.visualize-signature-features.ipynb
Task 1 We'll also likely want to generate feature heatmaps (in a similar way that we define profile heatmaps) for only the bz signature features
Task 2 The final feature analysis ranks all features on their ability to separate the two classes (resistant vs. sensitive). We'll select the top discriminative features based on their ranks in the holdout set, but we should also perform this ranking procedure using all sets (training, test, validation, etc.).
The three above efforts nicely narrow down our search space from 1) all features, 2) bz features, 3) top bz features.
We can use this issue to elaborate on our approaches to form the heatmaps, to perform the ranking, and to determine effective visualizations.
The text was updated successfully, but these errors were encountered: