-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Number of initial clusters influences number of significant merged clusters #1
Comments
Hey @mihem, Thanks alot for being the first person to try this out! (That I know of). And thanks alot for details with regards to getting it ported over to R. I will look into writing a tutorial based on utilising reticulate within R itself to run Cytocipher, so don't need to convert and save to .csv. It would still require transferring over some results from the AnnData object to your Seurat object however, but it's something which should not take me long to setup. I'll update this thread when I try in the next few days. Downsampling approach makes complete sense, I did the same for Tabula Sapiens and got very comparable results to using the full dataset with MUCH faster run time. I'm happy to see those diagnostics plots are working well; can see when it's doing the right versus wrong thing quite quickly! There are some important recommendations on correct run usage in the Pancreas analysis tutorial, in section "Running Cytocipher to determine significantly different clusters". One point from here I'll highlight:
More specifically to evaluate initial clusters:
For the marker genes issue, I should update that warning statement, it is not a problem at all to have full marker gene overlap (in-fact, expected in-cases of overclusters!). Instead, could you try this after updating the initial clusters with the recommendations above?: So what squash_exception is doing works in the cc.tl.code_score, from that documentation:
These two bits of documentation need to be updated, the default should be squash_exception=True. When I was developing it, I had it as True because I was interested in how it would perform for that edge case, and found it simply means that if there are no marker genes which differentiate two clusters, it then only effectively declares them different if there is a difference in magnitude of expression for those marker genes between the respective cluster pairs tested. For your questions:
Hopefully this helps, I am adding a few items to my TODO list:
Let me know if this resolves the issue! Best, |
Hey @BradBalderson , thanks a lot for your extensive and very informative response. Sure, it's great to be the first user apparently. I think this package has tons of potential users, especially if you make it easily accessible also for R (especially Seurat) users (as for example harmony did that https://portals.broadinstitute.org/harmony/SeuratV3.html). I am actually okay with exporting the cluster_merged column and importing it into Seurat. Very fast and robust. For Seurat users it would be of course nice if you don't have to leave R and can create and save your plots there, as you are used to. In fact, in the last few days I've been working on using the countsplit approach to find the optimal number of clusters https://arxiv.org/abs/2307.12985 . This worked quite well with the pbmc 10x dataset anna-neufeld/countsplit#8. The large advantage of Cytocipher is that it merges clusters and does not only rely on Leiden Clustering as you show very nicely in Figure 4. But it is a very strong statistical idea I think, so maybe one could multifold negative binominal countsplit with the merged clusteres (based on Cytocipher) and see that this minimized the within cluster sum of squares error? Regarding the questions from the previous post:
I used the downsampled Seurat dataset with downsampling it to 1000. So the clusters are as shown previously:
I then overclustered with a Leiden resolution of 3.0. Then enriched heatmap looked like this. The overclustering doesn't look as impressive here as in your picture, but I think it's notheless sufficiently overclustered? Side note: I am unable to save the enriched heatmap. I use R 99% of my time, so it could be my incompetence. But e.g. the
So the following works:
While:
So figure.savefig does not work. I merged the clustered with
The enriched heatmap of the merged clusters looks like this: The umap looks unfortunately very weid again. Very unplausible merging, both visually and biologically. Do you have any idea what the problem could be? I also tried merge_clusteres with enrich_method "coexpr". This looked very different, much better than the other one but pretty sure overclustered: As a second approach, I used the randomly downsampled dataset, which has cluter sizes like this (so some very small one like B, but well defined based on marker genes and biologically of course different from adjacent macrophage cluster.
When calling the merge_cluster function (with quash_exception True), this lead to the following error:
I had to increase marker_padj_cutoff and p_cut a lot, plus reducing t_cutoff a lot and then the function finally worked:
However, when I wanted to run further function like umap with the cluster_merge i got the following error:
I will send you a link the Seurat data via mail and hope you find time to have a look at this yourself, which is probably easier. Also I just realized: I do all the pre-processing in Seurat (using log normalization). This should not be a problem right? Thanks a lot! |
@BradBalderson Congratulation to this package. I think this is a very much needed tool for the sc community.
I can confirm that it scales much better to large dataset than scSHC. And the visualization you provide are nice and helpful.
I am a Seurat user but conversion from Seurat to AnnData via
SeuratDisk
https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html has worked fine for me here (at least with Seurat v4). Converting AnnData to Seurat was more problematic. This didn't work for me withSeuratDisk
. sceasy https://github.com/cellgeni/sceasy worked, but lacked some data. anndata2ri https://github.com/theislab/anndata2ri worked for me previously but was problematic here because r2py didn't detect my R paths. So in the end I just saved the merged cluster as a .csv file and added this to my Seurat metadata, which is easy, fast and sufficient for the downstream analysis I think.However, I had some strange results depending on the number of initital clusters.
To reduce the computation speed, I downsampled my dataset of initially ~170k Downsampling via Seurat's internal subset function actually downsamples per cluster, so I downsamples to 1000 per cluster, which resulted in ~17k clusters with 19 clusters (some had less than 1000 cells so were not downsamples at all).
original dataset
downsampled dataset
These 19 cluster make sense visually and biologically, althoughone could argue that a few are overclustered.
The. resulting merged cluster with a p cutoff 0.05 looked reasonable in most cases (e.g. merge EC1 and EC2 ), but overly conservative in other cases (e.g. VSMC_PC1 and EpiC are very distinct cluters).
The volcano plot looked reasonable to me, so not sure it would make sense to increase the p value.
Based on your tutorial and manuscript jupyter notebooks, I thought that starting with overclustered dataset would make sense. I clustered the downsamples dataset with a resolution of 0.8 with Seurat's default Louvain algorithm. Seurat recommends 0.4-1.2, so actually it's not even that high. But it does look pretty overclustered in the UMAP to me.
However, this resulted in problems when using the default parameters
So i relaxed the parameters:
This worked, but lead to another issue:
So for merge i then used more markers
This worked, but the resulting merged UMAP clusters looked very strange.
I then though that maybe the results would look better with my entire dataset of 170k cells.
However, here with the same clustering as in the downsamples dataset (so not overclustered, resolution 0.4, so lower limit of what Seurat recommends), I encountered the
No marker genes for ..
issue.So my two main questions are:
Thank you!
The text was updated successfully, but these errors were encountered: