Clustering

Kane edited this page Jul 22, 2022 · 11 revisions

Clustering is broadly accepted as an effective way to combine single cells into groups that represent some shared biological process, be that a cell type or a transcriptomic state (e.g. cell cycle phase or IFN response). Default clustering pipelines are usually sufficient to separate broad cell types, but they leave some open questions and room for improvement.

How Many Clusters?

  • This is usually the first question asked by analysts and non-specialists alike when presented with scRNAseq clusters.
  • Ultimately there is no right answer. scRNAseq clustering can determine highly granular differences between cell types, but is limited by the number of cells available, the true underlying biological heterogeneity, and the signal-to-noise ratio.
  • A more useful question might be "how many clusters do I need?": select a level of heterogeneity that allows the desired downstream analysis (e.g. DE or DA).
  • Biology can also guide the choice of an appropriate number of clusters, e.g. by selecting enough clusters to separate CD4 and CD8 T cells.

Varying the number of clusters

  • Clustering with the Leiden or Louvain algorithms allows varying a clustering resolution parameter to derive more or fewer clusters.
  • Fewer clusters can be useful to determine the broad composition of a dataset. Manually subclustering (see below) a small number of clusters can be an alternative to clustering all cells.
  • Conversely, merging smaller clusters into broader meta-clusters can also curtail excessive clustering if two clusters are composed of very similar cells
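
To illustrate why the resolution changes the number of clusters, here is a minimal sketch that assumes the clustering objective is modularity with a resolution multiplier (one common formulation; the exact objective depends on the tool and its settings). The toy graph and the `modularity` helper are illustrative and not taken from Seurat or scanpy.

```python
import numpy as np

def modularity(adj, labels, resolution=1.0):
    """Modularity of a partition, with a resolution multiplier on the null model."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degree = adj.sum(axis=1)                      # k_i
    two_m = adj.sum()                             # 2m for an undirected graph
    same = labels[:, None] == labels[None, :]     # 1 where i and j share a cluster
    expected = resolution * np.outer(degree, degree) / two_m
    return float(((adj - expected) * same).sum() / two_m)

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by a single edge (2-3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

one_cluster = [0, 0, 0, 0, 0, 0]
two_clusters = [0, 0, 0, 1, 1, 1]

for gamma in (0.1, 1.0):
    q_one = modularity(adj, one_cluster, gamma)
    q_two = modularity(adj, two_clusters, gamma)
    print(f"resolution={gamma}: one cluster Q={q_one:.3f}, two clusters Q={q_two:.3f}")
```

On this toy graph the single-cluster partition scores higher at the low resolution and the two-cluster partition scores higher at the high resolution, which is the behaviour the resolution parameter exposes.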

Similar clusters

  • Over-clustering similar cells into multiple clusters can occur when requesting a high number of clusters.
  • A good way to see whether clusters are transcriptionally similar is to plot a marker gene heatmap and look for genes or expression trends that overlap between clusters, or to compute the pairwise transcriptional similarity between clusters.
  • Small clusters with similar key markers or a high degree of transcriptional similarity could then be manually merged.
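
A pairwise similarity check can be sketched as follows, assuming similarity is measured as the Pearson correlation between mean expression profiles of clusters. The helper name `cluster_similarity`, the toy matrix, and the 0.9 threshold are all hypothetical; a real analysis would use normalised expression over HVGs or marker genes.

```python
import numpy as np

def cluster_similarity(expr, labels):
    """Pearson correlation between the mean expression profiles of each cluster."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    means = np.vstack([expr[labels == c].mean(axis=0) for c in clusters])
    return clusters, np.corrcoef(means)

# Toy cells x genes matrix (made-up values): clusters A and B look alike, C does not.
expr = np.array([
    [5, 0, 1, 0],
    [5, 0, 1, 0],
    [4, 1, 1, 0],
    [5, 0, 1, 0],
    [0, 5, 0, 3],
    [0, 5, 0, 3],
])
labels = np.array(["A", "A", "B", "B", "C", "C"])

clusters, sim = cluster_similarity(expr, labels)
threshold = 0.9  # hypothetical cut-off, to be chosen per dataset
candidates = [
    (str(clusters[i]), str(clusters[j]))
    for i in range(len(clusters))
    for j in range(i + 1, len(clusters))
    if sim[i, j] > threshold
]
print(candidates)  # pairs of clusters that might be worth merging
```

Pairs flagged this way are only candidates; the marker heatmap should still be inspected before merging.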

Machine Learning Clustering Metrics

  • The ML field has several metrics to assess the quality of a clustering, e.g. the silhouette score.
  • These are discussed here
  • These metrics may have some usefulness, but are not commonly used for scRNAseq.
  • Some alternatives specific to scRNAseq have been developed, e.g. ROGUE.
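
For reference, the silhouette score compares each cell's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). Below is a from-scratch sketch with NumPy, assuming Euclidean distance on a small embedding; the function name `mean_silhouette` is ours, and libraries such as scikit-learn provide an equivalent `silhouette_score`.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette over all points (Euclidean distance, at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters get a score of 0 by convention
        a = dist[i, same].sum() / (n_same - 1)   # excludes the point itself
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two well-separated pairs of points (toy coordinates).
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(mean_silhouette(X, [0, 0, 1, 1]))  # labels that follow the structure
print(mean_silhouette(X, [0, 1, 0, 1]))  # labels that cut across it
```

The first labelling scores close to 1 and the second scores below 0, which is the sense in which the metric rewards compact, well-separated clusters.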

The number of clusters can also affect whether to adopt differential expression (DE) or differential abundance (DA) analysis (which, in scRNAseq, are two sides of the same coin).

  • Assume that in IFN+ conditions a subset of B cells becomes activated. If you under-cluster all B cells into a single cluster, cluster-level DA will not show this change, whereas DE will (as IFN-stimulated genes will be more highly expressed in the IFN condition). Conversely, if you over-cluster B cells into two clusters (one composed of the IFN-activated cells, the other not), then DE within these clusters will show very little, whereas DA will highlight the IFN-activated cluster in the IFN condition.
  • Again, there is no real right answer here.
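
The B cell example above can be made concrete with simulated numbers. Everything in this sketch is a made-up toy: the cell counts, the expression values of a hypothetical IFN-stimulated gene, and the use of simple differences in proportion and mean in place of proper DA and DE tests.

```python
import numpy as np

# Simulated toy data, not real measurements: 200 cells per condition,
# half of them B cells. In the IFN condition, 40 B cells are activated.
condition = np.array(["ctrl"] * 200 + ["ifn"] * 200)
cell_type = np.array((["B"] * 100 + ["other"] * 100) * 2)
activated = np.zeros(400, dtype=bool)
activated[200:240] = True               # activated B cells, IFN condition only
isg = np.where(activated, 5.0, 1.0)     # hypothetical IFN-stimulated gene

def abundance(cluster_mask, cond):
    """Fraction of a condition's cells that fall in the cluster."""
    in_cond = condition == cond
    return (cluster_mask & in_cond).sum() / in_cond.sum()

def mean_expression(cluster_mask, cond):
    """Mean ISG expression of the cluster's cells in one condition."""
    return isg[cluster_mask & (condition == cond)].mean()

# Under-clustered: all B cells form one cluster.
b_cells = cell_type == "B"
da_under = abundance(b_cells, "ifn") - abundance(b_cells, "ctrl")
de_under = mean_expression(b_cells, "ifn") - mean_expression(b_cells, "ctrl")

# Over-clustered: resting and activated B cells are separate clusters.
resting_b = b_cells & ~activated
activated_b = b_cells & activated
da_over = abundance(activated_b, "ifn") - abundance(activated_b, "ctrl")
de_over = mean_expression(resting_b, "ifn") - mean_expression(resting_b, "ctrl")

print(f"one B cluster:  DA={da_under:.2f}, DE={de_under:.2f}")
print(f"two B clusters: DA={da_over:.2f}, DE={de_over:.2f}")
```

With one B cell cluster the abundance difference is zero and the expression difference is not; with two clusters the situation reverses. Note that the activated cluster has no control cells at all in this toy, so DE cannot even be computed within it.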

Subclustering

  • As clustering requires calculating highly variable genes (HVGs) within a group of cells, it can be beneficial to re-calculate these genes within a specific cell type (versus all types) prior to clustering it.
  • e.g. T cell clusters derived from HVGs calculated on whole PBMC will capture fewer of the nuanced differences between T cells than clusters derived from HVGs re-calculated on T cells alone.
  • Multiple rounds of sub-clustering cell types can provide better insight into the biological phenotypes in a dataset than a single round of clustering all cells at once.
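
The effect of re-calculating HVGs on a subset can be shown with a toy matrix. This sketch ranks genes by raw variance, which is a simplification we assume for illustration (real HVG methods model the mean-variance relationship); the gene names and values are placeholders.

```python
import numpy as np

def top_variable_genes(expr, gene_names, n_top=1):
    """Rank genes by raw variance across cells and return the top n_top names."""
    variances = np.asarray(expr, dtype=float).var(axis=0)
    order = np.argsort(-variances)[:n_top]
    return [gene_names[i] for i in order]

gene_names = ["lineage_gene", "subset_gene", "flat_gene"]
expr = np.array([
    [10, 3, 5],   # T cell, subset a
    [10, 3, 5],   # T cell, subset a
    [10, 1, 5],   # T cell, subset b
    [10, 1, 5],   # T cell, subset b
    [0, 2, 5],    # B cell
    [0, 2, 5],    # B cell
    [0, 2, 5],    # B cell
    [0, 2, 5],    # B cell
])
is_t_cell = np.array([True] * 4 + [False] * 4)

hvg_all = top_variable_genes(expr, gene_names)
hvg_t = top_variable_genes(expr[is_t_cell], gene_names)
print("all cells:", hvg_all)
print("T cells only:", hvg_t)
```

Across all cells the lineage gene dominates the ranking, while within T cells it has no variance and the gene separating the two T cell subsets comes out on top.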

HVG Input (feature selection)

  • Several highly expressed gene groups can contribute to the HVGs used as input for dimensionality reduction, e.g. mitochondrial, ribosomal, or immunoglobulin genes.
  • While this can reflect true biology (e.g. naive T cells possess an over-abundance of ribosomes), for the purposes of clustering on functional genes at the phenotype level, these genes can sometimes be best ignored.
  • In some contexts, removing gene groups prior to clustering is essential to recover the biology of interest. To cluster αβT cells by phenotype rather than by TCR clonotype, removing selected TCR genes (TRA/TRB/TRG) is required (see the T cell page for more info).
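
One way to drop such gene groups from an HVG list is by matching gene symbols. The sketch below assumes human gene symbols, and the regular expressions are our own illustrative guesses rather than a vetted list: they only target variable and joining segments for immunoglobulin and TCR genes, and the ribosomal pattern can catch a few unrelated genes, so the matches should be checked against your annotation.

```python
import re

# Assumed patterns for human gene symbols (illustrative, not exhaustive).
EXCLUDE_PATTERNS = {
    "mitochondrial": r"^MT-",
    "ribosomal": r"^RP[SL]\d",
    "immunoglobulin": r"^IG[HKL][VJ]\d",
    "tcr": r"^TR[ABG][VJ]\d",
}

def filter_hvgs(genes, groups=("mitochondrial", "ribosomal", "immunoglobulin", "tcr")):
    """Return the genes that do not match any of the selected exclusion patterns."""
    pattern = re.compile("|".join(EXCLUDE_PATTERNS[g] for g in groups))
    return [g for g in genes if not pattern.search(g)]

hvgs = ["MT-CO1", "RPS12", "RPL13", "IGHV3-23", "TRAV1-2",
        "TRBV20-1", "TRGV9", "CD3E", "TRAF1", "GZMB"]

kept = filter_hvgs(hvgs)
print(kept)

# Remove only TCR segment genes, e.g. before clustering T cells by phenotype.
kept_no_tcr = filter_hvgs(hvgs, groups=("tcr",))
print(kept_no_tcr)
```

Anchoring on the segment letter and a digit keeps genes such as TRAF1 that merely share a prefix with TCR loci.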

Neighborhood Graph Calculation

  • The Leiden and Louvain clustering algorithms and UMAP visualisation require calculating a neighborhood graph of cells (Seurat's FindNeighbors and scanpy's pp.neighbors).
  • Reducing the size of the neighbourhood can produce a more local and granular clustering/UMAP, whereas increasing the size of the neighbourhood produces a more global clustering/UMAP.
  • Seurat: k.param: k for the k-nearest neighbor algorithm
  • scanpy: n_neighbors: size of local neighborhood
  • One caveat is that more granular clustering can enhance subtle differences between samples due to minor batch effects (e.g. samples processed on different days or sharing a slight but distinct ambient RNA profile). The choice of enhancing biological resolution at the cost of increasing batch effect is left to you.
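
The local versus global effect of the neighbourhood size can be seen on a toy example. This is a simplified sketch that builds an unweighted k-nearest-neighbour graph with NumPy and counts its connected components; it is not how Seurat or scanpy construct their (weighted) graphs, and `knn_components` and the 1D coordinates are made up for illustration.

```python
import numpy as np

def knn_components(points, k):
    """Number of connected components in a symmetrised k-nearest-neighbour graph."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                # a cell is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest cells

    parent = list(range(n))                       # union-find over cells

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbours[i]:
            parent[find(i)] = find(int(j))        # join i with each of its neighbours
    return len({find(i) for i in range(n)})

# Two groups of three cells on a line, separated by a gap.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
for k in (2, 3):
    print(f"k={k}: {knn_components(points, k)} connected component(s)")
```

With k=2 every cell's neighbours lie within its own group and the graph stays in two pieces; with k=3 each cell must reach across the gap and the groups become connected, giving a more global structure.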

UMAP is for visualisation, not clustering

  • UMAP and tSNE are strictly ways to visualise data.
  • Differences between cells and clusters are always better quantified by clustering algorithms than by visualisations, and care must be taken not to over-interpret biological similarity from the UMAP alone.
  • Major differences between cell lineages, and the presence of stressed cells, are however fairly well captured in UMAP or tSNE space.

Kane Foster (22-07-2022)