Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exercises #18

Merged
merged 6 commits into from
Aug 12, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 101 additions & 20 deletions episodes/cell_type_annotation.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@ editor_options:
---

::: questions
- How to identify groups of cells with similar expression profiles?
- How to identify genes that drive separation between these groups of cells?
- How can we identify groups of cells with similar expression profiles?
- How can we identify genes that drive separation between these groups of cells?
- How to leverage reference datasets and known marker genes for the cell type annotation of new datasets?
:::

Expand Down Expand Up @@ -38,12 +38,14 @@ library(scran)

## Data retrieval

We'll be using the same set of WT chimeric mouse embryo data:

```{r data, message = FALSE}
sce <- WTChimeraData(samples = 5, type = "processed")
sce
```

To speed up the computations, we subsample the dataset to 1,000 cells.
To speed up the computations, we take a random subset of 1,000 cells.

```{r}
set.seed(123)
Expand All @@ -53,6 +55,8 @@ sce <- sce[,ind]

## Preprocessing

The SCE object needs to contain log-normalized expression counts as well as PCA coordinates in the reduced dimensions, so we compute those here:

```{r preproc, warning = FALSE}
sce <- logNormCounts(sce)
sce <- runPCA(sce)
Expand Down Expand Up @@ -89,24 +93,36 @@ algorithm](https://doi.org/10.1088/1742-5468/2008/10/P10008) for
community detection. All calculations are performed using the top PCs to
take advantage of data compression and denoising. This function returns
a vector containing cluster assignments for each cell in our
`SingleCellExperiment` object.
`SingleCellExperiment` object. We use the `colLabels()` function to assign the
cluster labels as a factor in the column data.

```{r cluster}
colLabels(sce) <- clusterCells(sce, use.dimred = "PCA",
BLUSPARAM = NNGraphParam(cluster.fun = "louvain"))

table(colLabels(sce))
```

We assign the cluster assignments back into our `SingleCellExperiment`
object as a `factor` in the column metadata. This allows us to
conveniently visualize the distribution of clusters in eg. a *t*-SNE or
a UMAP.
We can now overlay the cluster labels as color on a UMAP plot:

```{r cluster-viz}
sce <- runUMAP(sce, dimred = "PCA")

plotReducedDim(sce, "UMAP", color_by = "label")
```

:::: challenge

Our clusters look semi-reasonable, but what if we wanted to make them less granular? Look at the help documentation for `?clusterCells` and `?NNGraphParam` to find out what we'd need to change to get fewer, larger clusters.

::: solution

We see in the help documentation for `?clusterCells` that all of the clustering algorithm details are handled through the `BLUSPARAM` argument, which needs to provide a `BlusterParam` object (of which `NNGraphParam` is a sub-class). Each type of clustering algorithm will have some sort of hyper-parameter that controls the granularity of the output clusters. Looking at `?NNGraphParam` specifically, we see an argument called `k` which is described as "An integer scalar specifying the number of nearest neighbors to consider during graph construction." If the clustering process has to connect larger sets of neighbors, the graph will tend to be cut into larger groups, resulting in less granular clusters. Try the two code blocks above once more with `k = 20`. Given their visual differences, do you think one set of clusters is "right" and the other is "wrong"?
:::

::::


## Marker gene detection

To interpret clustering results as obtained in the previous section, we
Expand All @@ -121,7 +137,7 @@ testing for differential expression between clusters. If a gene is
strongly DE between clusters, it is likely to have driven the separation
of cells in the clustering algorithm.

Here, we perform a Wilcoxon rank sum test against a log2 fold change
Here, we use `findMarkers()` to perform a Wilcoxon rank sum test against a log2 fold change
threshold of 1, focusing on up-regulated (positive) markers in one
cluster when compared to another cluster.

Expand All @@ -131,6 +147,8 @@ markers <- findMarkers(sce, test.type = "wilcox", direction = "up", lfc = 1)
markers
```

<!-- TODO change this ^ to scoreMarkers() -->

The resulting object contains a sorted marker gene list for each
cluster, in which the top genes are those that contribute the most to
the separation of that cluster from mall other clusters.
Expand Down Expand Up @@ -159,6 +177,22 @@ top.markers <- head(rownames(markers[[1]]))
plotExpression(sce, features = top.markers, x = "label", color_by = "label")
```

Clearly, not every marker gene distinguishes cluster 1 from every other cluster. However, with a combination of multiple marker genes it's possible to clearly identify gene patterns that are unique to cluster 1. It's sort of like the 20 questions game - with answers to the right questions about a cell (e.g. "Do you highly express Ptn?"), you can clearly identify what cluster it falls in.

:::: challenge

Why do you think marker genes are found by aggregating pairwise comparisons rather than iteratively comparing each cluster to all other clusters?

::: solution

One important reason why is because averages over all other clusters can be sensitive to the cell type composition. If a rare cell type shows up in one sample, the most discriminative marker genes found in this way could be very different from those found in another sample where the rare cell type is absent.

Generally, it's good to keep in mind that the concept of "everything else" is not a stable basis for comparison. Read that sentence again, because its a subtle but broadly applicable point. Think about it and you can probably identify analogous issues in fields outside of single-cell analysis. It frequently comes up when comparisons between multiple categories are involved.

:::
::::


## Cell type annotation

The most challenging task in scRNA-seq data analysis is arguably the
Expand Down Expand Up @@ -227,6 +261,8 @@ ind <- sample(ncol(ref), 1000)
ref <- ref[,ind]
```

You can see we have an assortment of different cell types in the reference (with varying frequency):

```{r ref-celltypes}
tab <- sort(table(ref$celltype), decreasing = TRUE)
tab
Expand Down Expand Up @@ -270,29 +306,35 @@ sce.mat <- as.matrix(assay(sce, "logcounts"))
ref.mat <- as.matrix(assay(ref, "logcounts"))
```

Finally, run SingleR with the query and reference datasets:

```{r singler}
res <- SingleR(test = sce.mat, ref = ref.mat, labels = ref$celltype)
res <- SingleR(test = sce.mat,
ref = ref.mat,
labels = ref$celltype)
res
```

We inspect the results using a heatmap of the per-cell and label scores.
Ideally, each cell should exhibit a high score in one label relative to
all of the others, indicating that the assignment to that label was
unambiguous. This is largely the case for mesenchyme and endothelial
cells, whereas we see expectedly more ambiguity between the two
erythroid cell populations.
unambiguous.

```{r score-heat, fig.width = 10, fig.height = 10}
plotScoreHeatmap(res)
```

We also compare the cell type assignments with the clustering results to
determine the identity of each cluster. Here, several cell type classes
are nested within the same cluster, indicating that these clusters are
composed of several transcriptomically similar cell populations (such as
cluster 4 and 6). On the other hand, there are also instances where we
have several clusters for the same cell type, indicating that the
clustering represents finer subdivisions within these cell types.
We obtained fairly unambiguous predictions for mesenchyme and endothelial
cells, whereas we see expectedly more ambiguity between the two
erythroid cell populations.

We can also compare the cell type assignments with the unsupervised clustering
results to determine the identity of each cluster. Here, several cell type
classes are nested within the same cluster, indicating that these clusters are
composed of several transcriptomically similar cell populations. On the other
hand, there are also instances where we have several clusters for the same cell
type, indicating that the clustering represents finer subdivisions within these
cell types.

```{r, fig.width = 10, fig.height = 10}
library(pheatmap)
Expand All @@ -311,6 +353,32 @@ tab <- table(res$pruned.labels, sce$celltype.mapped)
pheatmap(log2(tab + 10), color = colorRampPalette(c("white", "blue"))(101))
```

:::: challenge

SingleR can be computationally expensive. How do you set it to run in parallel?

::: solution

Use `BiocParallel` and the `BPPARAM` argument! This example will set it to use four cores on your laptop, but you can also configure BiocParallel to use cluster jobs.

```{r eval=FALSE, echo = TRUE}

library(BiocParallel)

my_bpparam = MulticoreParam(workers = 4)

res <- SingleR(test = sce.mat,
ref = ref.mat,
labels = ref$celltype,
BPPARAM = my_bpparam)
```

`BiocParallel` is the most common way to enable parallel computation in Bioconductor packages, so you can expect to see it elsewhere outside of SingleR.

:::

::::

### Assigning cell labels from gene sets

A related strategy is to explicitly identify sets of marker genes that
Expand Down Expand Up @@ -403,6 +471,19 @@ mixture, and the grey curve represents a fitted normal distribution.
Vertical lines represent threshold estimates corresponding to each
estimate of the distribution.

:::: challenge

The diagnostics don't look so good for some of the examples here. Which ones? Why?

::: solution

The example that jumps out most strongly to the eye is ExE endoderm, which doesn't show clear separate modes. Simultaneously, Endothelium seems to have three or four modes.

Remember, this is an exploratory diagnostic, not the final word! At this point it'd be good to engage in some critical inspection of the results. Maybe we don't have enough / the best marker genes. In this particular case, the fact that we subsetted the reference set to 1000 cells probably didn't help.
:::

::::

## Session Info

```{r sessionInfo}
Expand Down
Loading
Loading