Quality Control
A raw 10x cell x gene matrix output by cellranger is not composed solely of high-quality cells that can be taken forward into clustering. It can also include empty droplets, low-quality cells, dead, dying or stressed cells, doublets (multiple cells encapsulated in a single droplet), and other undesirable cells (e.g. red blood cells, whose minimal transcriptome is fairly uninformative).
Several strategies can augment default QC:
- The cellranger `filtered_feature_bc_matrix` output is filtered to remove empty droplets. However, some analysts prefer to undertake this step themselves with other tools, e.g. DropletUtils.
- A good first step to assess the overall quality of a dataset is to perform an initial QC clustering on all cells without using any QC thresholds.
- This will show the overall quality of the dataset (e.g. the proportion of stressed cells, which might point to a single poorer-quality sample), and can guide QC threshold selection.
- This approach would also allow removal of a cluster of low-quality cells versus relying on per-cell thresholds.
- The arbitrary QC thresholds provided in scRNAseq tutorials are intended to remove stressed cells (identified via a high % mitochondrial expression), doublets (via very high counts or n genes), or residual non-cellular droplets (via minimum counts).
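The threshold-based filtering described in the last point can be sketched as follows. This is a minimal illustration on a toy dense count matrix, not a recommended pipeline: the function names (`qc_metrics`, `passes_qc`), the count cut-offs and the `MT-` gene-name prefix are assumptions made for the example, and real analyses would normally compute these metrics with a dedicated single-cell toolkit.

```python
# Toy sketch: per-cell QC metrics and static thresholds.
# Function names and cut-offs are illustrative assumptions, not fixed rules.

def qc_metrics(counts, genes, mito_prefix="MT-"):
    """Return total counts, genes detected and % mitochondrial counts per cell."""
    # Upper-casing the gene name lets the same prefix match mouse-style "mt-" names.
    mito_idx = [i for i, g in enumerate(genes) if g.upper().startswith(mito_prefix)]
    metrics = []
    for row in counts:
        total = sum(row)
        n_genes = sum(1 for c in row if c > 0)
        mito = sum(row[i] for i in mito_idx)
        pct_mito = 100.0 * mito / total if total else 0.0
        metrics.append(
            {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}
        )
    return metrics


def passes_qc(m, min_counts=500, max_counts=50000, max_pct_mito=10.0):
    """Apply minimum count, maximum count and % mitochondrial thresholds."""
    return (
        min_counts <= m["total_counts"] <= max_counts
        and m["pct_mito"] < max_pct_mito
    )


genes = ["MT-CO1", "MT-ND1", "CD3E", "MS4A1", "LYZ"]
counts = [
    [20, 10, 400, 0, 270],      # 700 counts, about 4.3% mitochondrial
    [300, 200, 250, 150, 100],  # 1000 counts, 50% mitochondrial (stressed)
    [1, 0, 5, 0, 4],            # 10 counts (likely a non-cellular droplet)
]

metrics = qc_metrics(counts, genes)
keep = [passes_qc(m) for m in metrics]
```

Here only the first cell is kept; the second fails the mitochondrial threshold and the third fails the minimum count threshold.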
Implementing this approach has some caveats:
- While stressed cells can usually be removed using the % mitochondrial expression (<10% for human, <5% for mouse), a large proportion of cells above the threshold suggests overall low-quality data, with implications for analysis beyond this single QC step (e.g. dead cells lead to more ambient expression). A static threshold may also fail to remove cells just under it (e.g. 9.8% mito), whereas an initial QC clustering can remove entire clusters of high-MT cells.
- Doublets can possess a similar-sized transcriptome to singlets, and are more reliably removed with dedicated doublet removal tools (see below). Additionally, some cells biologically have much larger transcriptomes (e.g. some tumour cells), and may be erroneously removed by a maximum count threshold.
- A minimum count threshold will remove residual non-cellular droplets and cells like red blood cells. However, it is also possible that rare populations with very small transcriptomes will be removed, so this should be used with caution. As an alternative, RBCs can be removed by calculating the proportion of haemoglobin transcripts per cell.
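Two of the alternatives mentioned in these caveats can be sketched in a few lines: dropping whole clusters by their median % mitochondrial expression, and scoring cells by their haemoglobin fraction. The function names, the median cut-off and the short haemoglobin gene list are assumptions for illustration only; cluster labels are assumed to come from an initial QC clustering done elsewhere.

```python
from statistics import median

# Toy sketch: cluster-level and haemoglobin-based QC.
# Names, cut-offs and the haemoglobin gene list are illustrative assumptions.

def clusters_to_drop(pct_mito, cluster_labels, max_median_pct_mito=10.0):
    """Return clusters whose median % mitochondrial expression is too high."""
    by_cluster = {}
    for pct, label in zip(pct_mito, cluster_labels):
        by_cluster.setdefault(label, []).append(pct)
    return sorted(
        label
        for label, values in by_cluster.items()
        if median(values) > max_median_pct_mito
    )


def haemoglobin_fraction(row, genes, hb_genes=("HBA1", "HBA2", "HBB")):
    """Fraction of a cell's counts that come from haemoglobin genes."""
    total = sum(row)
    hb = sum(c for c, g in zip(row, genes) if g in hb_genes)
    return hb / total if total else 0.0


# The 9.8% cell sits just under a 10% per-cell threshold,
# but is removed along with the rest of its high-MT cluster.
pct_mito = [3.1, 4.0, 9.8, 22.5, 35.0, 2.2]
clusters = ["0", "0", "1", "1", "1", "0"]
drop = clusters_to_drop(pct_mito, clusters)

hb_frac = haemoglobin_fraction([90, 60, 50], ["HBB", "HBA1", "CD3E"])
```

In this toy example cluster "1" is flagged for removal, and the example cell has a haemoglobin fraction of 0.75, which would mark it as a likely red blood cell.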
If a dataset possesses an abundance of stressed cells, a gene score derived from the marker genes of a cluster of stressed cells identified in an initial QC clustering can be used to remove these cells. This gene list should also be referred to during differential expression testing, as stress-related expression may vary between samples and could be misinterpreted as a condition-specific expression trend.
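A gene score of this kind can be sketched as the mean normalised expression of the signature genes minus the mean over all genes in the cell. This is a deliberately simplified stand-in for the scoring functions in single-cell toolkits, which typically use matched control gene sets. The function name `signature_score`, the scaling factor and the three-gene signature are placeholders; in practice the signature would be the marker genes of the stressed cluster.

```python
from math import log1p

# Toy sketch: a simple per-cell signature score.
# The signature below is a placeholder; derive the real one from cluster markers.

def signature_score(row, genes, signature, scale=10000):
    """Mean log-normalised expression of signature genes minus the cell-wide mean."""
    total = sum(row)
    if total == 0:
        return 0.0
    norm = [log1p(scale * c / total) for c in row]
    sig = [x for x, g in zip(norm, genes) if g in signature]
    if not sig:
        return 0.0
    return sum(sig) / len(sig) - sum(norm) / len(norm)


genes = ["FOS", "JUN", "HSPA1A", "CD3E", "LYZ", "MS4A1"]
stress_signature = {"FOS", "JUN", "HSPA1A"}  # hypothetical example signature

stressed_cell = [50, 40, 60, 10, 20, 20]
calm_cell = [2, 1, 0, 80, 60, 57]

stressed_score = signature_score(stressed_cell, genes, stress_signature)
calm_score = signature_score(calm_cell, genes, stress_signature)
```

Cells with a high score can then be inspected or removed; the cut-off is a judgment call and is best chosen by looking at the score distribution across clusters.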
Another useful marker of suspected low-quality cells is the abundantly-expressed MALAT1.
- The simplest way to remove doublet cells (or tiny clusters) is to look for unusual co-expression, e.g. of B cell and myeloid markers (which can be derived from cluster DEGs).
- Alternatively, several dedicated tools have been developed, e.g. scrublet, DoubletDetection.
- Conceptually, many of these tools rely on transcriptional differences between cell types, meaning it remains difficult to detect doublets of the same, or very similar, cell types, e.g. two CD4+ Tm cells, or a CD8+ CTL and an NK cell.
- T or B cell doublets can be identified by the presence of a TCR or BCR in non-T/B clusters.
- Some argue doublets are biological, as they capture cells physically interacting (following the logic of PIC-seq). Perhaps some useful biology can be inferred from an over-representation of two cell types within a dataset's doublet events.
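The co-expression check from the first doublet point above can be sketched as a simple per-cell flag. The function name `coexpresses`, the marker lists and the requirement for at least two markers per lineage are illustrative assumptions; in practice the marker lists would come from cluster DEGs.

```python
# Toy sketch: flag cells that co-express markers of two distinct lineages.
# Marker lists and thresholds are illustrative assumptions.

def coexpresses(row, genes, markers_a, markers_b, min_count=1, min_markers=2):
    """True if the cell expresses at least min_markers genes from both lists."""
    expressed = {g for g, c in zip(genes, row) if c >= min_count}
    return (
        len(expressed & set(markers_a)) >= min_markers
        and len(expressed & set(markers_b)) >= min_markers
    )


b_cell_markers = {"MS4A1", "CD79A", "CD79B"}
myeloid_markers = {"LYZ", "CD14", "S100A8"}
genes = ["MS4A1", "CD79A", "CD79B", "LYZ", "CD14", "S100A8"]

cells = {
    "b_cell": [12, 8, 5, 0, 0, 1],      # one stray myeloid count is tolerated
    "monocyte": [0, 0, 0, 30, 9, 14],
    "suspect": [10, 6, 0, 25, 7, 0],    # two markers from each lineage
}

flags = {
    name: coexpresses(row, genes, b_cell_markers, myeloid_markers)
    for name, row in cells.items()
}
```

Requiring more than one marker per lineage is a design choice that makes the flag less sensitive to a single stray count, which could also arise from ambient RNA rather than a true doublet.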
- Droplet-based scRNAseq analyses assume all acquired RNAs are endogenous to cells.
- However, cell-free RNAs contained within the input solution are also captured and sequenced.
- Sequencing of cell-free RNA constitutes a background contamination that confounds biological interpretation.
- Irritatingly, ambient RNA can sometimes go undetected until final downstream analysis like differential expression. Therefore, it is worth assessing the likelihood of ambient expression in a dataset early in analysis.
Ambient RNA and biology
- Condition-specific ambient RNA can give a false impression of condition-specific differential expression.
- For instance, imagine tissue is sequenced from a cohort of healthy donors and a cohort of cancer patients. The cell populations are identical across the two cohorts, except that the cancer samples also contain tumour cells. If we assume the tumours are metabolically active, then across cancer samples there could be ambient RNA from tumour-derived metabolic genes. If the analyst then performs differential expression between NK cells in the healthy and cancer cohorts, the condition-specific ambient RNA may superficially give the impression of greater metabolic activity in cancer patient NK cells.
Checking for ambient RNA
- Some tools, e.g. SoupX, are designed specifically to detect ambient RNA.
- The presence of ambient RNA can also be determined by performing marker expression testing between individual samples, rather than between clusters, within a single cluster of transcriptionally similar cells.
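The per-sample comparison in the last point can be sketched crudely by comparing, within one cluster, the fraction of cells in each sample that detect a gene. This is a rough stand-in for proper marker testing, which would use a statistical test. The function names, the `min_gap` cut-off and the toy NK-cell data are assumptions for illustration.

```python
# Toy sketch: within one cluster, find genes detected very unevenly across samples.
# A crude stand-in for marker testing; names and cut-offs are assumptions.

def detection_by_sample(counts, samples, gene_index):
    """Fraction of cells per sample with a non-zero count for one gene."""
    tallies = {}
    for row, sample in zip(counts, samples):
        hits, n = tallies.get(sample, (0, 0))
        tallies[sample] = (hits + (1 if row[gene_index] > 0 else 0), n + 1)
    return {sample: hits / n for sample, (hits, n) in tallies.items()}


def sample_specific_genes(counts, samples, genes, min_gap=0.5):
    """Genes whose detection rate differs between samples by at least min_gap."""
    flagged = []
    for i, gene in enumerate(genes):
        fractions = detection_by_sample(counts, samples, i)
        if max(fractions.values()) - min(fractions.values()) >= min_gap:
            flagged.append(gene)
    return flagged


# Cells from a single NK cluster, drawn from two samples.
genes = ["NKG7", "IGKC"]
samples = ["A", "A", "A", "B", "B", "B"]
counts = [
    [5, 0], [7, 0], [4, 1],    # sample A: little immunoglobulin signal
    [6, 9], [5, 12], [8, 7],   # sample B: immunoglobulin counts in every NK cell
]

suspects = sample_specific_genes(counts, samples, genes)
```

An immunoglobulin gene detected in every NK cell of one sample but few of another is a plausible ambient RNA candidate, although a flagged gene still needs checking against genuine biological differences between samples.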
Removing ambient RNA
- Dedicated tools like SoupX are also the best bet to remove ambient expression, although in most cases they require raw cell x gene matrices and need to be run at the earliest stages of analysis.
- An imperfect alternative to this intensive approach is to remove ambient genes from the HVGs prior to clustering. This can work very well if only a small number of massively abundant genes (e.g. immunoglobulins) contribute to the ambient RNA profile.
- If ambient RNA is detected and removed, it may be worth recording the gene profile of the extracellular RNA for comparison with DEGs later in analysis (especially if only a subset of samples show a striking ambient RNA profile).
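The last two points amount to simple list operations once an ambient gene list exists. A minimal sketch, assuming the HVGs, ambient genes and DEGs are already available as lists of gene names (the function names and gene lists are hypothetical):

```python
# Toy sketch: exclude ambient genes from HVGs and check DEGs against them.
# Function names and gene lists are hypothetical examples.

def drop_ambient(hvgs, ambient_genes):
    """Return HVGs with suspected ambient genes removed, preserving order."""
    ambient = set(ambient_genes)
    return [g for g in hvgs if g not in ambient]


def ambient_overlap(degs, ambient_genes):
    """Return DEGs that also appear in the recorded ambient profile."""
    ambient = set(ambient_genes)
    return [g for g in degs if g in ambient]


ambient_genes = ["IGKC", "IGHG1", "IGLC2"]
hvgs = ["IGKC", "NKG7", "IGHG1", "CD3E", "LYZ"]
degs = ["GZMB", "IGKC", "PRF1"]

hvgs_for_clustering = drop_ambient(hvgs, ambient_genes)
degs_to_review = ambient_overlap(degs, ambient_genes)
```

Note that this only stops ambient genes from driving the clustering; the counts themselves are unchanged, so any DEG that overlaps the ambient profile still deserves a second look.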
Kane Foster (22-07-2022)