Skip to content

Commit

Permalink
add domain detection to tutorial
Browse files Browse the repository at this point in the history
  • Loading branch information
ToryDeng committed Dec 23, 2023
1 parent 6696866 commit a23f9f8
Show file tree
Hide file tree
Showing 5 changed files with 109 additions and 447 deletions.
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -160,4 +160,5 @@ cython_debug/
#.idea/


data/*
data/
notebooks/
Binary file modified docs/assets/img/co-expression.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/img/spatial-clustering.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
151 changes: 107 additions & 44 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -87,18 +87,37 @@ import scanpy as sc
import squidpy as sq
import STAGATE_pyG
```

In this tutorial, we utilize [STAGATE](https://doi.org/10.1038/s41467-022-29439-6) to denoise gene expressions from the SRT dataset. `STAGATE` is available in two versions: one based on TensorFlow, and another using the [PyG](https://pyg.org/) library. We will be using the PyG version, `STAGATE_pyG`. `STAGATE_pyG` is not included in `LEGEND`'s dependencies, so make sure to install it separately by the instructions in its [documentation](https://stagate.readthedocs.io/en/latest/Installation_pyG.html).

### Obtain the Datasets

We will be working with two mouse brain datasets featured in our paper. The scRNA-seq dataset is publicly available in the GEO database under the accession number [GSE115746](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115746). The SRT dataset is hosted on the [10x Genomics official site](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Adult_Mouse_Brain). As a convenient alternative, `LEGEND` provides functions for directly downloading these datasets:
We will be working with two mouse brain datasets featured in our paper. The scRNA-seq dataset is publicly available in the GEO database under the accession number [GSE115746](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115746). The SRT dataset is hosted on the [10x Genomics official site](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Adult_Mouse_Brain). These datasets can also be downloaded via [squidpy](https://squidpy.readthedocs.io/en/stable/api.html#module-squidpy.datasets). Below, we will use `Squidpy` and `OpenCV` to load the datasets along with the H&E-stained image:

```python
# Get the scRNA-seq dataset
adata_rna = lg.load_mouse_cortex()
adata_rna = sq.datasets.sc_mouse_cortex()

# Get the SRT dataset and its H&E image
adata_st = sq.datasets.visium_hne_adata()
img = cv2.imread("/path/to/visium_hne.tiff")
```

### Data Preprocessing

Data preprocessing is vital to ensure the quality of downstream analyses. We start by using the raw counts present in both datasets:

```python
adata_rna = adata_rna.raw.to_adata()
adata_st = adata_st.raw.to_adata()
```

Next, we filter out low-expression genes to reduce the dimensionality of the data and to focus on the most informative features:

# Get the SRT dataset and the H&E image
adata_st, img = lg.load_mouse_brain()
```python
# Filter genes based on the number of cells with expression
sc.pp.filter_genes(adata_rna, min_cells=3)
sc.pp.filter_genes(adata_st, min_cells=3)
```

The variables `adata_rna` and `adata_st` are `AnnData` objects, which are data structures for storing gene expression data along with annotations of cells, spots, and genes from the scRNA-seq and SRT datasets, respectively. The `img` variable is a `NumPy` array representing an H&E stained image of brain tissue and has dimensions (11757, 11291, 3), corresponding to its width, height, and color channels.
Expand All @@ -111,11 +130,10 @@ adata_rna

```text
AnnData object with n_obs × n_vars = 21697 × 36825
obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'n_genes'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
```

Expand All @@ -124,11 +142,12 @@ adata_st
```

```text
AnnData object with n_obs × n_vars = 2702 × 19652
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
uns: 'spatial'
obsm: 'spatial'
AnnData object with n_obs × n_vars = 2688 × 18078
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'leiden', 'cluster'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
uns: 'cluster_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'spatial', 'umap'
obsm: 'X_pca', 'X_umap', 'spatial'
obsp: 'connectivities', 'distances'
```

### Run LEGEND
Expand All @@ -151,7 +170,7 @@ Let's elaborate on these parameters:

- **version:** `LEGEND` provides two variants for gene clustering: `fast` and `ps`. The `fast` version quickly clusters genes but doesn't assess gene relevance, redundancy, or complementarity. The `ps` version, recommended for multi-modality analysis, performs these comprehensive computations.
- **verbosity:** The verbosity level. `0` displays only warnings and errors. `1` also shows info messages. `2`, also shows debug messages.
- **relevant_gene_pct:** The percentage of genes to include as relevant. Here we use a relatively small value.
- **relevant_gene_pct:** The percentage of genes to include as relevant. Here we use a relatively small value to reduce the computational time required for this tutorial.
- **return_info:** Whether intermediate results are returned alongside the selected genes. Set to `True` for a more in-depth review of the clustering process.

Now, apply `lg.GeneClust` to the scRNA-seq dataset:
Expand All @@ -165,7 +184,7 @@ info_rna, sc_genes = lg.GeneClust(
`n_obs_clusters` is the number of clusters in cell/spots clustering. It is used in indentification of high-confidence cells/spots.
For the scRNA-seq dataset, we specify `n_obs_clusters=23` which is equal to the true number of cell types. `modality="sc"` indicates the modality is scRNA-seq. `info_rna` is an `AnnData` object storing intermediate results, and `sc_genes` is a `NumPy` array comprising selected genes.

During processing, a warning may appear:
If `adata.X` contains normalized counts, a warning may appear during processing:

```text
2023-12-21 22:56:16.462 | WARNING | LEGEND._validation:check_raw_counts:71 - Will directly use the possible normalized counts found in `adata.X`.
Expand All @@ -180,12 +199,12 @@ info_rna
```

```text
AnnData object with n_obs × n_vars = 3076 × 3683
obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'n_genes', 'cluster'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'relevance', 'outlier_score', 'cluster', 'representative'
uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'MST'
AnnData object with n_obs × n_vars = 3889 × 3683
obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'cluster'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'relevance', 'outlier_score', 'cluster', 'representative'
uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'pearson_residuals_normalization', 'MST'
obsm: 'X_pca', 'X_umap'
varm: 'PCs', 'X_pca'
varm: 'X_pca'
obsp: 'connectivities', 'distances'
varp: 'redundancy'
```
Expand All @@ -210,14 +229,18 @@ Here, `modality="st"` specifies that the dataset is SRT. The `alpha` parameter a

The resulting `AnnData` object (`info_st`) and the associated dataset details can be similarly inspected to understand the generated clusters, relevance values, and redundancy scores specific to the SRT dataset.

```python
info_st
```

```text
AnnData object with n_obs × n_vars = 1616 × 1966
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes', 'cluster'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'relevance', 'outlier_score', 'cluster', 'representative'
uns: 'spatial', 'log1p', 'spatial_neighbors', 'MST'
obsm: 'spatial', 'X_pca'
AnnData object with n_obs × n_vars = 1607 × 1808
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'leiden', 'cluster'
var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'relevance', 'outlier_score', 'cluster', 'representative'
uns: 'cluster_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'spatial', 'umap', 'log1p', 'spatial_neighbors', 'MST'
obsm: 'X_pca', 'X_umap', 'spatial'
varm: 'X_pca'
obsp: 'spatial_connectivities', 'spatial_distances'
obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
varp: 'redundancy'
```

Expand All @@ -240,21 +263,23 @@ The output of the integration is summarized in informational logs that detail th


```text
2023-12-21 22:27:56.852 | INFO | Detected 1230 genes shared by SRT and scRNA-seq.
2023-12-21 22:27:57.678 | INFO | Start to compute complementarity on SRT data...
2023-12-21 22:27:57.680 | INFO | Computing gene complementarity...
2023-12-21 22:28:08.358 | INFO | Gene complementarity computed.
2023-12-21 22:28:08.360 | INFO | Start to compute complementarity on scRNA-seq data...
2023-12-21 22:28:08.361 | INFO | Computing gene complementarity...
2023-12-21 22:28:19.918 | INFO | Gene complementarity computed.
2023-12-21 22:28:20.099 | INFO | Selected 301 genes.
2023-12-23 23:24:07.948 | INFO | Detected 1020 genes shared by SRT and scRNA-seq.
2023-12-23 23:24:08.483 | INFO | Start to compute complementarity on SRT data...
2023-12-23 23:24:08.485 | INFO | Computing gene complementarity...
2023-12-23 23:24:15.746 | INFO | Gene complementarity computed.
2023-12-23 23:24:15.749 | INFO | Start to compute complementarity on scRNA-seq data...
2023-12-23 23:24:15.749 | INFO | Computing gene complementarity...
2023-12-23 23:24:23.683 | INFO | Gene complementarity computed.
2023-12-23 23:24:23.829 | INFO | Selected 346 genes.
```

These messages indicate that `LEGEND` has identified `1230` genes in common between the two datasets and has computed their complementarity within each modality. The final selection includes 301 genes that likely have significant roles across the examined modalities.
These messages indicate that `LEGEND` has identified `1020` genes in common between the two datasets and has computed their complementarity within each modality. The final selection includes `346` genes that likely have significant roles across the examined modalities.

> If your downstream analysis, such as domain detection, requires a broader gene set, you may adjust the `relevant_gene_pct` parameter that was set earlier in the `lg.GeneClust` function. Increasing this value allows for a more inclusive gene selection.

## Applications

### Co-expressed genes

Genes in the same cluster are likely to be co-expressed and show similar spatial expression patterns. To highlight the co-expression patterns of genes within clusters identified by `LEGEND`, we first denoise the gene expressions in the SRT dataset using `STAGATE`.
Expand All @@ -273,32 +298,70 @@ adata_st = STAGATE_pyG.train_STAGATE(adata_st, save_reconstrction=True)

```text
------Calculating spatial graph...
The graph contains 15712 edges, 2702 cells.
5.8150 neighbors per cell on average.
Size of Input: (2702, 19652)
100%|██████████| 1000/1000 [00:22<00:00, 43.84it/s]
The graph contains 15580 edges, 2688 cells.
5.7961 neighbors per cell on average.
Size of Input: (2688, 18078)
100%|██████████| 1000/1000 [00:18<00:00, 52.94it/s]
```

As a result, `STAGATE` has produced a denoised gene expression matrix in `adata_st.layers['STAGATE_ReX']`, which we can then explore to visualize the co-expression patterns.

For example, by inspecting two genes, `Caly` and `Zcchc18`, we find that they are in the same cluster:
For example, by inspecting two genes, `Maged1` and `Zcchc18`, we find that they are in the same cluster:

```python
integration_info.var.loc[["Caly", "Zcchc18"], 'cluster']
integration_info.var.loc[["Maged1", "Zcchc18"], 'cluster']
```

```text
Caly 48
Zcchc18 48
Zcchc18 29
Maged1 29
Name: cluster, dtype: int64
```

Both genes belong to cluster 48. We can now proceed to plot their spatial expression across different spots:
Both genes belong to cluster `29`. We can now proceed to plot their spatial expression across different spots:

```python
sq.pl.spatial_scatter(adata_st, layer='STAGATE_ReX', color=["Caly", "Zcchc18"], figsize=(5, 5))
sq.pl.spatial_scatter(adata_st, layer='STAGATE_ReX', color=["Maged1", "Zcchc18"], figsize=(5, 5))
```

<img src="assets/img/co-expression.png" width="100%">

Areas with high expression of one gene correlate with high expression of the other, and similarly for low expression areas. This illustrates the effectiveness of `LEGEND` in identifying biologically relevant gene clusters.


### Spatial Domain Detection

After feature selection with `LEGEND`, we can proceed to downstream analyses such as domain detection. Domain detection is crucial for understanding the spatial organization of gene expression within tissue. To demonstrate the power of `LEGEND` in facilitating domain detection, we apply `SpaGCN` to both the full gene set and the selected genes in `integrated_genes`. This is conveniently accomplished using the `run_SpaGCN` function provided by `LEGEND`:

```python
# Run SpaGCN on the entire dataset
adata_st.obs['pred_all'] = lg.tl.confidence.run_SpaGCN(adata_st, img, n_spot_cluster=15)

# Run SpaGCN using only the selected genes from LEGEND
adata_st.obs['pred_selected'] = lg.tl.confidence.run_SpaGCN(adata_st[:, integrated_genes], img, n_spot_cluster=15)
```

To analyze the resulting domain predictions, we visualize them as follows:

```python
# Visualize the results from both the full gene set and the selected genes
axes = sq.pl.spatial_scatter(adata_st, color=["cluster", "pred_all", "pred_selected"], figsize=(7, 9), wspace=0, return_ax=True)

# Adjust the first legend for 'cluster'
axes[0].legend(ncol=3, bbox_to_anchor=(0.5, -0.1), loc='upper center', frameon=False)

# Adjust the second legend for 'pred_all' and add the Adjusted Rand Index (ARI) score to the title
axes[1].legend(ncol=5, bbox_to_anchor=(0.5, -0.1), loc='upper center', frameon=False)
axes[1].set_title(axes[1].get_title() + f" ARI={np.round(ari(adata_st.obs.cluster, adata_st.obs.pred_all), 2)}")

# Adjust the third legend for 'pred_selected' and add the ARI score to the title
axes[2].legend(ncol=5, bbox_to_anchor=(0.5, -0.1), loc='upper center', frameon=False)
axes[2].set_title(axes[2].get_title() + f" ARI={np.round(ari(adata_st.obs.cluster, adata_st.obs.pred_selected), 2)}")
```

<img src="spatial-clustering.png" width="100%">

The above figure illustrates the spatial clustering results with each domain prediction visualized. The ARI (Adjusted Rand Index) scores indicate the performance of `SpaGCN`, which is observed to improve with the feature selection performed by `LEGEND`, even when hundreds of genes are selected.



Loading

0 comments on commit a23f9f8

Please sign in to comment.