add domain detection to tutorial

ToryDeng · Dec 23, 2023 · a23f9f8 · a23f9f8
1 parent 6696866
commit a23f9f8
Show file tree

Hide file tree

Showing 5 changed files with 109 additions and 447 deletions.
diff --git a/.gitignore b/.gitignore
@@ -160,4 +160,5 @@ cython_debug/
 #.idea/
 
 
-data/*
+data/
+notebooks/
diff --git a/docs/assets/img/co-expression.png b/docs/assets/img/co-expression.png
diff --git a/docs/assets/img/spatial-clustering.png b/docs/assets/img/spatial-clustering.png
diff --git a/docs/index.md b/docs/index.md
@@ -87,18 +87,37 @@ import scanpy as sc
 import squidpy as sq
 import STAGATE_pyG
 ```
+
 In this tutorial, we utilize [STAGATE](https://doi.org/10.1038/s41467-022-29439-6) to denoise gene expressions from the SRT dataset. `STAGATE` is available in two versions: one based on TensorFlow, and another using the [PyG](https://pyg.org/) library. We will be using the PyG version, `STAGATE_pyG`. `STAGATE_pyG` is not included in `LEGEND`'s dependencies, so make sure to install it separately by the instructions in its [documentation](https://stagate.readthedocs.io/en/latest/Installation_pyG.html).
 
 ### Obtain the Datasets
 
-We will be working with two mouse brain datasets featured in our paper. The scRNA-seq dataset is publicly available in the GEO database under the accession number [GSE115746](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115746). The SRT dataset is hosted on the [10x Genomics official site](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Adult_Mouse_Brain). As a convenient alternative, `LEGEND` provides functions for directly downloading these datasets:
+We will be working with two mouse brain datasets featured in our paper. The scRNA-seq dataset is publicly available in the GEO database under the accession number [GSE115746](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115746). The SRT dataset is hosted on the [10x Genomics official site](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Adult_Mouse_Brain). These datasets can also be downloaded via [squidpy](https://squidpy.readthedocs.io/en/stable/api.html#module-squidpy.datasets).  Below, we will use `Squidpy` and `OpenCV` to load the datasets along with the H&E-stained image:
 
 ```python
 # Get the scRNA-seq dataset
-adata_rna = lg.load_mouse_cortex()
+adata_rna = sq.datasets.sc_mouse_cortex()
+
+# Get the SRT dataset and its H&E image
+adata_st = sq.datasets.visium_hne_adata()
+img = cv2.imread("/path/to/visium_hne.tiff")
+```
+
+### Data Preprocessing
+
+Data preprocessing is vital to ensure the quality of downstream analyses. We start by using the raw counts present in both datasets:
+
+```python
+adata_rna = adata_rna.raw.to_adata()
+adata_st = adata_st.raw.to_adata()
+```
+
+Next, we filter out low-expression genes to reduce the dimensionality of the data and to focus on the most informative features:
 
-# Get the SRT dataset and the H&E image
-adata_st, img = lg.load_mouse_brain()
+```python
+# Filter genes based on the number of cells with expression
+sc.pp.filter_genes(adata_rna, min_cells=3)
+sc.pp.filter_genes(adata_st, min_cells=3)
 ```
 
 The variables `adata_rna` and `adata_st` are `AnnData` objects, which are data structures for storing gene expression data along with annotations of cells, spots, and genes from the scRNA-seq and SRT datasets, respectively. The `img` variable is a `NumPy` array representing an H&E stained image of brain tissue and has dimensions (11757, 11291, 3), corresponding to its width, height, and color channels.
@@ -111,11 +130,10 @@ adata_rna
 
 ```text
 AnnData object with n_obs × n_vars = 21697 × 36825
-    obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'n_genes'
-    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
+    obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts'
+    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
     uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap'
     obsm: 'X_pca', 'X_umap'
-    varm: 'PCs'
     obsp: 'connectivities', 'distances'
 ```
 
@@ -124,11 +142,12 @@ adata_st
 ```
 
 ```text
-AnnData object with n_obs × n_vars = 2702 × 19652
-    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes'
-    var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
-    uns: 'spatial'
-    obsm: 'spatial'
+AnnData object with n_obs × n_vars = 2688 × 18078
+    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'leiden', 'cluster'
+    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
+    uns: 'cluster_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'spatial', 'umap'
+    obsm: 'X_pca', 'X_umap', 'spatial'
+    obsp: 'connectivities', 'distances'
 ```
 
 ### Run LEGEND
@@ -151,7 +170,7 @@ Let's elaborate on these parameters:
 
 - **version:** `LEGEND` provides two variants for gene clustering: `fast` and `ps`. The `fast` version quickly clusters genes but doesn't assess gene relevance, redundancy, or complementarity. The `ps` version, recommended for multi-modality analysis, performs these comprehensive computations.
 - **verbosity:** The verbosity level. `0` displays only warnings and errors. `1` also shows info messages. `2`, also shows debug messages.
-- **relevant_gene_pct:** The percentage of genes to include as relevant. Here we use a relatively small value.
+- **relevant_gene_pct:** The percentage of genes to include as relevant. Here we use a relatively small value to reduce the computational time required for this tutorial.
 - **return_info:** Whether intermediate results are returned alongside the selected genes. Set to `True` for a more in-depth review of the clustering process.
 
 Now, apply `lg.GeneClust` to the scRNA-seq dataset:
@@ -165,7 +184,7 @@ info_rna, sc_genes = lg.GeneClust(
 `n_obs_clusters` is the number of clusters in cell/spots clustering. It is used in indentification of high-confidence cells/spots.
 For the scRNA-seq dataset, we specify `n_obs_clusters=23` which is equal to the true number of cell types. `modality="sc"` indicates the modality is scRNA-seq. `info_rna` is an `AnnData` object storing intermediate results, and `sc_genes` is a `NumPy` array comprising selected genes.
 
-During processing, a warning may appear:
+If `adata.X` contains normalized counts, a warning may appear during processing:
 
 ```text
 2023-12-21 22:56:16.462 | WARNING  | LEGEND._validation:check_raw_counts:71 - Will directly use the possible normalized counts found in `adata.X`.
@@ -180,12 +199,12 @@ info_rna
 ```
 
 ```text
-AnnData object with n_obs × n_vars = 3076 × 3683
-    obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'n_genes', 'cluster'
-    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'relevance', 'outlier_score', 'cluster', 'representative'
-    uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'MST'
+AnnData object with n_obs × n_vars = 3889 × 3683
+    obs: 'sample_name', 'organism', 'donor_sex', 'cell_class', 'cell_subclass', 'cell_cluster', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'cluster'
+    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'relevance', 'outlier_score', 'cluster', 'representative'
+    uns: 'cell_class_colors', 'cell_subclass_colors', 'hvg', 'neighbors', 'pca', 'umap', 'pearson_residuals_normalization', 'MST'
     obsm: 'X_pca', 'X_umap'
-    varm: 'PCs', 'X_pca'
+    varm: 'X_pca'
     obsp: 'connectivities', 'distances'
     varp: 'redundancy'
 ```
@@ -210,14 +229,18 @@ Here, `modality="st"` specifies that the dataset is SRT. The `alpha` parameter a
 
 The resulting `AnnData` object (`info_st`) and the associated dataset details can be similarly inspected to understand the generated clusters, relevance values, and redundancy scores specific to the SRT dataset.
 
+```python
+info_st
+```
+
 ```text
-AnnData object with n_obs × n_vars = 1616 × 1966
-    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes', 'cluster'
-    var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'relevance', 'outlier_score', 'cluster', 'representative'
-    uns: 'spatial', 'log1p', 'spatial_neighbors', 'MST'
-    obsm: 'spatial', 'X_pca'
+AnnData object with n_obs × n_vars = 1607 × 1808
+    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_counts', 'leiden', 'cluster'
+    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'relevance', 'outlier_score', 'cluster', 'representative'
+    uns: 'cluster_colors', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'spatial', 'umap', 'log1p', 'spatial_neighbors', 'MST'
+    obsm: 'X_pca', 'X_umap', 'spatial'
     varm: 'X_pca'
-    obsp: 'spatial_connectivities', 'spatial_distances'
+    obsp: 'connectivities', 'distances', 'spatial_connectivities', 'spatial_distances'
     varp: 'redundancy'
 ```
 
@@ -240,21 +263,23 @@ The output of the integration is summarized in informational logs that detail th
 
 
 ```text
-2023-12-21 22:27:56.852 | INFO  | Detected 1230 genes shared by SRT and scRNA-seq.
-2023-12-21 22:27:57.678 | INFO  | Start to compute complementarity on SRT data...
-2023-12-21 22:27:57.680 | INFO  | Computing gene complementarity...
-2023-12-21 22:28:08.358 | INFO  | Gene complementarity computed.
-2023-12-21 22:28:08.360 | INFO  | Start to compute complementarity on scRNA-seq data...
-2023-12-21 22:28:08.361 | INFO  | Computing gene complementarity...
-2023-12-21 22:28:19.918 | INFO  | Gene complementarity computed.
-2023-12-21 22:28:20.099 | INFO  | Selected 301 genes.
+2023-12-23 23:24:07.948 | INFO  | Detected 1020 genes shared by SRT and scRNA-seq.
+2023-12-23 23:24:08.483 | INFO  | Start to compute complementarity on SRT data...
+2023-12-23 23:24:08.485 | INFO  | Computing gene complementarity...
+2023-12-23 23:24:15.746 | INFO  | Gene complementarity computed.
+2023-12-23 23:24:15.749 | INFO  | Start to compute complementarity on scRNA-seq data...
+2023-12-23 23:24:15.749 | INFO  | Computing gene complementarity...
+2023-12-23 23:24:23.683 | INFO  | Gene complementarity computed.
+2023-12-23 23:24:23.829 | INFO  | Selected 346 genes.
 ```
 
-These messages indicate that `LEGEND` has identified `1230` genes in common between the two datasets and has computed their complementarity within each modality. The final selection includes 301 genes that likely have significant roles across the examined modalities.
+These messages indicate that `LEGEND` has identified `1020` genes in common between the two datasets and has computed their complementarity within each modality. The final selection includes `346` genes that likely have significant roles across the examined modalities.
 
 > If your downstream analysis, such as domain detection, requires a broader gene set, you may adjust the `relevant_gene_pct` parameter that was set earlier in the `lg.GeneClust` function. Increasing this value allows for a more inclusive gene selection.
 
 
+## Applications
+
 ### Co-expressed genes
 
 Genes in the same cluster are likely to be co-expressed and show similar spatial expression patterns. To highlight the co-expression patterns of genes within clusters identified by `LEGEND`, we first denoise the gene expressions in the SRT dataset using `STAGATE`.
@@ -273,32 +298,70 @@ adata_st = STAGATE_pyG.train_STAGATE(adata_st, save_reconstrction=True)
 
 ```text
 ------Calculating spatial graph...
-The graph contains 15712 edges, 2702 cells.
-5.8150 neighbors per cell on average.
-Size of Input:  (2702, 19652)
-100%|██████████| 1000/1000 [00:22<00:00, 43.84it/s]
+The graph contains 15580 edges, 2688 cells.
+5.7961 neighbors per cell on average.
+Size of Input:  (2688, 18078)
+100%|██████████| 1000/1000 [00:18<00:00, 52.94it/s]
 ```
 
 As a result, `STAGATE` has produced a denoised gene expression matrix in `adata_st.layers['STAGATE_ReX']`, which we can then explore to visualize the co-expression patterns.
 
-For example, by inspecting two genes, `Caly` and `Zcchc18`, we find that they are in the same cluster:
+For example, by inspecting two genes, `Maged1` and `Zcchc18`, we find that they are in the same cluster:
 
 ```python
-integration_info.var.loc[["Caly", "Zcchc18"], 'cluster']
+integration_info.var.loc[["Maged1", "Zcchc18"], 'cluster']
 ```
 
 ```text
-Caly       48
-Zcchc18    48
+Zcchc18    29
+Maged1     29
 Name: cluster, dtype: int64
 ```
 
-Both genes belong to cluster 48. We can now proceed to plot their spatial expression across different spots:
+Both genes belong to cluster `29`. We can now proceed to plot their spatial expression across different spots:
 
 ```python
-sq.pl.spatial_scatter(adata_st, layer='STAGATE_ReX', color=["Caly", "Zcchc18"], figsize=(5, 5))
+sq.pl.spatial_scatter(adata_st, layer='STAGATE_ReX', color=["Maged1", "Zcchc18"], figsize=(5, 5))
 ```
 
 <img src="assets/img/co-expression.png" width="100%">
 
 Areas with high expression of one gene correlate with high expression of the other, and similarly for low expression areas. This illustrates the effectiveness of `LEGEND` in identifying biologically relevant gene clusters.
+
+
+### Spatial Domain Detection
+
+After feature selection with `LEGEND`, we can proceed to downstream analyses such as domain detection. Domain detection is crucial for understanding the spatial organization of gene expression within tissue. To demonstrate the power of `LEGEND` in facilitating domain detection, we apply `SpaGCN` to both the full gene set and the selected genes in `integrated_genes`. This is conveniently accomplished using the `run_SpaGCN` function provided by `LEGEND`:
+
+```python
+# Run SpaGCN on the entire dataset
+adata_st.obs['pred_all'] = lg.tl.confidence.run_SpaGCN(adata_st, img, n_spot_cluster=15)
+
+# Run SpaGCN using only the selected genes from LEGEND
+adata_st.obs['pred_selected'] = lg.tl.confidence.run_SpaGCN(adata_st[:, integrated_genes], img, n_spot_cluster=15)
+```
+
+To analyze the resulting domain predictions, we visualize them as follows:
+
+```python
+# Visualize the results from both the full gene set and the selected genes
+axes = sq.pl.spatial_scatter(adata_st, color=["cluster", "pred_all", "pred_selected"], figsize=(7, 9), wspace=0, return_ax=True)
+
+# Adjust the first legend for 'cluster'
+axes[0].legend(ncol=3, bbox_to_anchor=(0.5, -0.1), loc='upper center', frameon=False)
+
+# Adjust the second legend for 'pred_all' and add the Adjusted Rand Index (ARI) score to the title
+axes[1].legend(ncol=5, bbox_to_anchor=(0.5, -0.1), loc='upper center', frameon=False)
+axes[1].set_title(axes[1].get_title() + f" ARI={np.round(ari(adata_st.obs.cluster, adata_st.obs.pred_all), 2)}")
+
+# Adjust the third legend for 'pred_selected' and add the ARI score to the title
+axes[2].legend(ncol=5, bbox_to_anchor=(0.5, -0.1), loc='upper center', frameon=False)
+axes[2].set_title(axes[2].get_title() + f" ARI={np.round(ari(adata_st.obs.cluster, adata_st.obs.pred_selected), 2)}")
+```
+
+<img src="spatial-clustering.png" width="100%">
+
+The above figure illustrates the spatial clustering results with each domain prediction visualized. The ARI (Adjusted Rand Index) scores indicate the performance of `SpaGCN`, which is observed to improve with the feature selection performed by `LEGEND`, even when hundreds of genes are selected.
+
+
+
-Original file line number
+Diff line change
@@ Expand Up / @@ -160,4 +160,5 @@ cython_debug/ @@
     #.idea/
-    data/*
+    data/
+    notebooks/