The SPoTLIghT pipeline consists of several modules:

- Extracting histopathological features (`extracthistopatho`)
- Deconvolution of bulk RNA-seq data (`deconvbulk`)
- Building a multi-task cell type model to predict cell type abundances at tile level (`buildmodel`)
- Predicting tile-level cell type abundances using the multi-task models (`predicttiles`)
- Computing spatial features from the tile-level cell type abundances (`computespatial`)

The name in brackets is the abbreviation to use when selecting a module to run.
To run all modules, set the `spotlight_modules` parameter as follows:

`spotlight_modules: "extracthistopatho, deconvbulk, buildmodel, predicttiles, computespatial"`

When running only a subset of modules, make sure to set the parameters that those modules require.
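
Because `spotlight_modules` is a single comma-separated string, a typo in one abbreviation is easy to miss. The following is a minimal sketch of how such a string could be split and checked against the five abbreviations listed above. The helper `parse_modules` is hypothetical and is not part of the pipeline.

```python
# Module abbreviations as listed in this README.
KNOWN_MODULES = (
    "extracthistopatho",
    "deconvbulk",
    "buildmodel",
    "predicttiles",
    "computespatial",
)


def parse_modules(spotlight_modules):
    """Split a comma-separated module string and reject unknown names.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    requested = [name.strip() for name in spotlight_modules.split(",") if name.strip()]
    unknown = [name for name in requested if name not in KNOWN_MODULES]
    if unknown:
        raise ValueError("Unknown module(s): " + ", ".join(unknown))
    return requested


print(parse_modules("extracthistopatho, deconvbulk"))
```

Whitespace around the commas is ignored, so `"buildmodel,predicttiles"` and `"buildmodel, predicttiles"` are treated the same in this sketch.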
Input files:

- `clinical_file_out_file`:
- `image_dir`: Directory with H&E images.
- `path_codebook`: Path to `codebook.txt`.
- `checkpoint_path`: Checkpoints of the deep learning model; see the TensorFlow repository tensorflow/models. The checkpoint used in the manuscript can be downloaded from https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BSST292. Note that the path should point to the directory containing the checkpoint files.
- `path_tissue_classes`: Path to `tissue_classes.csv`, which is provided.
- `is_tcga`: Indicates whether the dataset is from TCGA.
- `tumor_purity_threshold`: Minimum tumor purity for a slide to be kept (default=80)
- `gradient_mag_filter`: Minimum gradient magnitude, used for filtering out non-informative and/or blurry tiles (default=10)
- `n_shards`: Number of shards for creating TFRecords (default=320)
- `bot_out_filename`: Filename for the extracted histopathological features (default="bot_train")
- `pred_out_filename`: Filename for the predictions (default="pred_train")
- `model_name`: Name of the model used; make sure it corresponds to the model of the checkpoints (default="inception_v4")
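
To illustrate what a minimum gradient magnitude filter does, here is a rough sketch in plain Python. It assumes the measure is the mean forward-difference gradient magnitude of a grayscale tile; this README does not state how the pipeline actually computes it, and the functions `mean_gradient_magnitude` and `keep_tile` are hypothetical.

```python
import math


def mean_gradient_magnitude(tile):
    """Mean finite-difference gradient magnitude of a grayscale tile.

    Assumption: this is one common sharpness measure; the pipeline's own
    definition of "gradient magnitude" is not given in this README.
    """
    height, width = len(tile), len(tile[0])
    total = 0.0
    count = 0
    for y in range(height - 1):
        for x in range(width - 1):
            dx = tile[y][x + 1] - tile[y][x]  # horizontal forward difference
            dy = tile[y + 1][x] - tile[y][x]  # vertical forward difference
            total += math.hypot(dx, dy)
            count += 1
    return total / count if count else 0.0


def keep_tile(tile, gradient_mag_filter=10.0):
    """Keep a tile only if its mean gradient magnitude reaches the minimum."""
    return mean_gradient_magnitude(tile) >= gradient_mag_filter


flat = [[128.0] * 4 for _ in range(4)]                  # uniform tile, no texture
striped = [[0.0, 255.0, 0.0, 255.0] for _ in range(4)]  # sharp vertical edges
print(keep_tile(flat), keep_tile(striped))
```

A uniform tile (background or heavy blur) has a gradient magnitude near zero and is dropped, while a tile with clear edges passes the threshold.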

- `gene_exp_path`: Path to the gene expression file (`.txt`)
- `is_tpm`: Indicates whether the file at `gene_exp_path` is TPM-normalized (default=false)
- `quantiseq_path`: Path to quanTIseq results (`.csv`)
- `epic_path`: Path to EPIC results (`.csv`)
- `mcp_counter_path`: Path to MCP-counter results (`.csv`)
- `xcell_path`: Path to xCell results (`.csv`)

The four results files above are optional; by default, all tools are run. If results for one or more tools have already been generated, set the corresponding paths.
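
The rule described above (run every tool unless a results path is supplied) can be sketched as follows. This is an illustration only: the function `tools_to_run` is hypothetical, and it assumes that a missing or empty value means "not set". The pipeline itself may use a placeholder value instead.

```python
def tools_to_run(config):
    """Return the deconvolution tools that have no precomputed results path.

    Assumption: a missing or empty parameter means the tool still has to run.
    """
    path_params = {
        "quanTIseq": "quantiseq_path",
        "EPIC": "epic_path",
        "MCP-counter": "mcp_counter_path",
        "xCell": "xcell_path",
    }
    return [tool for tool, param in path_params.items() if not config.get(param)]


# Example: EPIC results already exist, so only the other three tools remain.
print(tools_to_run({"epic_path": "results/epic.csv"}))
```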

Download the signatures and published scores listed in the table below.

Parameter | Reference | Additional info |
---|---|---|
absolute_tumor_purity_path | https://gdc.cancer.gov/about-data/publications/panimmune | Download the 'ABSOLUTE purity/ploidy file', or use direct link |
estimate_scores_path | https://bioinformatics.mdanderson.org/estimate/index.html | Download the relevant file for the cancer type of interest; use the RNA-seqV2 column on the page. |
gibbons_scores_path | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5503821/ | Download the 'Supp Datafile S1.' or use the [direct link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5503821/bin/NIHMS840944-supplement-Supp_Datafile_S1.xlsx) |
thorsson_scores_path | https://gdc.cancer.gov/about-data/publications/panimmune | Download the 'Score for 160 Genes Signatures in Tumor Samples', or use direct link |

Publicly available scores

- `thorsson_scores_path`: "assets/local/Thorsson_Scores_160_Signatures.tsv"
- `estimate_scores_path`: "assets/local/ESTIMATE.xlsx"
- `absolute_tumor_purity_path`: "assets/local/TCGA_ABSOLUTE.txt"
- `gibbons_scores_path`: "assets/local/Gibbons.xlsx"

For more information, see the table in modules/trainmultitaskmodel.md.

- `bottleneck_features_path`: Path to the extracted histopathological features, generated by the `extracthistopatho` module
- `var_names_path`: "assets/task_selection_names.pkl"
- `target_features_path`: "assets/NO_FILE"
- `model_cell_types`: String of cell types for which a multi-task model has to be built (default="CAFs, Endothelial_cells, T_cells, tumor_purity")

Please note that models can only be built for the cell types listed in the default.

Setup nested cross-validation

- `alpha_min`: Min. value for grid, 10^alpha_min (default=-4)
- `alpha_max`: Max. value for grid, 10^alpha_max (default=-1)
- `n_steps`: Number of steps in grid (default=40)
- `n_outerfolds`: Number of outer folds (default=5)
- `n_innerfolds`: Number of inner folds (default=10)
- `n_tiles`: Number of tiles selected per slide (default=50)
- `split_level`: Variable to split data on (default="sample_submitter_id")
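
As a rough illustration of how these settings fit together, the sketch below builds a grid between 10^alpha_min and 10^alpha_max and assigns rows to outer folds by group. It assumes the grid is log-spaced with both endpoints included (comparable to `numpy.logspace`) and that splitting on `split_level` means all tiles of one sample stay in the same fold. Both are assumptions, and the helpers `alpha_grid` and `grouped_folds` are hypothetical.

```python
def alpha_grid(alpha_min=-4, alpha_max=-1, n_steps=40):
    """Log-spaced grid from 10**alpha_min to 10**alpha_max, endpoints included.

    Assumption: the spacing is uniform in the exponent.
    """
    if n_steps < 2:
        return [10.0 ** alpha_min]
    step = (alpha_max - alpha_min) / (n_steps - 1)
    return [10.0 ** (alpha_min + i * step) for i in range(n_steps)]


def grouped_folds(groups, n_folds=5):
    """Assign row indices to folds so that each group lands in exactly one fold."""
    unique = sorted(set(groups))
    fold_of = {group: i % n_folds for i, group in enumerate(unique)}
    folds = [[] for _ in range(n_folds)]
    for index, group in enumerate(groups):
        folds[fold_of[group]].append(index)
    return folds


grid = alpha_grid()
print(len(grid), grid[0], grid[-1])

# Six tiles from three samples, split into two folds by sample identifier.
sample_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
print(grouped_folds(sample_ids, n_folds=2))
```

Keeping a sample's tiles together avoids evaluating a model on tiles from a sample it has already seen during training.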

- `celltype_models_path`: Path to the directory with the models for each cell type; each cell type must have its own folder. For an example of the structure, see the provided models in assets/TF_models/SKCM_FF (default="assets/TF_models/SKCM_FF")
- `prediction_mode`: (default="test")

- `out_prefix`: "dummy"
- `graphs_path`: Path to the file (`.pkl`) with the graphs for all slides. Optional: if left at the default or not set, the graphs are generated.
- `abundance_threshold`: Min. abundance (probability) for assigning a cell type (default=0.5)
- `shapiro_alpha`: Significance level for the Shapiro-Wilk normality test (default=0.05)
- `cutoff_path_length`: Max. path length (default=2)
- `n_clusters`: Number of clusters to generate (default=8)
- `max_dist`: "dummy"
- `max_n_tiles_threshold`: 2
- `tile_size`: Size of tiles in pixels (default=512)
- `overlap`: Overlap of directly neighboring tiles (default=50)
- `metadata_path`: Path to the file with metadata
- `merge_var`: Variable for merging metadata and spatial features (default="slide_submitter_id")
- `sheet_name`: If `metadata_path` points to an Excel file, the name of the sheet to read from.
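
To make the role of `abundance_threshold` concrete, the sketch below labels a tile with every cell type whose predicted probability reaches the threshold. It assumes the comparison is inclusive (greater than or equal), which this README does not specify. The function `assign_cell_types` and the example values are hypothetical.

```python
def assign_cell_types(tile_probs, abundance_threshold=0.5):
    """Return the cell types whose predicted probability reaches the threshold.

    Assumption: the threshold is inclusive (prob >= abundance_threshold).
    """
    return [
        cell_type
        for cell_type, prob in tile_probs.items()
        if prob >= abundance_threshold
    ]


# Made-up probabilities for a single tile, for illustration only.
tile = {"CAFs": 0.72, "Endothelial_cells": 0.10, "T_cells": 0.50, "tumor_purity": 0.31}
print(assign_cell_types(tile))
```

Raising the threshold leaves fewer tiles with an assigned cell type, which in turn affects any spatial features computed from those assignments.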