Abstract: Remarkable strides in computational pathology (CPath) have been made with task-agnostic foundation models (FMs) that advance the performance of a wide array of downstream clinical tasks. Despite this promising performance, several challenges remain. First, prior works have resorted to either vision-only or vision-caption data, disregarding invaluable pathology reports and gene expression profiles, each of which offers distinct knowledge for versatile clinical applications. Second, current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Here we curated the largest multimodal dataset, consisting of H&E diagnostic whole-slide images and their associated pathology reports and RNA-Seq data, resulting in 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm, Multimodal Self-TAught PRetraining (mSTAR), which injects multimodal knowledge at the whole-slide context into the pathology FM. The proposed paradigm revolutionizes the pretraining workflow for CPath, enabling the pathology FM to acquire whole-slide context. To our knowledge, this is the first attempt to incorporate multimodal knowledge at the slide level for enhancing pathology FMs, expanding the modelling context from unimodal to multimodal knowledge and from the patch level to the slide level. To systematically evaluate the capabilities of mSTAR, extensive experiments, covering slide-level unimodal and multimodal applications, are conducted across 7 diverse types of tasks and 43 subtasks, constituting the largest spectrum of downstream tasks to date. The average performance across these slide-level applications consistently demonstrates significant performance enhancements for mSTAR compared with SOTA FMs.
This repo has been tested on the following system and GPU:
- Ubuntu 22.04.3 LTS
- NVIDIA H800 PCIe 80GB
First clone the repo and cd into the directory:
git clone https://github.com/Innse/mSTAR.git
cd mSTAR
To get started, create a conda environment containing the required dependencies:
conda env create -f mSTAR.yml
Activate the environment:
conda activate mSTAR
Request access to the model weights from the 🤗Huggingface model page at: https://huggingface.co/Wangyh/mSTAR
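Once access is granted, the checkpoint can also be fetched programmatically. Below is a minimal sketch using huggingface_hub; the filename mSTAR.pth is an assumption based on the loading example that follows.
# Sketch: download the mSTAR checkpoint (requires `huggingface-cli login` with an account that has been granted access)
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(repo_id="Wangyh/mSTAR", filename="mSTAR.pth")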
We use the timm library to define the ViT-L/16 model architecture. Pretrained weights and image transforms for mSTAR need to be loaded and defined manually.
import timm
from torchvision import transforms
import torch
ckpt_path = 'where you store the mSTAR.pth file'
transform = transforms.Compose(
[
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
]
)
model = timm.create_model(
"vit_large_patch16_224", img_size=224, patch_size=16, init_values=1e-5, num_classes=0, dynamic_img_size=True
)
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"), strict=True)
model.eval()
You can use the mSTAR pretrained encoder to extract features from histopathology patches, as follows:
from PIL import Image
image = Image.open("patch.png").convert("RGB")  # ensure 3 channels before ToTensor
image = transform(image).unsqueeze(dim=0)       # shape: [1, 3, 224, 224]
with torch.inference_mode():
    feature_emb = model(image)                  # shape: [1, 1024]
You can also try it in tutorial.ipynb.
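If you want to embed an entire folder of patch images without going through the CLAM pipeline below, a minimal batched loop could look like the following sketch. The patches/ folder of *.png files is a hypothetical layout, and model and transform are reused from the snippet above.
# Sketch: batch feature extraction over a folder of patch PNGs (folder layout is an assumption)
import os
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class PatchFolder(Dataset):
    def __init__(self, folder, transform):
        self.paths = sorted(os.path.join(folder, f) for f in os.listdir(folder) if f.endswith(".png"))
        self.transform = transform
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        return self.transform(Image.open(self.paths[idx]).convert("RGB"))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
loader = DataLoader(PatchFolder("patches", transform), batch_size=256, num_workers=4)
embeddings = []
with torch.inference_mode():
    for batch in loader:
        embeddings.append(model(batch.to(device)).cpu())
embeddings = torch.cat(embeddings)  # [num_patches, 1024]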
Meanwhile, we provide an example showing how to conduct feature extraction on TCGA-LUSC based on CLAM.
In Feature_extract/LUSC.sh, you need to set the following directories:
- DATA_DIRECTORY: This should be set to the directory which contains the WSI data.
- DIR_TO_COORDS: This should be set to the directory that contains the coordinate information for the WSI patches preprocessed through CLAM.
- FEATURES_DIRECTORY: This is the directory where you want to store the extracted features.
models='mSTAR'
declare -A gpus
gpus['mSTAR']=0
CSV_FILE_NAME="./dataset_csv/LUSC.csv"
DIR_TO_COORDS="path/DIR_TO_COORDS"
DATA_DIRECTORY="path/DATA_DIRECTORY"
FEATURES_DIRECTORY="path/features"
ext=".svs"
for model in $models
do
echo $model", GPU is:"${gpus[$model]}
export CUDA_VISIBLE_DEVICES=${gpus[$model]}
python extract_feature.py \
--data_h5_dir $DIR_TO_COORDS \
--data_slide_dir $DATA_DIRECTORY \
--csv_path $CSV_FILE_NAME \
--feat_dir $FEATURES_DIRECTORY \
--batch_size 256 \
--model $model \
--slide_ext $ext
done
For more details about feature extraction, please check here.
We currently support the following downstream tasks:
- Slide-level Diagnostic Tasks
- Molecular Prediction
- Cancer Survival Prediction
- Multimodal Survival Analysis
- Few-shot Slide Classification
- Zero-shot Slide Classification
- Report Generation
Here is a simple demo showing how to conduct cancer survival prediction on TCGA-LUSC:
cd downstream_task/survival_prediction
The feature directory should look like:
TCGA-LUSC
└─pt_files
└─mSTAR
├── feature_1.pt
├── feature_2.pt
├── feature_3.pt
└── ...
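Each .pt file holds the mSTAR patch embeddings for one slide. A quick sanity check before training might look like the following sketch; the filename is illustrative and the [num_patches, 1024] shape is an assumption based on the ViT-L/16 backbone.
# Sketch: inspect one extracted slide-level feature file (filename is illustrative)
import torch
feats = torch.load("TCGA-LUSC/pt_files/mSTAR/feature_1.pt", map_location="cpu")
print(feats.shape)  # expected: [num_patches, 1024]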
You need to specify the path to the feature directory and choose the model. Once all settings are in place, you can run the following commands.
feature_path='/feature_path' #change here
studies='LUSC'
models='AttMIL'
features='mSTAR'
lr=2e-4
# ckpt for pretrained aggregator
# aggregator='aggregator'
# export WANDB_MODE=dryrun
cd ..
for feature in $features
do
for study in $studies
do
for model in $models
do
CUDA_VISIBLE_DEVICES=0 python main.py --model $model \
--csv_file ./dataset_csv/${study}_Splits.csv \
--feature_path $feature_path \
--study $study \
--modal WSI \
--num_epoch 30 \
--batch_size 1 \
--lr $lr \
--feature $feature
done
done
done
Running this demo takes around 10 minutes with AttMIL. For more details about survival prediction, please check here.
The project was built on top of amazing repositories such as UNI, CLAM and OpenCLIP. We thank the authors and developers for their contribution.
If you find our work useful in your research, or if you use parts of this code, please consider citing our paper:
Xu, Y., Wang, Y., Zhou, F., Ma, J., Yang, S., Lin, H., ... & Chen, H. (2024). A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model. arXiv preprint arXiv:2407.15362.
@misc{xu2024multimodalknowledgeenhancedwholeslidepathology,
title={A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model},
author={Yingxue Xu and Yihui Wang and Fengtao Zhou and Jiabo Ma and Shu Yang and Huangjing Lin and Xin Wang and Jiguang Wang and Li Liang and Anjia Han and Ronald Cheong Kin Chan and Hao Chen},
year={2024},
eprint={2407.15362},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.15362},
}
© SmartLab. This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of the mSTAR model and its derivatives, which include models trained on outputs from the mSTAR model or datasets created from the mSTAR model, is prohibited and requires prior approval.
If you have any questions, feel free to email Yingxue XU and Yihui WANG.