A pipeline that accurately simulates high quality publicly cancer genomes (VCFs) for eight different tumor types: Breast-AdenoCa, CNS-PiloAstro, Eso-AdenoCa, Kidney-RCC, Liver-HCC, Lymph-CLL, Panc-Endocrine and Prost-AdenoCa. OncoGAN offers a solution to current challenges in data accessibility and privacy while also serving as a powerful tool for enhancing algorithm development and benchmarking.
In addition to this pipeline, we have released 100 simulated VCFs for each of the eight studied tumor types, and that are availbale on HuggingFace and Zotero.
We have created three docker images with all dependencies installed as there are version incompatibility issues between the different modules:
- Training -> Environment and scripts used to train OncoGAN models (CUDA)
- Simulating -> Pipeline for synthetic tumor simulation (CPU only)
- DeepTumour -> Algorithm developed to detect the tumor type of origin based o somatic mutations (Ref)
However, due to the size of the models, they couldn’t be stored in the Docker images and need to be downloaded separately (see Download models section below).
If you don't have docker already installed in your system, please follow these instructions.
# Training
docker pull oicr/oncogan:training_v0.1
# Simulating
docker pull oicr/oncogan:simulating_v0.1
# DeepTumour
docker pull oicr/oncogan:deeptumour
If you don't have singularity already installed in your system, please follow these instructions.
# Training
singularity build oncogan_training_v0.1.sif docker://oicr/oncogan:training_v0.1
# Simulating
singularity build oncogan_simulating_v0.1.sif docker://oicr/oncogan:simulating_v0.1
# DeepTumour
singularity build deeptumour.sif docker://oicr/oncogan:deeptumour
OncoGAN trained models for the eight tumor types and DeepTumour models can be found on HuggingFace and Zotero.
OncoGAN needs two external inputs to simulate new samples:
- The directory with OncoGAN models downloaded previously
- hg19 fasta reference genome without the chr prefix
The output is a VCF file per donor. Since the PCAWG dataset used for training refers to the hg19 version of the genome, the new mutations are also aligned to that version. The integrated LiftOver
version can be used to swicht to hg38.
# Docker command
docker run --rm -u $(id -u):$(id -g) \
-v $(pwd):/home \
-v /PATH_TO_HG19_DIR/:/reference \
-v /PATH_TO_ONCOGAN_MODELS/:/oncoGAN/trained_models \
-it oicr/oncogan:simulating_v0.1 \
vcfGANerator -n 1 --tumor Breast-AdenoCa -r /reference/hs37d5.fa [--hg38]
# Singularity command
singularity exec -H ${pwd}:/home \
-B /PATH_TO_HG19_DIR/:/reference \
-B /PATH_TO_ONCOGAN_MODELS/:/oncoGAN/trained_models \
/PATH_TO/oncogan_simulating_v0.1.sif launcher.py \
vcfGANerator -n 1 --tumor Breast-AdenoCa -r /reference/hs37d5.fa [--hg38]
The options for the vcfGANerator
function are:
vcfGANerator --help
# Command to simulate mutations (VCF) for different tumor types using a GAN model
# Options:
# -@, --cpus INTEGER Number of CPUs to use [default: 1]
# --tumor TEXT Tumor type to be simulated. Run 'availTumors'
# subcommand to check the list of available tumors that
# can be simulated [required]
# -n, --nCases INTEGER Number of cases to simulate [default: 1]
# -r, --refGenome PATH hg19 reference genome in fasta format [required]
# --prefix TEXT Prefix to name the output. If not, '--tumor' option is
# used as prefix
# --outDir DIRECTORY Directory where save the simulations. Default is the
# current directory
# --hg38 Transform the mutations to hg38
# --help Show this message and exit
Among all the options offered by docker (docker run --help
), we recommend:
--rm
: Automatically remove the container when it exits.-u, --user
: Specify the user ID and its group ID. It's useful to not run the pipeline as root.-v, --volume
: Mount local volumes in the container.- With the option
-v $(pwd):/home/
, OncoGAN results will be in your current directory.
- With the option
-i, --interactive
: Keep STDIN open even if not attached.-t, --tty
: Allocate a pseudo-TTY. When combined with-i
it allows you to connect your terminal with the container terminal.
For singularity, the -H
and -B
options are analogous to -v
docker option.
List of available tumors:
docker run --rm -it oicr/oncogan:simulating_v0.1 availTumors
# or
singularity exec /PATH_TO/oncogan_simulating_v0.1.sif launcher.py availTumors
# This is the list of available tumor types that can be simulated using OncoGAN:
# Breast-AdenoCa CNS-PiloAstro Eso-AdenoCa Kidney-RCC
# Liver-HCC Lymph-CLL Panc-Endocrine Prost-AdenoCA
Files used to train OncoGAN models can be found HuggingFace and Zotero. The directory containing these files or your custom training files need to be mounted into the docker/singularity container.
We used two different training approaches:
- CTAB-GAN+ -> To train donor characteristics, drivers and mutational signatures
- CTGAN + TVAE -> To train genomic positions
The baseline command to run the docker/singularity container is the same for all the different training modalities:
docker run --rm -u $(id -u):$(id -g) \
-v $(pwd):/home \
-v /PATH_TO_TRAINING_FILES/:/inputs \
-p 8890:8890 \ #Only for CTGAN/TVAE models
-it oicr/oncogan:training_v0.1 --help
# or
singularity exec -H ${pwd}:/home \
-B /PATH_TO_TRAINING_FILES/:/inputs \
/PATH_TO/oncogan_training_v0.1.sif launcher.py --help
# Options:
# --help Show this message and exit.
# Commands:
# jupyter Launches a jupyter lab instance
# testHyperparameters Test hyperparameters for counts/drivers CTABGAN models
# trainCounts Train a counts CTABGAN model
# trainDrivers Train a drivers CTABGAN model
# trainMutations Train a mutations CTABGAN model
# useModel Use a driver/count CTABGAN model to generate synthetic data
Commands to train donor characteristics, drivers and mutational signatures models, respectively. The output will be a model and a simulated file with the same format and shape as the training file.
# Baseline command +
trainCounts/trainDrivers/trainMutations --help
# Options:
# --csv PATH CSV file with the mutations used to train the model [required]
# --prefix TEXT Prefix to use to save the model [required]
# --outdir DIRECTORY Directory where save the model [default: /home]
# --epochs INTEGER Number of epochs
# --batch_size INTEGER Batch size
# --test_ratio FLOAT Test ratio
# --lr FLOAT Learning rate
# --help Show this message and exit
Command to automatically test a combination of hyperparameters for donor characteristics (counts) and drivers models. The output will be a model and a simulated file for each hyperparameter combination.
# Baseline command +
testHyperparameters --help
# Test hyperparameters for counts/drivers CTABGAN models
# Options:
# --cpu INTEGER Number of CPUs to use [default: 1]
# --function [counts|drivers] Function to test hyperparameters for [required]
# --csv PATH CSV file with the counts used to train the model [required]
# --prefix TEXT Prefix to use to save the model [required]
# --outdir DIRECTORY Directory where save the model [default: /home]
# --epochs <INTEGER INTEGER INTEGER>...
# A list with an epoch range: start stop step
# --batch_size <INTEGER INTEGER INTEGER>...
# A list with a batch_size range: start stop step
# --lr <FLOAT FLOAT FLOAT>... A list with a learning rate range: start stop step
# --debug Greater verbosity for debugging purposes
# --help Show this message and exit
An example of how to specify the range of the hyperparameters to test would be:
# Baseline command +
testHyperparameters --cpu 1 --function counts --csv /inputs/counts/Breast-AdenoCa_counts.csv --prefix Breast-AdenoCa --epochs 100 500 20 --batch_size 10 30 5 --lr 0.001 0.01 0.001
Training commands return the model and a simulated file with the same format and shape as the training file. To generate more simulated files to check the performance of the model you can run the useModel
command:
# Baseline command +
useModel --help
# Use a driver/count CTABGAN model to generate synthetic data
# Options:
# --model PATH Model to use for the simulations (counts/drivers) [required]
# --outdir DIRECTORY Directory where save the simulations [default: /home]
# --nDonors INTEGER Number of donors to simulate for each simulation [default: 1]
# --nFiles INTEGER Number of files to simulate [default: 1]
# --help Show this message and exit
To train genomic position models using the CTGAN/TVAE approach we use a jupyter notebook inside the docker/singularity container. The output will be a model and a simulated file with the same format and shape as the training file.
The notebook with all the steps to train the models can be found here.
# Baseline command +
jupyter --help
# Launches a jupyter lab instance
# Options:
# --port INTEGER Port to launch jupyter lab on [default: 8890]
# --help Show this message and exit.
DeepTumour is a tool that can predict the tumor type of origin based on the pattern of somatic mutations. We used a second version of this tool, that can predict 29 tumor types instead of 24, to validate that our simulations were correctly assigned to their training tumor type. We also trained a new model using a mix of real and synthetic donors, improving the overall accuracy of the model. Both the original and the new model are available on HuggingFace and Zotero. To use them:
docker run --rm -u $(id -u):$(id -g) \
-v $(pwd):/home \
-v /PATH_TO_DEEPTUMOUR_MODEL/:/DeepTumour/trained_models \
-v /PATH_TO_HG19_DIR/:/reference
-it oicr/oncogan:deeptumour --help
# or
singularity exec -H ${pwd}:/home \
-B /PATH_TO_DEEPTUMOUR_MODEL/:/DeepTumour/trained_models \
-B /PATH_TO_HG19_DIR/:/reference
/PATH_TO/oncogan_deeptumour.sif DeepTumour.py --help
# Predict cancer type from a VCF file using DeepTumour
# Options:
# --vcfFile PATH VCF file to analyze [Use --vcfFile or --vcfDir]
# --vcfDir DIRECTORY Directory with VCF files to analyze [Use --vcfFile or --vcfDir]
# --reference PATH hg19 reference genome in fasta format [required]
# --hg38 Use this tag if your VCF is in hg38 coordinates
# --keep_input Use this tag to also save DeepTumour input as a csv file
# --outDir DIRECTORY Directory where save DeepTumour results. Default is the current directory
# --help Show this message and exit.