From c051c95b8abc6bdab7148c05ad288a7d5a62a091 Mon Sep 17 00:00:00 2001
From: Greg Way
Date: Thu, 16 Mar 2017 09:29:48 -0400
Subject: [PATCH] Publication organization (#23)

* update README for release
* update analysis script for only classifier building
* add bash script for building TP53 figures
* remove reference, no longer applicable
* change tissue to cancer in readme
* update readme
* update readme
* update readme
* update readme
* update readme
---
 README.md                       | 92 +++++++++++++++------------------
 scripts/pancancer_classifier.py |  1 -
 scripts/tp53_ddr_figures.sh     | 29 +++++++++++
 tp53_analysis.sh                | 40 +++++++-------
 4 files changed, 90 insertions(+), 72 deletions(-)
 create mode 100644 scripts/tp53_ddr_figures.sh

diff --git a/README.md b/README.md
index b5d24b2..4d62050 100644
--- a/README.md
+++ b/README.md
@@ -6,23 +6,23 @@
 
 A transcriptome can describe the total state of a tumor at a snapshot in time.
 In this repository, we use cancer transcriptomes from The Cancer
-Genome Atlas Pan Cancer dataset to interrogate gene expression states induced
-by deleterious mutations and copy number alterations.
-
-We have previously described the ability of a machine learning classifier to
-detect an NF1 inactivation signature using Glioblastoma data
-([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
-ensemble of logistic regression classifiers to the problem, but the solutions were
-unstable and overfit. To address these issues, we posited that we could leverage
-data from diverse tissue-types to build a pancancer NF1 classifier. We also
-hypothesized that a RAS classifier would be able to detect tumors with NF1
-inactivation since NF1 directly inhibits RAS activity and there are many more
-examples of samples with RAS mutations.
+Genome Atlas Pan Cancer consortium to interrogate gene expression states
+induced by deleterious mutations and copy number alterations.
 
 The code in this repository is flexible and can build a Pan-Cancer classifier
 for any combination of genes and cancer types using gene expression, mutation,
-and copy number data. Currently, we build classifiers to detect NF1/RAS
-aberration and TP53 inactivation.
+and copy number data. In this repository, we provide examples for building
+classifiers to detect aberration in _TP53_ and _NF1_/RAS signalling.
+
+We have previously described the ability of a machine learning classifier to
+detect an _NF1_ inactivation signature using Glioblastoma data
+([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
+ensemble of logistic regression classifiers to the problem, but the solutions
+were unstable and overfit. To address these issues, we posited that we could
+leverage data from diverse cancer types to build a pancancer _NF1_ classifier.
+We also hypothesized that a RAS classifier would be able to detect tumors with
+_NF1_ inactivation since _NF1_ directly inhibits RAS activity and there are
+many more examples of samples with RAS mutations.
 
 ## Controlled Access Data
 
@@ -38,26 +38,17 @@ Eventually, all of the controlled access data used in this pipeline will be
 made public. **We will update this database when the data is officially
 released.**
 
-## Cancer Genes
-
-Note that in order to use the copy number integration feature, an additional
-file must be downloaded. The file is `Supplementary Table S2` of
-[Vogelstein _et al._ 2013]("http://doi.org/10.1126/science.1235122").
-
-Processed data is located here: `data/vogelstein_cancergenes.tsv`
-
 ## Usage
 
 ### Initialization
 
-The pipeline must first be initialized before use. Initialization will
-download and process data and setup computational environment.
+The pipeline must be initialized before use. Initialization will download and
+process data and setup computational environment.
 
-To initialize enter the following in the command line:
+To initialize, enter the following in the command line:
 
 ```sh
 # Login to synapse to download controlled-access data
-# Note, publicly available Xena data is also available for download
 synapse login
 
 # Create and activate conda environment
@@ -70,37 +61,38 @@ source activate pancancer-classifier
 
 ### Example Scripts
 
-We provide two distinct example pipelines for predicting TP53 and RAS/NF1
+We provide two distinct example pipelines for predicting _TP53_ and _NF1_/RAS
 loss of function.
 
-1. TP53 loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
-2. RAS/NF1 loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
+1. _TP53_ loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
+2. _NF1_/RAS loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
 
 ### Customization
 
-For custom analyses, use the `pancancer_classifier.py` script with command line
-arguments.
+For custom analyses, use the
+[scripts/pancancer_classifier.py](scripts/pancancer_classifier.py) script with
+command line arguments.
 
 ```
-python pancancer_classifier.py ...
+python scripts/pancancer_classifier.py ...
 ```
 
-| Flag | Abbreviation | Required/Default | Description |
-| ---- | :----------: | :------: | ----------- |
-| `genes` | `-g` | REQUIRED | Build a classifier for the input gene symbols |
-| `tissues` | `-t` | `Auto` | The tissues to use in building the classifier |
-| `folds` | `-f` | `5` | Number of cross validation folds |
-| `drop` | `-d` | `False` | Decision to drop input genes from expression matrix |
-| `copy_number` | `-u` | `False` | Integrate copy number data to gene event |
-| `filter_count` | `-c` | `15` | Default options to filter tissues if none are specified |
-| `filter_prop` | `-p` | `0.05` | Default options to filter tissues if none are specified |
-| `num_features` | `-n` | `8000` | Number of MAD genes used to build classifier |
-| `alphas` | `-a` | `0.01,0.1,0.15,0.2,0.5,0.8` | The alpha grid to search over in parameter sweep |
-| `l1_ratios` | `-l` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
-| `alt_genes` | `-b` | `None` | Alternative genes to test classifier performance |
-| `alt_tissues` | `-s` | `Auto` | Alternative tissues to test classifier performance |
-| `alt_tissue_count` | `-i` | `15` | Filtering used for alternative tissue classification |
-| `alt_filter_prop` | `-r` | `0.05` | Filtering used for alternative tissue classification |
-| `alt_folder` | `-o` | `Auto` | Location to save all classifier figures |
-| `xena` | `-x` | `False` | If present, use publicly available data for building classifier |
+| Flag | Required/Default | Description |
+| ---- | :--------------: | ----------- |
+| `--genes` | Required | Build a classifier for the input gene symbols |
+| `--diseases` | `Auto` | The disease types to use in building the classifier |
+| `--folds` | `5` | Number of cross validation folds |
+| `--drop` | `False` | Decision to drop input genes from expression matrix |
+| `--copy_number` | `False` | Integrate copy number data to gene event |
+| `--filter_count` | `15` | Default options to filter diseases if none are specified |
+| `--filter_prop` | `0.05` | Default options to filter diseases if none are specified |
+| `--num_features` | `8000` | Number of MAD genes used to build classifier |
+| `--alphas` | `0.1,0.15,0.2,0.5,0.8,1` | The alpha grid to search over in parameter sweep |
+| `--l1_ratios` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
+| `--alt_genes` | `None` | Alternative genes to test classifier performance |
+| `--alt_diseases` | `Auto` | Alternative diseases to test classifier performance |
+| `--alt_filter_count` | `15` | Filtering used for alternative disease classification |
+| `--alt_filter_prop` | `0.05` | Filtering used for alternative disease classification |
+| `--alt_folder` | `Auto` | Location to save all classifier figures |
+| `--remove_hyper` | `False` | Decision to remove hyper mutated tumors |
 
diff --git a/scripts/pancancer_classifier.py b/scripts/pancancer_classifier.py
index 60a13bc..123b3e6 100644
--- a/scripts/pancancer_classifier.py
+++ b/scripts/pancancer_classifier.py
@@ -1,6 +1,5 @@
 """
 Gregory Way 2017
-Heavily modified from https://github.com/cognoma/machine-learning/
 PanCancer Classifier
 scripts/pancancer_classifier.py
 
diff --git a/scripts/tp53_ddr_figures.sh b/scripts/tp53_ddr_figures.sh
new file mode 100644
index 0000000..86397bb
--- /dev/null
+++ b/scripts/tp53_ddr_figures.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+# Pipeline to reproduce figures for the DNA Damage Repair manuscript
+#
+# Usage: bash scripts/tp53_ddr_figures.sh
+#
+# Output: summarizes the results of the TP53 classifier and outputs
+#         several figures and tables
+
+tp53_dir='classifiers/TP53'
+
+# 1. Apply PanCan classifier to all samples and output scores for each sample
+python scripts/apply_weights.py --classifier $tp53_dir --copy_number
+
+# 2. Summarize and visualize performance of classifiers
+python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
+python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
+Rscript --vanilla scripts/ddr_summary_figures.R
+Rscript --vanilla scripts/compare_within_models.R \
+        --within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir
+
+# 3. Perform Snaptron analysis
+# NOTE: Snaptron setup must be performed first. See `pancancer/scripts/snaptron/`
+bash dna_damage_repair_tp53exon.sh
+
+# 4. Perform copy number burden analysis
+python scripts/copy_burden_merge.py --classifier_folder $tp53_dir
+Rscript --vanilla scripts/copy_burden_figures.R
+
diff --git a/tp53_analysis.sh b/tp53_analysis.sh
index 9e37318..44ee94c 100755
--- a/tp53_analysis.sh
+++ b/tp53_analysis.sh
@@ -4,9 +4,8 @@
 #
 # Usage: bash tp53_analysis.sh
 #
-# Output: will run all specified classifiers which will output performance plots
-#         and summarize how a machine learning classifier can detect aberrant
-#         TP53 activity RNAseq, copy number, and gene expression.
+# Output: Will train a pan cancer model to detect TP53 aberration. Will also
+#         train a unique classifier within each specific cancer type
 
 # Set Constants
 tp53_diseases='BLCA,BRCA,CESC,COAD,ESCA,GBM,HNSC,KICH,LGG,LIHC,LUAD,LUSC,'\
@@ -15,24 +14,23 @@ alphas='0.1,0.13,0.15,0.18,0.2,0.3,0.4,0.6,0.7'
 l1_mixing='0.1,0.125,0.15,0.2,0.25,0.3,0.35'
 tp53_dir='classifiers/TP53'
 
-# 1. PanCancer TP53 classification
-python scripts/pancancer_classifier.py --genes 'TP53' --diseases $tp53_diseases \
-        --drop --copy_number --remove_hyper --alt_folder $tp53_dir \
-        --alphas $alphas --l1_ratios $l1_mixing
+# Pan Cancer TP53 classification
+python scripts/pancancer_classifier.py \
+        --genes 'TP53' \
+        --diseases $tp53_diseases \
+        --drop \
+        --copy_number \
+        --remove_hyper \
+        --alt_folder $tp53_dir \
+        --alphas $alphas \
+        --l1_ratios $l1_mixing
 
-# 2. Within disease type TP53 classification
-python scripts/within_tissue_analysis.py --genes 'TP53' \
-        --diseases $tp53_diseases --remove_hyper \
+# Within Disease type TP53 classification
+python scripts/within_tissue_analysis.py \
+        --genes 'TP53' \
+        --diseases $tp53_diseases \
+        --remove_hyper \
         --alt_folder $tp53_dir'/within_disease' \
-        --alphas $alphas --l1_ratios $l1_mixing
-
-# 3. Apply PanCan classifier to all samples and output scores for each sample
-python scripts/apply_weights.py --classifier $tp53_dir --copy_number
-
-# 4. Summarize and visualize performance of classifiers
-python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
-python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
-Rscript --vanilla scripts/ddr_summary_figures.R
-Rscript --vanilla scripts/compare_within_models.R \
-        --within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir
+        --alphas $alphas \
+        --l1_ratios $l1_mixing
 
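
Note on the sweep arguments: `tp53_analysis.sh` passes `--alphas` and `--l1_ratios` as comma-separated strings. The following is a minimal sketch, not the repository's implementation, of how such strings could be expanded into the grid of parameter pairs a sweep would visit. The helper name `parse_grid` is hypothetical, and splitting on commas is an assumption based on the flag format shown in the patch.

```python
from itertools import product


def parse_grid(values):
    """Split a comma-separated string such as '0.1,0.15' into floats.

    Hypothetical helper for illustration; the real script may parse
    its arguments differently.
    """
    return [float(v) for v in values.split(",") if v.strip()]


# Values copied from tp53_analysis.sh in this patch
alphas = parse_grid("0.1,0.13,0.15,0.18,0.2,0.3,0.4,0.6,0.7")
l1_ratios = parse_grid("0.1,0.125,0.15,0.2,0.25,0.3,0.35")

# Every (alpha, l1_ratio) pair a full grid search would evaluate
grid = list(product(alphas, l1_ratios))
print(len(grid))
```

With the values above the grid has 9 x 7 = 63 combinations, which gives a rough sense of how many models each cross-validation fold would fit if the sweep is exhaustive.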