From c051c95b8abc6bdab7148c05ad288a7d5a62a091 Mon Sep 17 00:00:00 2001
From: Greg Way <gregory.way@gmail.com>
Date: Thu, 16 Mar 2017 09:29:48 -0400
Subject: [PATCH] Publication organization (#23)

* update README for release

* update analysis script for only classifier building

* add bash script for building TP53 figures

* remove reference, no longer applicable

* change tissue to cancer in readme

* update readme

* update readme

* update readme

* update readme

* update readme
---
 README.md                       | 92 +++++++++++++++------------------
 scripts/pancancer_classifier.py |  1 -
 scripts/tp53_ddr_figures.sh     | 29 +++++++++++
 tp53_analysis.sh                | 40 +++++++-------
 4 files changed, 90 insertions(+), 72 deletions(-)
 create mode 100644 scripts/tp53_ddr_figures.sh
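
Reviewer note: the README table in this patch replaces the short flags with long `--flag` names and lists comma-separated defaults for `--alphas` and `--l1_ratios`. As a rough illustration of how such an interface could be parsed, here is a minimal sketch. It is not the code in `scripts/pancancer_classifier.py`. The flag names and default values are copied from the table; `parse_float_grid` and `build_parser` are hypothetical helper names, and treating `--drop`, `--copy_number`, and `--remove_hyper` as boolean switches is an assumption based on their `False` defaults.

```python
import argparse


def parse_float_grid(text):
    """Turn a comma-separated string such as '0.1,0.15,0.2' into a list of floats."""
    return [float(item) for item in text.split(",") if item.strip()]


def build_parser():
    # Hypothetical parser: only the flag names and defaults come from the
    # README table; the argument types and switch behavior are assumptions.
    parser = argparse.ArgumentParser(description="Sketch of the classifier CLI")
    parser.add_argument("--genes", required=True,
                        help="comma-separated gene symbols, e.g. 'TP53'")
    parser.add_argument("--diseases", default="Auto")
    parser.add_argument("--folds", type=int, default=5)
    parser.add_argument("--drop", action="store_true")
    parser.add_argument("--copy_number", action="store_true")
    parser.add_argument("--remove_hyper", action="store_true")
    parser.add_argument("--alphas", type=parse_float_grid,
                        default=parse_float_grid("0.1,0.15,0.2,0.5,0.8,1"))
    parser.add_argument("--l1_ratios", type=parse_float_grid,
                        default=parse_float_grid("0,0.1,0.15,0.18,0.2,0.3"))
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--genes", "TP53", "--drop", "--alphas", "0.1,0.13,0.15"]
)
print(args.genes.split(","), args.drop, args.alphas, args.l1_ratios)
```

Converting each grid to a list of floats at parse time means a malformed value such as `0.1,abc` is rejected by `argparse` before any model fitting would begin.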

diff --git a/README.md b/README.md
index b5d24b2..4d62050 100644
--- a/README.md
+++ b/README.md
@@ -6,23 +6,23 @@
 
 A transcriptome can describe the total state of a tumor at a snapshot
 in time. In this repository, we use cancer transcriptomes from The Cancer
-Genome Atlas Pan Cancer dataset to interrogate gene expression states induced
-by deleterious mutations and copy number alterations.
-
-We have previously described the ability of a machine learning classifier to
-detect an NF1 inactivation signature using Glioblastoma data
-([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
-ensemble of logistic regression classifiers to the problem, but the solutions were
-unstable and overfit. To address these issues, we posited that we could leverage
-data from diverse tissue-types to build a pancancer NF1 classifier. We also
-hypothesized that a RAS classifier would be able to detect tumors with NF1
-inactivation since NF1 directly inhibits RAS activity and there are many more
-examples of samples with RAS mutations.
+Genome Atlas Pan Cancer consortium to interrogate gene expression states
+induced by deleterious mutations and copy number alterations.
 
 The code in this repository is flexible and can build a Pan-Cancer classifier
 for any combination of genes and cancer types using gene expression, mutation,
-and copy number data. Currently, we build classifiers to detect NF1/RAS
-aberration and TP53 inactivation.
+and copy number data. In this repository, we provide examples for building
+classifiers to detect aberration in _TP53_ and _NF1_/RAS signalling.
+
+We have previously described the ability of a machine learning classifier to
+detect an _NF1_ inactivation signature using Glioblastoma data
+([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
+ensemble of logistic regression classifiers to the problem, but the solutions
+were unstable and overfit. To address these issues, we posited that we could
+leverage data from diverse cancer types to build a pancancer _NF1_ classifier.
+We also hypothesized that a RAS classifier would be able to detect tumors with
+_NF1_ inactivation since _NF1_ directly inhibits RAS activity and there are
+many more examples of samples with RAS mutations.
 
 ## Controlled Access Data
 
@@ -38,26 +38,17 @@ Eventually, all of the controlled access data used in this pipeline will be
 made public. **We will update this database when the data is officially
 released.**
 
-## Cancer Genes
-
-Note that in order to use the copy number integration feature, an additional
-file must be downloaded. The file is `Supplementary Table S2` of
-[Vogelstein _et al._ 2013]("http://doi.org/10.1126/science.1235122"). 
-
-Processed data is located here: `data/vogelstein_cancergenes.tsv`
-
 ## Usage
 
 ### Initialization
 
-The pipeline must first be initialized before use. Initialization will
-download and process data and setup computational environment.
+The pipeline must be initialized before use. Initialization will download and
+process data and set up the computational environment.
 
-To initialize enter the following in the command line:
+To initialize, enter the following in the command line:
 
 ```sh
 # Login to synapse to download controlled-access data
-# Note, publicly available Xena data is also available for download
 synapse login
 
 # Create and activate conda environment
@@ -70,37 +61,38 @@ source activate pancancer-classifier
 
 ### Example Scripts
 
-We provide two distinct example pipelines for predicting TP53 and RAS/NF1
+We provide two distinct example pipelines for predicting _TP53_ and _NF1_/RAS
 loss of function.
 
-1. TP53 loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
-2. RAS/NF1 loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
+1. _TP53_ loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
+2. _NF1_/RAS loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
 
 ### Customization
 
-For custom analyses, use the `pancancer_classifier.py` script with command line
-arguments.
+For custom analyses, use the
+[scripts/pancancer_classifier.py](scripts/pancancer_classifier.py) script with
+command line arguments.
 
 ```
-python pancancer_classifier.py ...
+python scripts/pancancer_classifier.py ...
 ```
 
-| Flag | Abbreviation | Required/Default | Description |
-| ---- | :----------: | :------: | ----------- |
-| `genes` | `-g` | REQUIRED |  Build a classifier for the input gene symbols |
-| `tissues` | `-t` | `Auto` | The tissues to use in building the classifier |
-| `folds` | `-f` | `5` | Number of cross validation folds |
-| `drop` | `-d` | `False` | Decision to drop input genes from expression matrix |
-| `copy_number` | `-u` | `False` | Integrate copy number data to gene event |
-| `filter_count` | `-c` | `15` | Default options to filter tissues if none are specified |
-| `filter_prop` | `-p` | `0.05` | Default options to filter tissues if none are specified |
-| `num_features` | `-n` | `8000` | Number of MAD genes used to build classifier |
-| `alphas` | `-a` | `0.01,0.1,0.15,0.2,0.5,0.8` | The alpha grid to search over in parameter sweep |
-| `l1_ratios` | `-l` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
-| `alt_genes` | `-b` | `None` | Alternative genes to test classifier performance |
-| `alt_tissues` | `-s` | `Auto` | Alternative tissues to test classifier performance |
-| `alt_tissue_count` | `-i` | `15` | Filtering used for alternative tissue classification |
-| `alt_filter_prop` | `-r` | `0.05` | Filtering used for alternative tissue classification |
-| `alt_folder` | `-o` | `Auto` | Location to save all classifier figures |
-| `xena` | `-x` | `False` | If present, use publicly available data for building classifier |
+| Flag | Required/Default | Description |
+| ---- | :--------------: | ----------- |
+| `--genes` | Required |  Build a classifier for the input gene symbols |
+| `--diseases` | `Auto` | The disease types to use in building the classifier |
+| `--folds` | `5` | Number of cross validation folds |
+| `--drop` |  `False` | Decision to drop input genes from expression matrix |
+| `--copy_number` |  `False` | Integrate copy number data to gene event |
+| `--filter_count` |  `15` | Default options to filter diseases if none are specified |
+| `--filter_prop` |  `0.05` | Default options to filter diseases if none are specified |
+| `--num_features` |  `8000` | Number of MAD genes used to build classifier |
+| `--alphas` | `0.1,0.15,0.2,0.5,0.8,1` | The alpha grid to search over in parameter sweep |
+| `--l1_ratios` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
+| `--alt_genes` | `None` | Alternative genes to test classifier performance |
+| `--alt_diseases` |  `Auto` | Alternative diseases to test classifier performance |
+| `--alt_filter_count` | `15` | Filtering used for alternative disease classification |
+| `--alt_filter_prop` |  `0.05` | Filtering used for alternative disease classification |
+| `--alt_folder` | `Auto` | Location to save all classifier figures |
+| `--remove_hyper` | `False` | Decision to remove hypermutated tumors |
 
diff --git a/scripts/pancancer_classifier.py b/scripts/pancancer_classifier.py
index 60a13bc..123b3e6 100644
--- a/scripts/pancancer_classifier.py
+++ b/scripts/pancancer_classifier.py
@@ -1,6 +1,5 @@
 """
 Gregory Way 2017
-Heavily modified from https://github.com/cognoma/machine-learning/
 PanCancer Classifier
 scripts/pancancer_classifier.py
 
diff --git a/scripts/tp53_ddr_figures.sh b/scripts/tp53_ddr_figures.sh
new file mode 100644
index 0000000..86397bb
--- /dev/null
+++ b/scripts/tp53_ddr_figures.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+# Pipeline to reproduce figures for the DNA Damage Repair manuscript
+#
+# Usage: bash scripts/tp53_ddr_figures.sh
+#
+# Output: summarizes the results of the TP53 classifier and outputs
+#         several figures and tables
+
+tp53_dir='classifiers/TP53'
+
+# 1. Apply PanCan classifier to all samples and output scores for each sample
+python scripts/apply_weights.py --classifier $tp53_dir --copy_number
+
+# 2. Summarize and visualize performance of classifiers
+python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
+python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
+Rscript --vanilla scripts/ddr_summary_figures.R
+Rscript --vanilla scripts/compare_within_models.R \
+        --within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir
+
+# 3. Perform Snaptron analysis
+# NOTE: Snaptron setup must be performed first. See `pancancer/scripts/snaptron/`
+bash dna_damage_repair_tp53exon.sh
+
+# 4. Perform copy number burden analysis
+python scripts/copy_burden_merge.py --classifier_folder $tp53_dir
+Rscript --vanilla scripts/copy_burden_figures.R
+
diff --git a/tp53_analysis.sh b/tp53_analysis.sh
index 9e37318..44ee94c 100755
--- a/tp53_analysis.sh
+++ b/tp53_analysis.sh
@@ -4,9 +4,8 @@
 #
 # Usage: bash tp53_analysis.sh
 #
-# Output: will run all specified classifiers which will output performance plots
-#         and summarize how a machine learning classifier can detect aberrant
-#         TP53 activity RNAseq, copy number, and gene expression.
+# Output: Trains a pan-cancer model to detect TP53 aberration and also trains a
+#         separate classifier within each specific cancer type.
 
 # Set Constants
 tp53_diseases='BLCA,BRCA,CESC,COAD,ESCA,GBM,HNSC,KICH,LGG,LIHC,LUAD,LUSC,'\
@@ -15,24 +14,23 @@ alphas='0.1,0.13,0.15,0.18,0.2,0.3,0.4,0.6,0.7'
 l1_mixing='0.1,0.125,0.15,0.2,0.25,0.3,0.35'
 tp53_dir='classifiers/TP53'
 
-# 1. PanCancer TP53 classification
-python scripts/pancancer_classifier.py --genes 'TP53' --diseases $tp53_diseases \
-        --drop --copy_number --remove_hyper --alt_folder $tp53_dir \
-        --alphas $alphas --l1_ratios $l1_mixing
+# Pan Cancer TP53 classification
+python scripts/pancancer_classifier.py \
+        --genes 'TP53' \
+        --diseases $tp53_diseases \
+        --drop \
+        --copy_number \
+        --remove_hyper \
+        --alt_folder $tp53_dir \
+        --alphas $alphas \
+        --l1_ratios $l1_mixing
 
-# 2. Within disease type TP53 classification
-python scripts/within_tissue_analysis.py --genes 'TP53' \
-        --diseases $tp53_diseases --remove_hyper \
+# Within disease type TP53 classification
+python scripts/within_tissue_analysis.py \
+        --genes 'TP53' \
+        --diseases $tp53_diseases \
+        --remove_hyper \
         --alt_folder $tp53_dir'/within_disease' \
-        --alphas $alphas --l1_ratios $l1_mixing
-
-# 3. Apply PanCan classifier to all samples and output scores for each sample
-python scripts/apply_weights.py --classifier $tp53_dir --copy_number
-
-# 4. Summarize and visualize performance of classifiers
-python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
-python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
-Rscript --vanilla scripts/ddr_summary_figures.R
-Rscript --vanilla scripts/compare_within_models.R \
-        --within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir 
+        --alphas $alphas \
+        --l1_ratios $l1_mixing