Publication organization (#23)
* update README for release

* update analysis script for only classifier building

* add bash script for building TP53 figures

* remove reference, no longer applicable

* change tissue to cancer in readme

* update readme

* update readme

* update readme

* update readme

* update readme
gwaybio authored Mar 16, 2017
1 parent f010dfd commit c051c95
Showing 4 changed files with 90 additions and 72 deletions.
92 changes: 42 additions & 50 deletions README.md
@@ -6,23 +6,23 @@

A transcriptome can describe the total state of a tumor at a snapshot
in time. In this repository, we use cancer transcriptomes from The Cancer
Genome Atlas Pan Cancer dataset to interrogate gene expression states induced
by deleterious mutations and copy number alterations.

We have previously described the ability of a machine learning classifier to
detect an NF1 inactivation signature using Glioblastoma data
([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
ensemble of logistic regression classifiers to the problem, but the solutions were
unstable and overfit. To address these issues, we posited that we could leverage
data from diverse tissue-types to build a pancancer NF1 classifier. We also
hypothesized that a RAS classifier would be able to detect tumors with NF1
inactivation since NF1 directly inhibits RAS activity and there are many more
examples of samples with RAS mutations.
Genome Atlas Pan Cancer consortium to interrogate gene expression states
induced by deleterious mutations and copy number alterations.

The code in this repository is flexible and can build a Pan-Cancer classifier
for any combination of genes and cancer types using gene expression, mutation,
and copy number data. Currently, we build classifiers to detect NF1/RAS
aberration and TP53 inactivation.
and copy number data. In this repository, we provide examples for building
classifiers to detect aberration in _TP53_ and _NF1_/RAS signalling.

We have previously described the ability of a machine learning classifier to
detect an _NF1_ inactivation signature using Glioblastoma data
([Way _et al._ 2016](http://doi.org/10.1186/s12864-017-3519-7)). We applied an
ensemble of logistic regression classifiers to the problem, but the solutions
were unstable and overfit. To address these issues, we posited that we could
leverage data from diverse cancer types to build a pancancer _NF1_ classifier.
We also hypothesized that a RAS classifier would be able to detect tumors with
_NF1_ inactivation since _NF1_ directly inhibits RAS activity and there are
many more examples of samples with RAS mutations.

## Controlled Access Data

@@ -38,26 +38,17 @@ Eventually, all of the controlled access data used in this pipeline will be
made public. **We will update this database when the data is officially
released.**

## Cancer Genes

Note that in order to use the copy number integration feature, an additional
file must be downloaded. The file is `Supplementary Table S2` of
[Vogelstein _et al._ 2013](http://doi.org/10.1126/science.1235122).

Processed data is located here: `data/vogelstein_cancergenes.tsv`

## Usage

### Initialization

The pipeline must first be initialized before use. Initialization will
download and process data and setup computational environment.
The pipeline must be initialized before use. Initialization will download and
process data and set up the computational environment.

To initialize enter the following in the command line:
To initialize, enter the following in the command line:

```sh
# Login to synapse to download controlled-access data
# Note, publicly available Xena data is also available for download
synapse login

# Create and activate conda environment
@@ -70,37 +61,38 @@ source activate pancancer-classifier

### Example Scripts

We provide two distinct example pipelines for predicting TP53 and RAS/NF1
We provide two distinct example pipelines for predicting _TP53_ and _NF1_/RAS
loss of function.

1. TP53 loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
2. RAS/NF1 loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))
1. _TP53_ loss of function (see [tp53_analysis.sh](tp53_analysis.sh))
2. _NF1_/RAS loss of function (see [ras_nf1_analysis.sh](ras_nf1_analysis.sh))

### Customization

For custom analyses, use the `pancancer_classifier.py` script with command line
arguments.
For custom analyses, use the
[scripts/pancancer_classifier.py](scripts/pancancer_classifier.py) script with
command line arguments.

```
python pancancer_classifier.py ...
python scripts/pancancer_classifier.py ...
```

| Flag | Abbreviation | Required/Default | Description |
| ---- | :----------: | :------: | ----------- |
| `genes` | `-g` | REQUIRED | Build a classifier for the input gene symbols |
| `tissues` | `-t` | `Auto` | The tissues to use in building the classifier |
| `folds` | `-f` | `5` | Number of cross validation folds |
| `drop` | `-d` | `False` | Decision to drop input genes from expression matrix |
| `copy_number` | `-u` | `False` | Integrate copy number data to gene event |
| `filter_count` | `-c` | `15` | Default options to filter tissues if none are specified |
| `filter_prop` | `-p` | `0.05` | Default options to filter tissues if none are specified |
| `num_features` | `-n` | `8000` | Number of MAD genes used to build classifier |
| `alphas` | `-a` | `0.01,0.1,0.15,0.2,0.5,0.8` | The alpha grid to search over in parameter sweep |
| `l1_ratios` | `-l` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
| `alt_genes` | `-b` | `None` | Alternative genes to test classifier performance |
| `alt_tissues` | `-s` | `Auto` | Alternative tissues to test classifier performance |
| `alt_tissue_count` | `-i` | `15` | Filtering used for alternative tissue classification |
| `alt_filter_prop` | `-r` | `0.05` | Filtering used for alternative tissue classification |
| `alt_folder` | `-o` | `Auto` | Location to save all classifier figures |
| `xena` | `-x` | `False` | If present, use publicly available data for building classifier |
| Flag | Required/Default | Description |
| ---- | :--------------: | ----------- |
| `--genes` | Required | Build a classifier for the input gene symbols |
| `--diseases` | `Auto` | The disease types to use in building the classifier |
| `--folds` | `5` | Number of cross validation folds |
| `--drop` | `False` | Decision to drop input genes from expression matrix |
| `--copy_number` | `False` | Integrate copy number data to gene event |
| `--filter_count` | `15` | Default options to filter diseases if none are specified |
| `--filter_prop` | `0.05` | Default options to filter diseases if none are specified |
| `--num_features` | `8000` | Number of MAD genes used to build classifier |
| `--alphas` | `0.1,0.15,0.2,0.5,0.8,1` | The alpha grid to search over in parameter sweep |
| `--l1_ratios` | `0,0.1,0.15,0.18,0.2,0.3` | The l1 ratio grid to search over in parameter sweep |
| `--alt_genes` | `None` | Alternative genes to test classifier performance |
| `--alt_diseases` | `Auto` | Alternative diseases to test classifier performance |
| `--alt_filter_count` | `15` | Filtering used for alternative disease classification |
| `--alt_filter_prop` | `0.05` | Filtering used for alternative disease classification |
| `--alt_folder` | `Auto` | Location to save all classifier figures |
| `--remove_hyper` | `False` | Decision to remove hyper mutated tumors |
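
The flag table above could be mirrored by a parser along the following lines. This is a minimal sketch, assuming the script uses `argparse`, treats the boolean flags as switches, and splits comma-separated values into lists. The helper `comma_list`, the function `build_parser`, and the choice of types are hypothetical and are not taken from `scripts/pancancer_classifier.py`; only the flag names and defaults come from the table, and only a subset of flags is shown.

```python
import argparse


def comma_list(cast):
    """Return a converter that splits 'a,b,c' and casts each item."""
    def parse(text):
        return [cast(item) for item in text.split(",") if item]
    return parse


def build_parser():
    # Hypothetical parser: flag names and defaults are copied from the README
    # table; the types and comma-splitting behaviour are assumptions.
    parser = argparse.ArgumentParser(description="Pan-cancer classifier (sketch)")
    parser.add_argument("--genes", required=True, type=comma_list(str))
    # argparse applies `type` to string defaults, so "Auto" becomes ["Auto"].
    parser.add_argument("--diseases", default="Auto", type=comma_list(str))
    parser.add_argument("--folds", default=5, type=int)
    parser.add_argument("--drop", action="store_true")
    parser.add_argument("--copy_number", action="store_true")
    parser.add_argument("--remove_hyper", action="store_true")
    parser.add_argument("--num_features", default=8000, type=int)
    parser.add_argument("--alphas", default="0.1,0.15,0.2,0.5,0.8,1",
                        type=comma_list(float))
    parser.add_argument("--l1_ratios", default="0,0.1,0.15,0.18,0.2,0.3",
                        type=comma_list(float))
    return parser


# Parse an explicit argument list instead of reading sys.argv.
example = build_parser().parse_args(
    ["--genes", "TP53", "--drop", "--alphas", "0.1,0.2"]
)
print(example.genes, example.folds, example.alphas)
```

Passing the grids as comma-separated strings keeps shell invocations short, at the cost of a small conversion step like the one sketched here.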

1 change: 0 additions & 1 deletion scripts/pancancer_classifier.py
@@ -1,6 +1,5 @@
"""
Gregory Way 2017
Heavily modified from https://github.com/cognoma/machine-learning/
PanCancer Classifier
scripts/pancancer_classifier.py
29 changes: 29 additions & 0 deletions scripts/tp53_ddr_figures.sh
@@ -0,0 +1,29 @@
#!/bin/bash

# Pipeline to reproduce figures for the DNA Damage Repair manuscript
#
# Usage: bash scripts/tp53_ddr_figures.sh
#
# Output: summarizes the results of the TP53 classifier and outputs
# several figures and tables

tp53_dir='classifiers/TP53'

# 1. Apply PanCan classifier to all samples and output scores for each sample
python scripts/apply_weights.py --classifier $tp53_dir --copy_number

# 2. Summarize and visualize performance of classifiers
python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
Rscript --vanilla scripts/ddr_summary_figures.R
Rscript --vanilla scripts/compare_within_models.R \
--within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir

# 3. Perform Snaptron analysis
# NOTE: Snaptron setup must be performed first. See `pancancer/scripts/snaptron/`
bash dna_damage_repair_tp53exon.sh

# 4. Perform copy number burden analysis
python scripts/copy_burden_merge.py --classifier_folder $tp53_dir
Rscript --vanilla scripts/copy_burden_figures.R

40 changes: 19 additions & 21 deletions tp53_analysis.sh
@@ -4,9 +4,8 @@
#
# Usage: bash tp53_analysis.sh
#
# Output: will run all specified classifiers which will output performance plots
# and summarize how a machine learning classifier can detect aberrant
# TP53 activity RNAseq, copy number, and gene expression.
# Output: Will train a pan cancer model to detect TP53 aberration. Will also
# train a unique classifier within each specific cancer type

# Set Constants
tp53_diseases='BLCA,BRCA,CESC,COAD,ESCA,GBM,HNSC,KICH,LGG,LIHC,LUAD,LUSC,'\
@@ -15,24 +14,23 @@ alphas='0.1,0.13,0.15,0.18,0.2,0.3,0.4,0.6,0.7'
l1_mixing='0.1,0.125,0.15,0.2,0.25,0.3,0.35'
tp53_dir='classifiers/TP53'

# 1. PanCancer TP53 classification
python scripts/pancancer_classifier.py --genes 'TP53' --diseases $tp53_diseases \
--drop --copy_number --remove_hyper --alt_folder $tp53_dir \
--alphas $alphas --l1_ratios $l1_mixing
# Pan Cancer TP53 classification
python scripts/pancancer_classifier.py \
--genes 'TP53' \
--diseases $tp53_diseases \
--drop \
--copy_number \
--remove_hyper \
--alt_folder $tp53_dir \
--alphas $alphas \
--l1_ratios $l1_mixing

# 2. Within disease type TP53 classification
python scripts/within_tissue_analysis.py --genes 'TP53' \
--diseases $tp53_diseases --remove_hyper \
# Within Disease type TP53 classification
python scripts/within_tissue_analysis.py \
--genes 'TP53' \
--diseases $tp53_diseases \
--remove_hyper \
--alt_folder $tp53_dir'/within_disease' \
--alphas $alphas --l1_ratios $l1_mixing

# 3. Apply PanCan classifier to all samples and output scores for each sample
python scripts/apply_weights.py --classifier $tp53_dir --copy_number

# 4. Summarize and visualize performance of classifiers
python scripts/visualize_decisions.py --scores $tp53_dir --custom 'TP53_loss'
python scripts/map_mutation_class.py --scores $tp53_dir --genes 'TP53'
Rscript --vanilla scripts/ddr_summary_figures.R
Rscript --vanilla scripts/compare_within_models.R \
--within_dir $tp53_dir'/within_disease' --pancan_summary $tp53_dir
--alphas $alphas \
--l1_ratios $l1_mixing
