Merge pull request #2 from databio/dev
Development changes into master
Showing 20 changed files with 522 additions and 181 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
*.pyc | ||
.~lock* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Change log | ||
All notable changes to this project will be documented in this file. | ||
|
||
## [0.2.0] | ||
### Added | ||
- FRiP can now be calculated based on reference peaks | ||
- Pipeline now reports Picard estimated library size statistic | ||
- Added option for pyadapt trimming | ||
- Added example project using 'gold standard' data | ||
- Added new resource package grades | ||
- Added preliminary 'exact cuts' scripts, but they are not yet used | ||
|
||
### Changed | ||
- Improved README | ||
- Changed filename of the TSS file | ||
- Reorganized structure of alignment code | ||
|
||
## [0.1.0] | ||
### Added | ||
- First release of ATAC-seq pypiper pipeline |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,22 @@ | |
|
||
This repository contains a pipeline to process ATAC-seq data. It does adapter trimming, mapping, peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs. | ||
|
||
## Pipeline features outlined | ||
|
||
**Decoy alignments.** Before aligning to the genome, we first align to decoy sequences. This has several advantages: it speeds up the process dramatically, reduces noise from erroneous alignments, and provides potential to analyze signal at repeats. The pipeline will align *sequentially* to these decoy sequences (if provided): | ||
|
||
- chrM (doubled; for non-circular aligners, to draw away reads from NuMTs) | ||
- Alu elements | ||
- alpha satellites | ||
- rDNA | ||
- repbase | ||
|
||
We have provided indexed assemblies for download for each of these **for human** in the [ref_decoy](https://github.com/databio/ref_decoy) repository (excluding repbase, which is not publicly available). Any assemblies not provided are skipped. | ||
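The sequential decoy strategy described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual alignment code: `align_sequentially`, the `aligns` callback, and the toy read names are all hypothetical stand-ins for a real aligner call. It shows the two behaviors the text describes: reads captured by an earlier decoy never reach a later one, and decoys that are not provided are skipped.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple


def align_sequentially(
    reads: Iterable[str],
    decoy_order: Sequence[str],
    available: Set[str],
    aligns: Callable[[str, str], bool],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Pass reads through each decoy in order (hypothetical sketch).

    `aligns(read, decoy)` stands in for a real aligner and returns True
    when the read maps to that decoy. Reads that map are set aside;
    only the unmapped reads continue to the next decoy.
    """
    captured: Dict[str, List[str]] = {}
    remaining = list(reads)
    for decoy in decoy_order:
        if decoy not in available:
            # Assemblies that were not provided are skipped.
            continue
        hits: List[str] = []
        misses: List[str] = []
        for read in remaining:
            (hits if aligns(read, decoy) else misses).append(read)
        captured[decoy] = hits
        remaining = misses
    # `remaining` is what would go on to the genome alignment.
    return captured, remaining
```

Because each decoy only sees the reads left over from the previous one, the order of the list matters: a read that could map to both chrM and an Alu element is attributed to whichever comes first.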
|
||
**Fraction of reads in peaks (FRIP).** By default, the pipeline will calculate the FRIP as a quality control, using the peaks it identifies internally. If you want, it will **additionally** calculate a FRIP using a reference set of peaks (for example, from another experiment). For this you must provide a reference peak set (as a bed file) to the pipeline. You can do this by adding a column named `FRIP_ref` to your annotation sheet (see [pipeline_interface.yaml](/config/pipeline_interface.yaml)). Specify the reference peak filename (or use a derived column and specify the path in the project config file `data_sources` section). | ||
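To make the FRIP definition concrete, here is a minimal Python sketch. It is an assumption-laden toy rather than the pipeline's implementation: the function name `frip`, the representation of each read as a single `(chrom, position)` pair, and the half-open BED-style peak intervals are all choices made for illustration.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, int]        # (chromosome, position); a simplification
Peak = Tuple[str, int, int]   # (chromosome, start, end), half-open like BED


def frip(reads: Iterable[Read], peaks: List[Peak]) -> float:
    """Return the fraction of reads whose position falls inside any peak."""
    total = 0
    in_peaks = 0
    for chrom, pos in reads:
        total += 1
        if any(c == chrom and start <= pos < end for c, start, end in peaks):
            in_peaks += 1
    # An empty read set has no meaningful fraction; report 0.0 here.
    return in_peaks / total if total else 0.0
```

The same read set can be scored twice, once against the peaks called from the sample itself and once against a reference peak set, which is the distinction the paragraph above draws. The linear scan over peaks is fine for a toy example; real data would call for an interval index.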
|
||
|
||
|
||
## Installing | ||
|
||
**Prerequisites**. This pipeline uses [pypiper](https://github.com/epigen/pypiper) to run a pipeline for a single sample, and [looper](https://github.com/epigen/looper) to handle multi-sample projects (for either local or cluster computation). You can do a user-specific install of both like this: | ||
|
@@ -18,13 +34,14 @@ export PATH=$PATH:~/.local/bin | |
|
||
**Required executables**. To run the pipeline, you will also need some common bioinformatics tools installed. The list is specified in the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) tools section. | ||
|
||
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). The pipeline aligns serially to decoy sequences if you have them set up, which greatly improves pipeline performance. You can set up the decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy). | ||
**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You can set up the (optional) decoy sequences using [ref_decoy](https://github.com/databio/ref_decoy). | ||
|
||
**Clone the pipeline**. Then, clone this repository using one of these methods: | ||
- using SSH: `git clone git@github.com:databio/ATACseq.git`
- using HTTPS: `git clone https://github.com/databio/ATACseq.git` | ||
|
||
## Configuring | ||
|
||
You can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, there is a pre-made config file and project template. Follow the instructions on the [Chang lab configuration](examples/chang_project) page. | ||
|
||
Option 1: **Default configuration** ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)). | ||
|
@@ -68,6 +85,29 @@ Your annotation file must specify these columns: | |
|
||
Run your project as above, by passing your project config file to `looper run`. More detailed instructions and advanced options for how to define your project are in the [Looper documentation on defining a project](http://looper.readthedocs.io/en/latest/define-your-project.html). Of particular interest may be the section on [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns). | ||
|
||
## TSS enrichments | ||
|
||
In order to calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. Here's code to generate that. | ||
|
||
From refGene: | ||
|
||
``` | ||
# Provide genome string and gene file | ||
GENOME="hg38" | ||
URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz" | ||
wget -O ${GENOME}_TSS_full.txt.gz ${URL} | ||
zcat ${GENOME}_TSS_full.txt.gz | awk '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' | LC_COLLATE=C sort -k1,1 -k2,2n -u > ${GENOME}_TSS.tsv | ||
echo ${GENOME}_TSS.tsv | ||
``` | ||
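For readers who prefer Python to awk, the refGene one-liner above can be approximated as follows. This is a sketch, not part of the repository: `refgene_to_tss` is a hypothetical name, and it assumes the tab-delimited refGene layout the awk command relies on (chromosome in column 3, strand in column 4, transcript start and end in columns 5 and 6, gene name in column 13). Like `sort -u` with those keys, it keeps one row per chromosome and position.

```python
from typing import Dict, Iterable, List, Tuple

TssRow = Tuple[str, int, int, str, str]  # chrom, tss, tss, strand, gene name


def refgene_to_tss(lines: Iterable[str]) -> List[TssRow]:
    """Extract one TSS row per (chromosome, position) from refGene lines."""
    first_seen: Dict[Tuple[str, int], TssRow] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip malformed or truncated rows
        chrom, strand = fields[2], fields[3]
        # Plus-strand genes start at txStart, minus-strand genes at txEnd.
        tss = int(fields[4]) if strand == "+" else int(fields[5])
        first_seen.setdefault((chrom, tss), (chrom, tss, tss, strand, fields[12]))
    # Sort by chromosome name (code-point order), then numerically by position.
    return [first_seen[key] for key in sorted(first_seen)]
```

The function accepts any iterable of text lines, so a file opened with `gzip.open(path, "rt")` can be passed in directly.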
|
||
Another option from Gencode GTF: | ||
|
||
``` | ||
grep "level 1" ${GENOME}.gtf | grep "gene" | awk '{if($7=="+"){print $1"\t"$4"\t"$4"\t"$7}else{print $1"\t"$5"\t"$5"\t"$7}}' | LC_COLLATE=C sort -u -k1,1V -k2,2n > ${GENOME}_TSS.tsv | ||
``` | ||
|
||
## Using a cluster | ||
|
||
Once you've specified your project to work with this pipeline, you will also inherit all the power of looper for your project. You can submit these jobs to a cluster with a simple change to your configuration file. Follow instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html). | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
# using pre-fix of fastq file | ||
#python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G mm9 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz | ||
python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz | ||
# using pre-fix of fastq file | ||
#python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G mm9 -Q paired -C ATACseq.yaml -gs mm -I test_data/liver-CD31_test_R1.fastq.gz -I2 test_data/liver-CD31_test_R2.fastq.gz | ||
python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml -gs mm -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
ATAC: ATACseq.py | ||
ATAC: ATACseq.py | ||
ATAC-SEQ: ATACseq.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
|
||
# Gold ATAC | ||
|
||
Testing ATAC-seq pipeline on gold standard public ATAC-seq data. | ||
|
||
## Grab data, project setup | ||
|
||
Download raw `fastq.gz` files (use `fastq-dump` from SRA). You may also use `get_geo.py` to download raw ATAC-seq reads from SRA and metadata from GEO: | ||
|
||
``` | ||
python get_geo.py -i ~/code/ATACseq/examples/gold_atac/metadata/gold_atac_gse.csv -r --fastq | ||
``` | ||
|
||
I used the resulting file [metadata/annocomb_gold_atac_gse.csv](metadata/annocomb_gold_atac_gse.csv) to create the looper metadata sheet, [metadata/gold_atac_annotation.csv](metadata/gold_atac_annotation.csv). | ||
|
||
I created a project config file and sampled test data. The SRA fastq files should be stored in a folder `${SRAFQ}`; the project will then run with looper with no additional changes. | ||
|
||
## Run pipeline | ||
|
||
``` | ||
looper run ${CODE}ATACseq/examples/gold_atac/metadata/project_config.yaml -d | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
sample_name,Sample_title,Sample_source_name_ch1,organism,Sample_organism_ch1,library,Sample_library_selection,Sample_library_strategy,data_source,Sample_type,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model | ||
ATAC-seq_from_dendritic_cell_(ENCLB065VMV),ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000 | ||
ATAC-seq_from_dendritic_cell_(ENCLB811FLK),ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000 | ||
ATAC-seq_from_dendritic_cell_(ENCLB887PKE),ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500 | ||
ATAC-seq_from_dendritic_cell_(ENCLB586KIS),ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000 | ||
ATAC-seq_from_dendritic_cell_(ENCLB384NOX),ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,Homo sapiens,Homo sapiens,,other,ATAC-seq,SRA,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
sample_name,sample_description,treatment_description,organism,library,data_source,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model,read1,read2 | ||
test1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,TEST_1,TEST_2 | ||
gold1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 | ||
gold2,ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 | ||
gold3,ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500,SRA_1,SRA_2 | ||
gold4,ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 | ||
gold5,ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
GSE94182 | ||
GSE94222 | ||
GSE94177 | ||
GSE94196 | ||
GSE94173 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# This project config file describes your project. See looper docs for details. | ||
|
||
metadata: # relative paths are relative to this config file | ||
  sample_annotation: gold_atac_annotation.csv # sheet listing all samples in the project | ||
  output_dir: ${PROCESSED}gold_atac # ABSOLUTE PATH to the parent, shared space where project results go | ||
  pipelines_dir: "${CODEBASE}ATACseq" # ABSOLUTE PATH to the directory where looper will find the pipeline repository | ||
|
||
# in your sample_annotation, columns with these names will be populated as described | ||
# in the data_sources section below | ||
derived_columns: [read1, read2] | ||
|
||
data_sources: # This section describes paths to your data | ||
  # specify the ABSOLUTE PATH of input files using variable path expressions | ||
  # These keys then correspond to values in your sample annotation columns. | ||
  # Variables specified using brackets are populated from sample_annotation columns. | ||
  # Variable syntax: {column_name}. For example, use {sample_name} to populate | ||
  # the file name with the value in the sample_name column for each sample. | ||
  # example_data_source: "/path/to/data/{sample_name}_R1.fastq.gz" | ||
  SRA: "${SRABAM}{SRR}.bam" | ||
  SRA_1: "${SRAFQ}{SRR}_1.fastq.gz" | ||
  SRA_2: "${SRAFQ}{SRR}_2.fastq.gz" | ||
  TEST_1: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r1.fastq.gz" | ||
  TEST_2: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r2.fastq.gz" | ||
|
||
genomes: | ||
  human: hg38 | ||
  mouse: mm10 |
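The way the `derived_columns` and `data_sources` entries in this config combine with the annotation sheet can be illustrated with a short Python sketch. This is not looper's implementation; `resolve_derived` is a hypothetical helper that only mimics the behavior the config comments describe: the value in a derived column (for example `SRA_1`) selects a template, `${VAR}` is filled from the environment, and `{column_name}` is filled from the sample's row.

```python
import os
from typing import Dict


def resolve_derived(sample: Dict[str, str], column: str,
                    data_sources: Dict[str, str]) -> str:
    """Expand a derived column into a concrete path (hypothetical sketch)."""
    source_key = sample[column]          # e.g. "SRA_1"
    template = data_sources[source_key]  # e.g. "${SRAFQ}{SRR}_1.fastq.gz"
    # Expand environment variables first: "${SRAFQ}" contains braces, so
    # calling str.format first would misread it as a column placeholder.
    # An unset variable is left as-is by expandvars and then makes
    # str.format raise KeyError, which surfaces the missing setting.
    expanded = os.path.expandvars(template)
    return expanded.format(**sample)
```

With `SRAFQ=/data/fq/` set in the environment, the `gold1` row (`SRR5210416`, `read1` set to `SRA_1`) would resolve `read1` to `/data/fq/SRR5210416_1.fastq.gz`.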
Binary file not shown.
Binary file not shown.