Merge pull request #21 from databio/dev

Release 0.4

nsheff authored Jul 21, 2017
2 parents 7a41ca3 + a426e17 commit 43cdb39
Showing 17 changed files with 1,463 additions and 514 deletions.
10 changes: 9 additions & 1 deletion .gitignore
@@ -1,2 +1,10 @@
# General
*.pyc
.~lock*

# JetBrains
.idea/

# Tests
.cache/

19 changes: 17 additions & 2 deletions CHANGELOG.md
@@ -1,11 +1,26 @@
# Change log
All notable changes to this project will be documented in this file.

## [0.4.0] -- 2017-07-21

### Added
- Added [fseq](https://github.com/aboyle/F-seq) as a peak caller option
- Peak caller is specified by a command line argument (defaults to macs2)
- Count of called peaks is now reported as a pipeline result.
- Added R and ggplot2 as requirements

### Changed
- Changed TSS plotting to use R instead of python
- TSS plot failures no longer fail the pipeline.
- Changed `Read_type` to `read_type` to prevent duplicate columns
- Read trimmer is now specified in option + argument style rather than as a flag.

## [0.3.0] -- 2017-06-22

### Added
- Added exact cuts calculation
- Added command-line version display
- Added skewer as a trimmer option
- Uses looper 'implied columns' (from looper v0.6) to derive multiple variables from organism value

### Changed
@@ -32,4 +47,4 @@ All notable changes to this project will be documented in this file.

## [0.1.0]
### Added
- First release of ATAC-seq pypiper pipeline
47 changes: 39 additions & 8 deletions README.md
@@ -16,26 +16,38 @@ These features are explained in more detail later in this README.

## Installing

### Prerequisites

**Python packages**. This pipeline uses [pypiper](https://github.com/epigen/pypiper) to run a single sample, [looper](https://github.com/epigen/looper) to handle multi-sample projects (for either local or cluster computation), and [pararead](https://github.com/databio/pararead) for parallel processing of sequence reads. You can do a user-specific install of these like this:

```
pip install --user https://github.com/epigen/pypiper/zipball/master
pip install --user https://github.com/epigen/looper/zipball/master
pip install --user https://github.com/databio/pararead/zipball/master
```
**R packages**. This pipeline uses R to generate QC metric plots. These are **optional** and if you don't install these R packages (or R in general), the pipeline will still work, but you will not get the QC plot outputs.

The following packages are used by the QC scripts:
- ggplot2
- gplots (v3.0.1)
- reshape2 (v1.4.2)

You can install these packages like this:
```
R # start R
install.packages(c("ggplot2", "gplots", "reshape2"))
```
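If you prefer not to start an interactive R session, the same install can be done non-interactively from the shell. This sketch builds the equivalent one-liner (the CRAN mirror URL is an illustrative choice, and `Rscript` on your `PATH` is assumed); it prints the command rather than running it, so remove the `echo` to actually install:

```shell
# Non-interactive equivalent of the interactive install above.
# The repos URL is an assumption; any CRAN mirror works.
r_install_cmd='Rscript -e '\''install.packages(c("ggplot2", "gplots", "reshape2"), repos = "https://cloud.r-project.org")'\'''
# Printed rather than executed so this sketch works even without R installed.
echo "$r_install_cmd"
```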

**Required executables**. You will need some common bioinformatics tools installed. The list is specified in the `tools` section of the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)).

**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You may [download pre-indexed references](http://cloud.databio.org/refgenomes) or you may index your own (see [refgenie](https://github.com/databio/refgenie) instructions). Any prealignments you want to do will also require refgenie assemblies. Some common examples are provided by [ref_decoy](https://github.com/databio/ref_decoy).

### Configuring the pipeline

**Clone the pipeline**. Clone this repository using one of these methods:
- using SSH: `git clone [email protected]:databio/ATACseq.git`
- using HTTPS: `git clone https://github.com/databio/ATACseq.git`


There are two configuration options: you can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, you may use the pre-made config file and project template described on the [Chang lab configuration](examples/chang_project) page. For others, choose one:

**Option 1: Default configuration** (recommended; [pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)).
@@ -56,14 +68,15 @@ There are two configuration options: You can either set up environment variables

**Option 2: Custom configuration**. Instead, you can also put absolute paths to each tool or resource in the configuration file to fit your local setup. Just change the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) appropriately.
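For example, the tool entries might be pointed at absolute paths like this (a hypothetical excerpt; the actual keys and layout are whatever [pipelines/ATACseq.yaml](pipelines/ATACseq.yaml) already contains):

```
tools:
  samtools: /usr/local/bin/samtools
  bowtie2: /opt/bowtie2/bowtie2
  macs2: /home/user/.local/bin/macs2
```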

## Running the pipeline

You have two options for running the pipeline.

### Option 1: Running the pipeline script directly

For usage, see [usage.txt](usage.txt), which you can get on the command line by running `pipelines/ATACseq.py --help`. You just need to pass a few command-line parameters to specify sample name, reference genome, input files, etc. See the example command in [cmd.sh](cmd.sh), which uses test data.
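As an illustration, a single-sample invocation might look like the following. This is a hedged sketch: the flag names are assumed from pypiper's standard command-line arguments and the paths are placeholders, so confirm both against `pipelines/ATACseq.py --help` and [cmd.sh](cmd.sh) before use:

```shell
# Assemble an illustrative single-sample command; flags and paths are
# placeholders, not taken verbatim from the pipeline's argument parser.
pipeline_cmd="python pipelines/ATACseq.py \
  --sample-name test1 \
  --genome hg38 \
  --input /path/to/test1_R1.fastq.gz \
  --single-or-paired single \
  --output-parent output"
# Echoed rather than executed so the sketch stands alone.
echo "$pipeline_cmd"
```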

To run on multiple samples, you can just write a loop to process each sample independently with the pipeline, or you can use *option 2*...
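Such a loop could be sketched like this (sample names, flags, and paths are illustrative assumptions, not the pipeline's actual interface):

```shell
# Loop over a few hypothetical samples, building one pipeline command each.
# Commands are echoed, not executed; drop the `echo` to actually run them.
count=0
for sample in sample1 sample2 sample3; do
  echo "python pipelines/ATACseq.py --sample-name ${sample}" \
       "--genome hg38 --input /path/to/${sample}.fastq.gz"
  count=$((count + 1))
done
```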

### Option 2: Running the pipeline through looper

@@ -129,6 +142,24 @@ grep "level 1" ${GENOME}.gtf | grep "gene" | awk '{if($7=="+"){print $1"\t"$4"\
```

### Optional summary plots

1. Run `looper summarize` to generate a summary table in tab-separated values (TSV) format

```
looper summarize examples/test_project/test_config.yaml
```

2. Run `ATAC_Looper_Summary_plot.R` to produce summary plots.

You must pass the full path to the TSV file produced by `looper summarize`.
```
Rscript ATAC_Looper_Summary_plot.R </path/to/looper/summarize/summary.TSV>
```

This outputs multiple PDF plots into the directory containing the input TSV file.


## Using a cluster

Once you've specified your project to work with this pipeline, you will also inherit all the power of looper for your project. You can submit these jobs to a cluster with a simple change to your configuration file. Follow instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html).
2 changes: 1 addition & 1 deletion config/pipeline_interface.yaml
@@ -29,4 +29,4 @@ ATACseq.py:
file_size: "6"
cores: "8"
mem: "32000"
time: "3-00:00:00"
2 changes: 1 addition & 1 deletion config/protocol_mappings.yaml
@@ -1,2 +1,2 @@
ATAC: ATACseq.py
ATAC-SEQ: ATACseq.py
2 changes: 1 addition & 1 deletion pipeline_interface.yaml
@@ -30,7 +30,7 @@ pipelines:
file_size: "0.001"
cores: "1"
mem: "4000"
time: "00:40:00"
pico:
file_size: "0.05"
cores: "1"
