diff --git a/CHANGELOG.md b/CHANGELOG.md index c22412c5..d71c6ab2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,7 +1,18 @@ # Change log All notable changes to this project will be documented in this file. -## [0.7.0] -- Unreleased +## [0.7.0] -- 2018-06-25 + +### Added +- Added containerization feature + - Run with either [docker](https://www.docker.com/) or [singularity](https://singularity.lbl.gov/) +- Added early bowtie2 index check + +### Changed +- Renamed pipeline +- Improved summary figure reporting +- Integrated summary results into pipeline interface + ## [0.6.1] -- 2017-12-15 diff --git a/Makefile b/Makefile index 645702fb..adb35dd5 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,13 @@ microtest: - python $$CODEBASE/ATACseq/pipelines/ATACseq.py -I $$MICROTEST/data/atacR1.fq.gz -I2 $$MICROTEST/data/atacR2.fq.gz -G hg19 -O $$HOME/scratch -S atac_test --single-or-paired paired -R + python $$CODEBASE/pepatac/pipelines/pepatac.py -I $$MICROTEST/data/atacR1.fq.gz -I2 $$MICROTEST/data/atacR2.fq.gz -G hg19 -O $$HOME/scratch -S atac_test --single-or-paired paired -R test: - python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C ATACseq.yaml --genome-size hs --prealignments rCRSd human_repeats -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz + python pipelines/pepatac.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C pepatac.yaml --genome-size hs --prealignments rCRSd human_repeats -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz changtest: - python pipelines/ATACseq.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C $HOME/code/ATACseq/examples/chang_project/ATACseq.yaml -gs mm -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz + python pipelines/pepatac.py -P 3 -M 100 -O test_out -R -S liver -G hg19 -Q paired -C $HOME/code/pepatac/examples/chang_project/pepatac.yaml -gs mm -I examples/test_data/liver-CD31_test_R1.fastq.gz -I2 examples/test_data/liver-CD31_test_R2.fastq.gz + + +docker: + docker build -t databio/pepatac -f containers/pepatac.Dockerfile . + +singularity: + singularity build $${SIMAGES}pepatac docker://databio/pepatac \ No newline at end of file diff --git a/README.md b/README.md index 7337aaac..c67ed055 100644 --- a/README.md +++ b/README.md @@ -1,102 +1,237 @@ -# ATACseq pipeline +# 1. PEPATAC pipeline overview -This repository contains a pipeline to process ATAC-seq data. It does adapter trimming, mapping, peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs. You can download the latest version from the [releases page](https://github.com/databio/ATACseq/releases) and a history of version changes is in the [CHANGELOG](CHANGELOG.md). +This repository contains a pipeline to process ATAC-seq data. It does adapter trimming, mapping, peak calling, and creates bigwig tracks, TSS enrichment files, and other outputs. You can download the latest version from the [releases page](https://github.com/databio/pepatac/releases), or find changes listed in the [CHANGELOG](CHANGELOG.md). -## Pipeline features at-a-glance +# 2. Pipeline features at-a-glance -These features are explained in more detail later in this README. +Here are a few of the highlights that make `pepatac` valuable (explained in more detail later in this README). -**Prealignments**. 
The pipeline can (optionally) first align to any number of reference assemblies separately before the primary genome alignment. This increases both speed and accuracy and can be used, for example, to align sequentially to mtDNA, repeats, or spike-ins. +**Beautiful HTML reports**. Your results are presented in an easy-to-navigate HTML report with a sample table, job status, summary statistics, and QC plots at your fingertips. + +**Scalability**. Run the pipeline easily on a project with a single sample or a thousand. This pipeline is compatible with [looper](https://github.com/pepkit/looper), so it can run locally, in a cloud container engine, or with *any* cluster resource manager (*e.g.* SLURM, SGE, or LSF). + +**Restartability**. The pipeline is built using [pypiper](https://github.com/databio/pypiper), so it automatically picks up where it left off in case of preemption or crash. + +**Copious logging**. The pipeline produces a detailed log file that records all output from every command run, along with the time and memory use of every process, the version of the pipeline and other software, and other useful run information. + +**Flexibility**. The pipeline provides options for multiple peak callers, multiple adapter trimmers, and fully configurable parameterization for many underlying tools. + +**Portability**. Run it using `docker` or `singularity` with no other prerequisites, or run it natively without containers. The choice is yours. -**Scalability**. This pipeline is built on [looper](https://github.com/pepkit/looper), so it can run locally or with any cluster resource manager. +**Standardized user interface**. The pipeline reads sample metadata formatted in [standard PEP format](http://pepkit.github.io), so you can use the same sample annotation sheets for your downstream R or Python analysis using tools from [pepkit](http://pepkit.github.io). + +**Standardized reference genome assembly**. The pipeline uses standard reference genome assemblies produced by [refgenie](http://github.com/databio/refgenie), which provides a *scripted* way to produce a compatible reference assembly for any custom genome. For common genomes, you can either download pre-indexed assemblies or build your own. **Fraction of reads in peaks (FRIP)**. By default, the pipeline will calculate the FRIP using the peaks it identifies. Optionally, it can **also** calculate a FRIP using a reference set of peaks (for example, from another experiment). **TSS enrichments**. The pipeline produces various nice QC plots. -## Installing +**Prealignments**. The pipeline can (optionally) first align to any number of reference assemblies separately before the primary genome alignment. This increases both speed and accuracy and can be used, for example, to align sequentially to mtDNA, repeats, or spike-ins. + + +# 3. Installation + +## 3.1 Clone the pipeline + +First, **clone the pipeline**. Clone this repository using one of these methods: +- using SSH: `git clone git@github.com:databio/pepatac.git` +- using HTTPS: `git clone https://github.com/databio/pepatac.git` -### Prequisites +Next, specify a custom sequencing adapter file if desired (in [pipelines/pepatac.yaml](pipelines/pepatac.yaml)). -**Python packages**. This pipeline uses [pypiper](https://github.com/databio/pypiper) to run a single sample, [looper](https://github.com/pepkit/looper) to handle multi-sample projects (for either local or cluster computation), and [pararead](https://github.com/databio/pararead) for parallel processing sequence reads.
You can do a user-specific install of these like this: +Next, you have two options for installing the software prerequisites: 1) use a container, in which case you need only either `docker` or `singularity`; or 2) install all prerequisites natively. If you want to install it natively, skip to the [native installation instructions](#33-install-software-requirements-natively). +## 3.2 Use containers + +First, make sure your environment is set up to run either [docker](http://docker.com) or [singularity](https://singularity.lbl.gov/) containers. Then, pull the container image: + +### Docker + +You can pull the `docker` image from `dockerhub` (https://hub.docker.com/r/databio/pepatac/) like this: + +``` +docker pull databio/pepatac ``` + +Or build the image using the included [Dockerfile](containers/) (you can use a recipe in the included [Makefile](/Makefile)): + +``` +cd pepatac +make docker +``` + +### Singularity + +You can download the singularity image from http://big.databio.org/simages/pepatac or build it from the docker image following the recipe in the [Makefile](/Makefile): +``` +cd pepatac +make singularity +``` +Now you'll need to tell the pipeline where you saved the singularity image. You can either create an environment variable called `$SIMAGES` that points to the *folder where your image is stored*, or you can tweak the [pipeline_interface.yaml](pipeline_interface.yaml) file so that the `compute.singularity_image` attribute points to the right location on disk. + +## 3.3 Install software requirements natively + +*Note: you only need to install these prerequisites if you are not using a container*. + +First, we'll need to install all the prerequisites: + +**Python packages**. This pipeline uses [pypiper](https://github.com/databio/pypiper) to run a single sample, [looper](https://github.com/pepkit/looper) to handle multi-sample projects (for either local or cluster computation), and [pararead](https://github.com/databio/pararead) for parallel processing of sequence reads. For peak calling, the pipeline uses [MACS2](https://github.com/taoliu/MACS/) as the default. You can do a user-specific install of these like this: + +``` -pip install --user https://github.com/databio/pypiper/zipball/master +pip install --user piper pip install --user https://github.com/pepkit/looper/zipball/master pip install --user pararead +pip install --user MACS2 ``` + +**R packages**. This pipeline uses R to generate QC metric plots. These are **optional** and if you don't install these R packages (or R in general), the pipeline will still work, but you will not get the QC plot outputs. The following packages are used by the QC scripts: -- ggplot2 +- argparser (v0.4) +- data.table (v1.11.2) +- ggplot2 (v2.2.1) - gplots (v3.0.1) -- reshape2 (v1.4.2) +- gtable (v0.2.0) +- pepr (v0.0.2) +- scales (v0.5.0) +- stringr (v1.3.1) You can install these packages like this: ``` R # start R -install.packages(c("ggplot2", "gplots", "reshape2")) +install.packages(c("argparser", "data.table", "ggplot2", "gplots", "gtable", "scales", "stringr")) +install.packages('devtools') # needed for install_github below +devtools::install_github('pepkit/pepr') ``` -**Required executables**. You will need some common bioinformatics tools installed. The list is specified in the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) tools section. +**Required executables**. You will need some common bioinformatics tools installed. The complete list (including optional tools) is specified in the `tools` section of the pipeline configuration file ([pipelines/pepatac.yaml](pipelines/pepatac.yaml)).
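Before running natively, it can save some frustration to confirm that the required executables are actually visible on your `PATH`. A minimal shell check (the tool list here is an assumption based on the list that follows; reconcile it with the `tools` section of your `pepatac.yaml`):

```
# Print any required tool that cannot be found on the PATH
for tool in bedtools bowtie2 fastqc samtools skewer bedGraphToBigWig wigToBigWig bigWigCat bedSort; do
  command -v "$tool" > /dev/null || echo "missing: $tool"
done
```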
-**Genome resources**. This pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie). You may [download pre-indexed references](http://cloud.databio.org/refgenomes) or you may index your own (see [refgenie](https://github.com/databio/refgenie) instructions). Any prealignments you want to do use will also require refgenie assemblies. Some common examples are provided by [ref_decoy](https://github.com/databio/ref_decoy). +The following tools are used by the pipeline: +- [bedtools](http://bedtools.readthedocs.io/en/latest/) (v2.25.0+) +- [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) (v2.2.9+) +- [fastqc](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (v0.11.5+) +- [picard](https://broadinstitute.github.io/picard/) (v2.17.4+) +- [samtools](http://www.htslib.org/) (v1.7) +- [skewer](https://github.com/relipmoc/skewer) (v0.1.126+) +- UCSC tools (v3.5.1) + - [bedGraphToBigWig](https://www.encodeproject.org/software/bedgraphtobigwig/) (v4) + - [wigToBigWig](https://www.encodeproject.org/software/wigtobigwig/) (v4) + - [bigWigCat](http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) (v4) + - [bedSort](http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/) -### Configuring the pipeline - -**Clone the pipeline**. Clone this repository using one of these methods: -- using SSH: `git clone git@github.com:databio/ATACseq.git` -- using HTTPS: `git clone https://github.com/databio/ATACseq.git` +With the software installed natively, we next need to configure the pipeline. There are two configuration options: you can either set up environment variables to fit the default configuration, or change the configuration file to fit your environment. For the Chang lab, you may use the pre-made config file and project template described on the [Chang lab configuration](examples/chang_project) page. For others, choose one: -**Configuration option 1: Default configuration** (recommended; [pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)). - - Make sure the executable tools (java, samtools, bowtie2, etc.) are in your PATH. +### 3.3.1 Configuration option 1: Default configuration (recommended; [pipelines/pepatac.yaml](pipelines/pepatac.yaml)). - Make sure the executable tools (java, samtools, bowtie2, etc.) are in your PATH (unless using a container). - Set up environment variables to point to `jar` files for the java tools (`picard` and `trimmomatic`). ``` export PICARD="/path/to/picard.jar" export TRIMMOMATIC="/path/to/trimmomatic.jar" ``` - - - Define environment variable `GENOMES` for refgenie genomes. + + - Specify a custom sequencing adapter file if desired (in [pipelines/pepatac.yaml](pipelines/pepatac.yaml)). + + +### 3.3.2 Configuration option 2: Custom configuration. + +Alternatively, you can put absolute paths to each tool or resource in the configuration file to fit your local setup. Just change the pipeline configuration file ([pipelines/pepatac.yaml](pipelines/pepatac.yaml)) appropriately. + +# 4. Configuring reference genome assemblies + +## 4.1 Downloading refgenie assemblies + +Whether using the container or native version, you will need to provide external reference genome assemblies. The pipeline requires genome assemblies produced by [refgenie](https://github.com/databio/refgenie).
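For example, grabbing a pre-indexed assembly might look like the sketch below (the archive name `hg38.tgz` is an assumption; check the actual listing at http://big.databio.org/refgenomes, and see section 4.2 for the `$GENOMES` folder):

```
# Hypothetical download of a pre-indexed refgenie assembly into your genomes folder
wget http://big.databio.org/refgenomes/hg38.tgz   # archive name is an assumption
mkdir -p "$GENOMES"
tar -xf hg38.tgz -C "$GENOMES"
```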
One feature of the pipeline is *prealignments*, which siphon off reads by aligning to small genomes before the main alignment to the primary reference genome. Any prealignments you want to use will also require refgenie assemblies. By default, the pipeline will pre-align to the mitochondrial genome, so if you want to use the default settings, you will need refgenie assemblies for the `rCRSd` genome (for human) or `mouse_chrM` (for mouse) in addition to the primary assembly you wish to use. Other ideas for common prealignment references are provided by [ref_decoy](https://github.com/databio/ref_decoy). + +You may [download pre-indexed references](http://big.databio.org/refgenomes) or you may index your own (see [refgenie](https://github.com/databio/refgenie) instructions). + + +## 4.2 Configuring the pipeline to use refgenie assemblies + +Once you've procured assemblies for all genomes you wish to use, you must point the pipeline to where you store these. You can do this in one of two ways: 1) with an environment variable, or 2) by adjusting a configuration option. + +The pipeline looks for genomes stored in a folder specified by the `resources.genomes` attribute in the [pipeline config file](pipelines/pepatac.yaml). By default, this points to the shell variable `GENOMES`, so all you have to do is set an environment variable to the location of your [refgenie](https://github.com/databio/refgenie) genomes: + ``` export GENOMES="/path/to/genomes/folder/" ``` - - - Specify custom sequencing adapter file if desired (in [pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)). + (Add this to your `.bashrc` or `.profile` to ensure it persists). + +Alternatively, you can skip the `GENOMES` variable and simply change the value of that configuration option to point to the folder where you stored the assemblies. The advantage of using an environment variable is that it makes the configuration file portable: the same pipeline can be run anywhere, since the location of the reference assemblies is not hard-coded to a specific computing environment. -**Configuration option 2: Custom configuration**. Instead, you can also put absolute paths to each tool or resource in the configuration file to fit your local setup. Just change the pipeline configuration file ([pipelines/ATACseq.yaml](pipelines/ATACseq.yaml)) appropriately. -## Usage +# 5. Running the pipeline -You have two options for running the pipeline. +The pipeline can be run directly from the command line for a single sample. If you need to run it on many samples, you could write your own sample handling code, but we have pre-configured everything to work nicely with [looper](http://looper.readthedocs.io), our sample handling engine. This section explains how to do both. +## 5.1 Running the pipeline script directly (without looper) +The pipeline is, at its core, just a python script, and you can run it on the command line for a single sample. To see the command-line options for usage, see [usage.txt](usage.txt), which you can get on the command line by running `pipelines/pepatac.py --help`. You just need to pass a few command-line parameters to specify sample name, reference genome, input files, etc. See [example commands](example_cmd.txt) that use test data. -### Run option 1: Running the pipeline script directly +Here's the basic command to run the small test example through the pipeline: -To see the command-line options for usage, see [usage.txt](usage.txt), which you can get on the command line by running `pipelines/ATACseq.py --help`. You just need to pass a few command-line parameters to specify sample_name, reference genome, input files, etc. See [example commands](example_cmd.txt) that use test data.
+``` +pipelines/pepatac.py --single-or-paired paired --genome hg38 --sample-name test1 --input examples/test_data/test1_r1.fastq.gz --input2 examples/test_data/test1_r2.fastq.gz --genome-size hs -O $HOME/test +``` + +This small example takes about 15 minutes to run to completion. -To run on multiple samples, you can just write a loop to process each sample independently with the pipeline, or you can use *option 2*... -### Run option 2: Running the pipeline through looper +## 5.1.1 Running the pipeline directly in a container -[Looper](http://looper.readthedocs.io/) is a pipeline submission engine that makes it easy to deploy this pipeline across samples. To use it, you will need to tell looper about your project. +A full tutorial on using containers is outside the scope of this readme, but here are the basics. Individual jobs can be run in a container by simply running the `pepatac.py` command through `docker run` or `singularity exec`. You can run containers either on your local computer, or in an HPC environment, as long as you have `docker` or `singularity` installed. For example, run it locally in singularity like this: -Start by running the example project in the [examples/test_project](examples/test_project) folder. This command runs the pipeline across all samples in the test project: ``` -cd ATACseq -looper run examples/test_project/test_config.yaml +singularity instance.start ${SIMAGES}pepatac atac_image +singularity exec instance://atac_image pipelines/pepatac.py +``` + +With docker, you can use: ``` +docker run --rm -it databio/pepatac pipelines/pepatac.py +``` + +Be sure to mount the volumes you need with `--volume`. + +To run on multiple samples, you can just write a loop to process each sample independently with the pipeline, or you can use...*looper*! -If the looper executable in not your `$PATH`, add the following line to your `.bashrc` or `.profile`: +## 5.2 Running the pipeline through looper + +This pipeline is pre-configured to work with `looper`. [Looper](http://looper.readthedocs.io/) is a pipeline submission engine that makes it easy to deploy any pipeline across samples. It will let you run the jobs locally, in containers, using any cluster resource manager, or in containers on a cluster. + +Install looper using `pip`: + +``` +pip install --user https://github.com/pepkit/looper/zipball/master +``` + +Start by running the example project in the [examples/test_project](examples/test_project) folder. Let's use the `-d` argument to do a *dry run*, which will create job scripts for every sample in the project, but will not execute them: + +``` +cd pepatac +looper run -d examples/test_project/test_config.yaml +``` + +If the looper executable is not in your `$PATH`, add the following line to your `.bashrc` or `.profile`: ``` export PATH=$PATH:~/.local/bin ``` + +If that worked, let's actually run the example by taking out the `-d` flag: + +``` +looper run examples/test_project/test_config.yaml +``` + +There are lots of other cool things you can do with looper, like performing dry runs, summarizing results, checking pipeline run status, cleaning intermediate files, lumping multiple samples into one job, and more. For details, consult the [looper docs](http://looper.readthedocs.io/).
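For instance, a few everyday looper subcommands look like this (a sketch; confirm the exact subcommand names against `looper --help` for your installed version):

```
looper check examples/test_project/test_config.yaml      # check run status for each sample
looper summarize examples/test_project/test_config.yaml  # aggregate results into a project summary
looper clean examples/test_project/test_config.yaml      # remove intermediate files
```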
+### Adapting the example for your own project + +Now, adapt the example project to your project. (There are more examples in the [examples/test_project](examples/test_project/) folder). You need two files for your project: - [project config file](examples/test_project/test_config.yaml) -- describes output locations, pointers to data, etc. - [sample annotation file](examples/test_project/test_annotation.csv) -- comma-separated value (CSV) list of your samples. @@ -109,23 +244,38 @@ Your annotation file must specify these columns: - read2 - whatever else you want -Run your project as above, by passing your project config file to `looper run`. More detailed instructions and advanced options for how to define your project are in the [Looper documentation on defining a project](http://looper.readthedocs.io/en/latest/define-your-project.html). Of particular interest may be the section on [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns). +Run your project as above, by passing your project config file to `looper` with `looper run project_config.yaml`. This project format is called the *standard Portable Encapsulated Project format* and is outlined in more detail in the [pepkit docs](https://pepkit.github.io/docs/home/). + +Looper can also summarize your results, monitor your runs, clean intermediate files to save disk space, and more. You can find additional details on what you can do with this in the [looper docs](http://looper.readthedocs.io/). + +## 5.3 How to run on a cluster or in a container using looper + +The pipeline itself does not specify any container or cluster resources, so you *could* just roll your own and submit individual jobs to a cluster however you choose. The easier way is to use `looper`'s built-in template system, which `looper` uses to build flexible shell scripts for job submission. These templates can be used to run jobs in a container, to submit to a cluster resource manager, or both. + +### 5.3.1 PEPENV + +To use `looper` templates, we must create a *computing environment configuration file* called `PEPENV`. In short, you will need to: + +1. Set up a compute configuration file that includes a containerized or cluster compute template (or both). +2. Point the environment variable `PEPENV` to the location of this file. +3. Run the pipeline with `looper run --compute PACKAGE` (where PACKAGE is specified in your PEPENV file). -### Arguments +This enables you to adjust your computing preferences on-the-fly when you run a project. -The arguments to the pipeline are the same whether you run it using looper or on the command line. +The complete description of setting up `looper` to use `PEPENV` is generic to any pipeline, and therefore omitted from this readme. If you want to use `looper` with containers or clusters, you should consult the complete docs in the [pepenv readme](https://github.com/pepkit/pepenv). Further instructions can also be found in the documentation on [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html) and [configuring looper to use linux containers](https://looper.readthedocs.io/en/dev/containers.html).
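Still, as a minimal sketch of steps 2 and 3 above (the `slurm` package name is only an example; use whatever package names your PEPENV file actually defines):

```
export PEPENV=/path/to/pepenv/compute_config.yaml   # hypothetical path to your PEPENV file
looper run examples/test_project/test_config.yaml --compute slurm
```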
-## Outline of analysis steps -### Prealignments +# 6. Outline of analysis steps -Because of the high proportion of mtDNA reads in ATAC-seq data, we recommend first aligning to the mitochondrial DNA. This pipeline does this using prealignments, which are passed to the pipeline via the `--prealignments` argument. This has several advantages: it speeds up the process dramatically, and reduces noise from erroneous alignments (NuMTs). To do this, we use a doubled mtDNA reference that allows even non-circular aligners to draw all reads to the mtDNA. The pipeline will also align *sequentially* to other references, if provided via the `--prealignments` command-line option. For example, you may download the `repbase` assembly to align to all repeats. We have provided indexed assemblies for mtDNA and other repeat classes in the [ref_decoy](https://github.com/databio/ref_decoy) repository. The pipeline is already configured to work with these, but you can change to however you wish by adjusting the +## 6.1 Prealignments -### FRIP +Because of the high proportion of mtDNA reads in ATAC-seq data, we recommend first aligning to the mitochondrial DNA. This pipeline does this using prealignments, which are passed to the pipeline via the `--prealignments` argument. This has several advantages: it speeds up the process dramatically, and reduces noise from erroneous alignments (NuMTs). To do this, we use a doubled mtDNA reference that allows even non-circular aligners to draw all reads to the mtDNA. The pipeline will also align *sequentially* to other references, if provided via the `--prealignments` command-line option. For example, you may download the `repbase` assembly to align to all repeats. We have provided indexed assemblies for mtDNA and other repeat classes in the [ref_decoy](https://github.com/databio/ref_decoy) repository. The pipeline is already configured to work with these, but you can also adjust this parameter in your `project_config.yaml` file (see [project_config.yaml](/examples/gold_atac/metadata/project_config.yaml)) as opposed to providing it at the command line. + +## 6.2 FRIP By default, the pipeline will calculate the FRIP as a quality control, using the peaks it identifies internally. If you want, it will **additionally** calculate a FRIP using a reference set of peaks (for example, from another experiment). For this you must provide a reference peak set (as a bed file) to the pipeline. You can do this by adding a column named `FRIP_ref` to your annotation sheet (see [pipeline_interface.yaml](/config/pipeline_interface.yaml)). Specify the reference peak filename (or use a derived column and specify the path in the project config file `data_sources` section). -### TSS enrichments +## 6.3 TSS enrichments In order to calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. Here's code to generate that. @@ -153,36 +303,42 @@ grep "level 1" ${GENOME}.gtf | grep "gene" | awk '{if($7=="+"){print $1"\t"$4"\ ``` -### Optional summary plots +## 6.4 Optional summary plots -1. Run `looper summarize` to generate a summary table in tab-separated values (TSV) format +1. Run `looper summarize` to generate a summary table in tab-separated values (TSV) format as well as an HTML report describing sample- and project-level results. ``` looper summarize examples/test_project/test_config.yaml ``` -2. Run `ATAC_Looper_Summary_plot.R` to produce summary plots. +2. You can specify in the [pipeline interface file](pipeline_interface.yaml) any additional summarizers you wish to run, such as `PEPATAC_summarizer.R`. Alternatively, you may run `PEPATAC_summarizer.R` directly to produce summary plots. -You must pass the full path to your TSV file that resulted from the call to looper summarize. +You must pass the path to the project configuration file you used to run looper.
``` -Rscript ATAC_Looper_Summary_plot.R +Rscript PEPATAC_summarizer.R examples/test_project/test_config.yaml ``` -This results in the output of multiple PDF plots in the directory containing the TSV input file. - -## Using a cluster +This results in the output of multiple PDF plots in a `summary/` subdirectory within the parent project directory. -Once you've specified your project to work with this pipeline, you will also inherit all the power of looper for your project. You can submit these jobs to a cluster with a simple change to your configuration file. Follow instructions in [configuring looper to use a cluster](http://looper.readthedocs.io/en/latest/cluster-computing.html). -Looper can also summarize your results, monitor your runs, clean intermediate files to save disk space, and more. You can find additional details on what you can do with this in the [looper docs](http://looper.readthedocs.io/). - -## Contributing +# 7. Contributing Pull requests welcome. Active development should occur in a development or feature branch. ## Contributors * Jin Xu, jinxu9@stanford.edu -* Nathan Sheffield +* Nathan Sheffield, nathan@code.databio.org +* Jason Smith, jasonsmith@virginia.edu +* Ryan Corces, rcorces@stanford.edu +* Vince Reuter, vince.reuter@gmail.com * Others... (add your name) + + +# 8. Troubleshooting + +A few common issues running the pipeline: + +- If `fseq` fails to create a peaks file and the log file indicates an `ArrayIndexOutOfBoundsException`, this is likely due to a low read count; it can probably be overcome by specifying the fragment size with fseq's `-f` option. + + diff --git a/config/pipeline_interface.yaml b/config/pipeline_interface.yaml index a9e32848..996960b8 100644 --- a/config/pipeline_interface.yaml +++ b/config/pipeline_interface.yaml @@ -1,5 +1,5 @@ -ATACseq.py: - name: ATACseq +pepatac.py: + name: PEPATAC looper_args: True required_input_files: [read1, read2] all_input_files: [read1, read2] diff --git a/config/protocol_mappings.yaml b/config/protocol_mappings.yaml index 6c741356..eab7904f 100644 --- a/config/protocol_mappings.yaml +++ b/config/protocol_mappings.yaml @@ -1,2 +1,2 @@ -ATAC: ATACseq.py -ATAC-SEQ: ATACseq.py +ATAC: pepatac.py +ATAC-SEQ: pepatac.py diff --git a/containers/pepatac.Dockerfile b/containers/pepatac.Dockerfile new file mode 100644 index 00000000..61fa0f3b --- /dev/null +++ b/containers/pepatac.Dockerfile @@ -0,0 +1,157 @@ +# Pull base image +FROM phusion/baseimage:0.10.1 + +# Who maintains this image +LABEL maintainer Jason Smith "jasonsmith@virginia.edu" + +# Version info +LABEL version 0.8.1 + +# Use baseimage-docker's init system.
+CMD ["/sbin/my_init"] + +# Install dependencies +RUN apt-get update && \ + DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes \ + curl \ + default-jre \ + default-jdk \ + git \ + libcommons-math3-java \ + libcurl4-gnutls-dev \ + libjbzip2-java \ + libpng-dev \ + libssl-dev \ + libtbb2 \ + libtbb-dev \ + openssl \ + pigz \ + python \ + python-pip python-dev build-essential \ + wget + +# Install MySQL server +RUN DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes mysql-server \ + mysql-client \ + libmysqlclient-dev + +# Install python tools +RUN pip install --upgrade pip +RUN pip install virtualenv && \ + pip install numpy && \ + pip install MACS2 && \ + pip install pararead && \ + pip install piper + +# Install R +RUN DEBIAN_FRONTEND=noninteractive apt-get --assume-yes install r-base r-base-dev && \ + echo "r <- getOption('repos'); r['CRAN'] <- 'http://cran.us.r-project.org'; options(repos = r);" > ~/.Rprofile && \ + Rscript -e "install.packages('devtools')" && \ + Rscript -e "devtools::install_github('pepkit/pepr')" && \ + Rscript -e "install.packages('gtable')" && \ + Rscript -e "install.packages('argparser')" && \ + Rscript -e "install.packages('ggplot2')" && \ + Rscript -e "install.packages('gplots')" && \ + Rscript -e "install.packages('grid')" && \ + Rscript -e "install.packages('scales')" && \ + Rscript -e "install.packages('data.table')" && \ + Rscript -e "install.packages('stringr')" + + +# Install bedtools +RUN DEBIAN_FRONTEND=noninteractive apt-get install --assume-yes \ + ant \ + bedtools + +# Install fastqc +WORKDIR /home/tools/ +RUN wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.7.zip && \ + unzip fastqc_v0.11.7.zip && \ + cd /home/tools/FastQC && \ + chmod 755 fastqc && \ + ln -s /home/tools/FastQC/fastqc /usr/bin/ + +# Install htslib +WORKDIR /home/src/ +RUN wget https://github.com/samtools/htslib/releases/download/1.7/htslib-1.7.tar.bz2 && \ + tar xf htslib-1.7.tar.bz2 && \ + cd /home/src/htslib-1.7 && \ + ./configure --prefix /home/tools/ && \ + make && \ + make install + +# Install samtools +WORKDIR /home/src/ +RUN wget https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2 && \ + tar xf samtools-1.7.tar.bz2 && \ + cd /home/src/samtools-1.7 && \ + ./configure && \ + make && \ + make install && \ + ln -s /home/src/samtools-1.7/samtools /usr/bin/ + +# Install bowtie2 +WORKDIR /home/src/ +RUN wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.4.1/bowtie2-2.3.4.1-source.zip && \ + unzip bowtie2-2.3.4.1-source.zip && \ + cd /home/src/bowtie2-2.3.4.1 && \ + make && \ + make install && \ + ln -s /home/src/bowtie2-2.3.4.1/bowtie2 /usr/bin/ + +# Install picard +WORKDIR /home/tools/bin +RUN wget https://github.com/broadinstitute/picard/releases/download/2.18.0/picard.jar && \ + chmod +x picard.jar + +# Install UCSC tools +WORKDIR /home/tools/ +RUN wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig && \ + wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig && \ + wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigCat && \ + wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedSort && \ + chmod +x /home/tools/bedGraphToBigWig && \ + chmod +x /home/tools/wigToBigWig && \ + chmod +x /home/tools/bigWigCat && \ + chmod +x /home/tools/bedSort && \ + ln -s /home/tools/bedGraphToBigWig /usr/bin/ && \ + ln -s /home/tools/wigToBigWig /usr/bin/ && \ + ln -s /home/tools/bigWigCat /usr/bin/ && \ + ln -s /home/tools/bedSort /usr/bin/ + +# 
Install Skewer +WORKDIR /home/src/ +RUN git clone git://github.com/relipmoc/skewer.git && \ + cd /home/src/skewer && \ + make && \ + make install + +# OPTIONAL REQUIREMENTS +# Install F-seq +WORKDIR /home/src/ +RUN wget https://github.com/aboyle/F-seq/archive/master.zip && \ + unzip master.zip && \ + cd /home/src/F-seq-master && \ + ant && \ + cd dist~/ && \ + tar xf fseq.tgz && \ + ln -s /home/src/F-seq-master/dist~/fseq/bin/fseq /usr/bin/ + +# Install Trimmomatic +WORKDIR /home/src/ +RUN wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.36.zip && \ + unzip Trimmomatic-0.36.zip && \ + chmod +x Trimmomatic-0.36/trimmomatic-0.36.jar + +# Set environment variables +ENV PATH=/home/tools/bin:/home/tools/:/home/tools/bin/kentUtils/:/home/src/F-seq-master/dist~/fseq/bin:/home/src/bowtie2-2.3.4.1:/home/src/skewer:/home/src/samtools-1.7:/home/src/Trimmomatic-0.36/:/home/src/htslib-1.7:$PATH \ + TRIMMOMATIC=/home/src/Trimmomatic-0.36/trimmomatic-0.36.jar \ + PICARD=/home/tools/bin/picard.jar \ + R_LIBS_USER=/usr/local/lib/R/site-library/ + +# Define default command +WORKDIR /home/ +CMD ["/bin/bash"] + +# Clean up APT when done. +RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \ No newline at end of file diff --git a/example_cmd.txt b/example_cmd.txt index b215593f..5b2705ca 100755 --- a/example_cmd.txt +++ b/example_cmd.txt @@ -1,13 +1,13 @@ -# Example commands of using pepATAC through pypiper. -# For the example commands of using pepATAC with looper, please see the xxx Users Guide. +# Example commands of using PEPATAC through pypiper. +# For the example commands of using PEPATAC with looper, please see the xxx Users Guide. INPUT=/path/to/sequencing_results/fastq_files -# run pepATAC on a human paired-end reads dataset using 5 threads: -python pipelines/ATACseq.py -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $INPUT/ATACseq_results_PE_R1.fastq.gz -I2 $INPUT/ATACseq_results_PE_R2.fastq.gz +# run PEPATAC on a human paired-end reads dataset using 5 threads: +python pipelines/pepatac.py -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $INPUT/pepatac_results_PE_R1.fastq.gz -I2 $INPUT/pepatac_results_PE_R2.fastq.gz -# run pepATAC on multiple datasets at the same time: <- this could be wrong as I don't see an explaination of how to use -I and -I2 with multiple samples -python pipelines/ATACseq.py -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $INPUT/ATACseq_results1_PE_R1.fastq.gz $INPUT/ATACseq_results2_PE_R1.fastq.gz $INPUT/ATACseq_results3_PE_R1.fastq.gz -I2 $INPUT/ATACseq_results1_PE_R2.fastq.gz $INPUT/ATACseq_results2_PE_R2.fastq.gz $INPUT/ATACseq_results3_PE_R2.fastq.gz +# run PEPATAC on multiple datasets at the same time: <- this could be wrong as I don't see an explanation of how to use -I and -I2 with multiple samples +python pipelines/pepatac.py -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $INPUT/pepatac_results1_PE_R1.fastq.gz $INPUT/pepatac_results2_PE_R1.fastq.gz $INPUT/pepatac_results3_PE_R1.fastq.gz -I2 $INPUT/pepatac_results1_PE_R2.fastq.gz $INPUT/pepatac_results2_PE_R2.fastq.gz $INPUT/pepatac_results3_PE_R2.fastq.gz # run multiple samples with a for loop: declare -a sample_name_arr=("sample1","sample2","sample3") @@ -15,76 +15,19 @@ for sample_name in "${sample_name_arr[@]}" do file1=$INPUT/{$file1}_PE_R1.fastq.gz file2=${file1/R1/R2} -python pipelines/ATACseq.py -P 5 -O output_folder -S $sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $file1 -I2 $file2 +python pipelines/pepatac.py -P 5 -O output_folder -S $sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $file1 -I2 $file2 done
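One caution about the for-loop example above (those lines are unchanged context in this diff): as written it has two shell bugs. The array elements are comma-separated, so bash treats them as a single string, and `file1` is built from `{$file1}` rather than from the sample name. A corrected sketch, assuming the same hypothetical file naming pattern:

```
declare -a sample_name_arr=("sample1" "sample2" "sample3")
for sample_name in "${sample_name_arr[@]}"
do
    file1=$INPUT/${sample_name}_PE_R1.fastq.gz
    file2=${file1/R1/R2}
    python pipelines/pepatac.py -P 5 -O output_folder -S $sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $file1 -I2 $file2
done
```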
-# run pepATAC on a mouse single-end reads dataset using 8 threads: -python pipelines/ATACseq.py -P 8 -O output_folder -S output_sample_name -G mm10 -Q single -C ATACseq.yaml -gs mm -I $INPUT/ATACseq_results_PE_R1.fastq.gz +# run PEPATAC on a mouse single-end reads dataset using 8 threads: +python pipelines/pepatac.py -P 8 -O output_folder -S output_sample_name -G mm10 -Q single -C pepatac.yaml -gs mm -I $INPUT/pepatac_results_PE_R1.fastq.gz -# run pepATAC with different trimming tools then default trimmomatic, currectly supports skewer and pyadapt: -python pipelines/ATACseq.py --skewer TRUE -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $INPUT/ATACseq_results_PE_R1.fastq.gz -I2 $INPUT/ATACseq_results_PE_R2.fastq.gz -python pipelines/ATACseq.py --pyadapt TRUE -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $INPUT/ATACseq_results_PE_R1.fastq.gz -I2 $INPUT/ATACseq_results_PE_R2.fastq.gz - -# re-run pepATAC and over-write the previous output: -python pipelines/ATACseq.py -N -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $INPUT/ATACseq_results_PE_R1.fastq.gz -I2 $INPUT/ATACseq_results_PE_R2.fastq.gz - -# continue to run pepATAC since a locked step (usually locked due to failure): -python pipelines/ATACseq.py -R -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C ATACseq.yaml -gs hs -I $INPUT/ATACseq_results_PE_R1.fastq.gz -I2 $INPUT/ATACseq_results_PE_R2.fastq.gz - - - - -# check xxxx for full list of parameter usage - -# full list of parameters are listed below: -python ATACseq.py -usage: ATACseq.py [-h] [-N] [-I2 INPUT_FILES2 [INPUT_FILES2 ...]] -[-M MEMORY_LIMIT] [-Q SINGLE_OR_PAIRED] [-S SAMPLE_NAME] -[-P NUMBER_OF_CORES] [-D] [-I INPUT_FILES [INPUT_FILES ...]] -[-F] [-R] [-C CONFIG_FILE] [-O PARENT_OUTPUT_FOLDER] -[-G GENOME_ASSEMBLY] [-gs GENOME_SIZE] -[--frip-ref-peaks FRIP_REF_PEAKS] [--pyadapt] [--skewer] -[--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V] - -Pipeline -optional arguments: --C CONFIG_FILE, --config CONFIG_FILE -pipeline config file in YAML format; relative paths -are considered relative to the pipeline script. -defaults to ATACseq.yaml --D, --dirty Make all cleanups manual --F, --follow Run all follow commands, even if command is not run ---frip-ref-peaks FRIP_REF_PEAKS -Reference peak set for calculating FRIP --G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY -identifier for genome assempbly (required) --gs GENOME_SIZE, --genome-size GENOME_SIZE -genome size for MACS2 --h, --help show this help message and exit --I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...] -One or more primary input files (required) --I2 INPUT_FILES2 [INPUT_FILES2 ...], --input2 INPUT_FILES2 [INPUT_FILES2 ...] -One or more secondary input files (if they exists); -for example, second read in pair. --M MEMORY_LIMIT, --mem MEMORY_LIMIT -Memory string for processes that accept memory limits -(like java) --N, --new-start Fresh start mode, overwrite all --O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER -parent output directory of the project (required). -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES -number of cores to use for parallel processes -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED -single or paired end? default: single -R, --recover Recover mode, overwrite locks -S SAMPLE_NAME, --sample-name SAMPLE_NAME -unique name for output subfolder and files (required) ---pyadapt Use pyadapter_trim for trimming? [Default: False] ---skewer Use skewer for trimming? [Default: False] ---prealignments PREALIGNMENTS [PREALIGNMENTS ...] -List of reference genomes to align to before primary -alignment. --V, --version show program's version number and exit' +# run PEPATAC with a different trimming tool than the default trimmomatic; currently supports skewer and pyadapt: +python pipelines/pepatac.py --skewer TRUE -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $INPUT/pepatac_results_PE_R1.fastq.gz -I2 $INPUT/pepatac_results_PE_R2.fastq.gz +python pipelines/pepatac.py --pyadapt TRUE -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $INPUT/pepatac_results_PE_R1.fastq.gz -I2 $INPUT/pepatac_results_PE_R2.fastq.gz +# re-run PEPATAC and overwrite the previous output: +python pipelines/pepatac.py -N -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $INPUT/pepatac_results_PE_R1.fastq.gz -I2 $INPUT/pepatac_results_PE_R2.fastq.gz +# resume PEPATAC from a locked step (usually locked due to failure): +python pipelines/pepatac.py -R -P 5 -O output_folder -S output_sample_name -G hg38 -Q paired -C pepatac.yaml -gs hs -I $INPUT/pepatac_results_PE_R1.fastq.gz -I2 $INPUT/pepatac_results_PE_R2.fastq.gz diff --git a/examples/chang_project/README.md b/examples/chang_project/README.md index c8cee7b6..b0dd278d 100644 --- a/examples/chang_project/README.md +++ b/examples/chang_project/README.md @@ -2,14 +2,14 @@ This folder contains an skeleton template with configuration options already set for the Chang lab compute environment. To set up a new project in the Chang lab compute environment, follow these instructions: -1. Follow the **installing** instructions in the main README to get prerequisites (install looper, pypiper, and clone the ATACseq repository). +1. Follow the **installing** instructions in the main README to get prerequisites (install looper, pypiper, and clone the PEPATAC repository). 2. Copy this folder ([examples/chang_project](examples/chang_project/)) and name the new folder for your project. -3. In your new folder, edit `project_config.yaml` to set the `metadata.pipelines_dir` option to the location of your cloned ATACseq repository. +3. In your new folder, edit `project_config.yaml` to set the `metadata.pipelines_dir` option to the location of your cloned PEPATAC repository. 4. Edit `project_config.yaml` to set the `data_sources.R1` and `data_sources.R2` to point to where you store fastq files. Your files must be named in some systematic pattern that can be created by populating sample variables, like `{sample_name}`. Detailed instructions are available here: [using looper derived columns](http://looper.readthedocs.io/en/latest/advanced.html#pointing-to-flexible-data-with-derived-columns). 5. Make any other (optional) changes you want to `project_config.yaml`. 6. Modify `project_annotation.csv` to include your sample list. 7. Run the project with `looper run path/to/project_config.yaml`. -Essentially, all this does differently from the default is that we have provided a configuration file.
See the `pipeline_config` section in the [project config file](examples/chang_project/project_config.yaml) -- we simply set this to `ATACseq_chang.yaml` for your project, and then include [ATACseq_chang.yaml](examples/chang_project/ATACseq_chang.yaml) parallel to the project config file. +Essentially, the only thing this does differently from the default is that we have provided a configuration file. See the `pipeline_config` section in the [project config file](examples/chang_project/project_config.yaml) -- we simply set this to `pepatac_chang.yaml` for your project, and then include [pepatac_chang.yaml](examples/chang_project/pepatac_chang.yaml) parallel to the project config file. Once you have it set up, you have all the power of looper for your project. It's simple to submit to a cluster, summarize your results, clean, and monitor your project. You can find additional details on what you can do with this in the [looper docs](http://looper.readthedocs.io/). \ No newline at end of file diff --git a/examples/chang_project/ATACseq_chang.yaml b/examples/chang_project/pepatac_chang.yaml similarity index 90% rename from examples/chang_project/ATACseq_chang.yaml rename to examples/chang_project/pepatac_chang.yaml index 15e37b43..3c05b872 100644 --- a/examples/chang_project/ATACseq_chang.yaml +++ b/examples/chang_project/pepatac_chang.yaml @@ -1,4 +1,4 @@ -# Configuration file for ATACseq pipeline based on pypiper +# PEPATAC configuration file for an ATAC-seq pipeline based on pypiper # basic tools # public tools diff --git a/examples/chang_project/project_config.yaml b/examples/chang_project/project_config.yaml index 4ce5e622..0bfb6924 100644 --- a/examples/chang_project/project_config.yaml +++ b/examples/chang_project/project_config.yaml @@ -33,6 +33,6 @@ implied_columns: prealignments: null pipeline_config: - ATACseq.py: ATACseq_chang.yaml # Use this to load Chang Lab settings - #ATACseq.py: null # Use this to load default environment settings + pepatac.py: pepatac_chang.yaml # Use this to load Chang Lab settings + #pepatac.py: null # Use this to load default environment settings \ No newline at end of file diff --git a/examples/gold_atac/README.md b/examples/gold_atac/README.md index db6bde27..68e7c2ea 100644 --- a/examples/gold_atac/README.md +++ b/examples/gold_atac/README.md @@ -8,7 +8,7 @@ Testing ATAC-seq pipeline on gold standard public ATAC-seq data. Download raw `fastq.gz` files (use `fastq-dump` from SRA). You may also use `get_geo.py` to download raw ATAC-seq reads from SRA and metadata from GEO: ``` -python get_geo.py -i ~/code/ATACseq/examples/gold_atac/metadata/gold_atac_gse.csv -r --fastq +python get_geo.py -i ~/code/pepatac/examples/gold_atac/metadata/gold_atac_gse.csv -r --fastq ``` I used the resulting file [metadata/annocomb_gold_atac_gse.csv](metadata/annocomb_gold_atac_gse.csv) to create the looper metadata sheet, [metadata/gold_atac_annotation.csv](metadata/gold_atac_annotation.csv). @@ -18,5 +18,10 @@ I create project config file and sampled test data.
The SRA fastq files should be stored in the location pointed to by `$SRAFQ` (see the `data_sources` section of the project config). ## Run pipeline ``` -looper run ${CODE}ATACseq/examples/gold_atac/metadata/project_config.yaml -d +looper run ${CODE}pepatac/examples/gold_atac/metadata/project_config.yaml -d ``` + + +## Test data + +There's a small test sample stored right here as `test1`. \ No newline at end of file diff --git a/examples/gold_atac/metadata/gold_atac_annotation.csv b/examples/gold_atac/metadata/gold_atac_annotation.csv index 85428e3b..7308439a 100644 --- a/examples/gold_atac/metadata/gold_atac_annotation.csv +++ b/examples/gold_atac/metadata/gold_atac_annotation.csv @@ -1,7 +1,7 @@ -sample_name,sample_description,treatment_description,organism,library,data_source,SRR,SRX,Sample_geo_accession,Sample_series_id,single_or_paired,Sample_instrument_model,read1,read2 -test1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,TEST_1,TEST_2 -gold1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 -gold2,ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 -gold3,ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500,SRA_1,SRA_2 -gold4,ATAC-seq from dendritic cell (ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 -gold5,ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 +sample_name,sample_description,treatment_description,organism,protocol,data_source,SRR,SRX,Sample_geo_accession,Sample_series_id,read_type,Sample_instrument_model,read1,read2 +test1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,TEST_1,TEST_2 +gold1,ATAC-seq from dendritic cell (ENCLB065VMV),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210416,SRX2523872,GSM2471255,GSE94182,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 +gold2,ATAC-seq from dendritic cell (ENCLB811FLK),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210450,SRX2523906,GSM2471300,GSE94222,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 +gold3,ATAC-seq from dendritic cell (ENCLB887PKE),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210398,SRX2523862,GSM2471249,GSE94177,PAIRED,Illumina NextSeq 500,SRA_1,SRA_2 +gold4,ATAC-seq from dendritic cell
(ENCLB586KIS),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210428,SRX2523884,GSM2471269,GSE94196,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 +gold5,ATAC-seq from dendritic cell (ENCLB384NOX),Homo sapiens dendritic in vitro differentiated cells treated with 0 ng/mL Lipopolysaccharide for 0 hours,human,ATAC-seq,SRA,SRR5210390,SRX2523854,GSM2471245,GSE94173,PAIRED,Illumina HiSeq 2000,SRA_1,SRA_2 diff --git a/examples/gold_atac/metadata/project_config.yaml b/examples/gold_atac/metadata/project_config.yaml index 5cfee97a..39196d7b 100644 --- a/examples/gold_atac/metadata/project_config.yaml +++ b/examples/gold_atac/metadata/project_config.yaml @@ -3,7 +3,7 @@ metadata: # relative paths are relative to this config file sample_annotation: gold_atac_annotation.csv # sheet listing all samples in the project output_dir: ${PROCESSED}gold_atac # ABSOLUTE PATH to the parent, shared space where project results go - pipelines_dir: "${CODEBASE}ATACseq" # ABSOLUTE PATH the directory where looper will find the pipeline repository + pipeline_interfaces: ${CODEBASE}pepatac/pipeline_interface.yaml # in your sample_annotation, columns with these names will be populated as described # in the data_sources section below @@ -19,8 +19,8 @@ data_sources: # This section describes paths to your data SRA: "${SRABAM}{SRR}.bam" SRA_1: "${SRAFQ}{SRR}_1.fastq.gz" SRA_2: "${SRAFQ}{SRR}_2.fastq.gz" - TEST_1: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r1.fastq.gz" - TEST_2: "${CODEBASE}ATACseq/examples/test_data/{sample_name}_r2.fastq.gz" + TEST_1: "${CODEBASE}pepatac/examples/test_data/{sample_name}_r1.fastq.gz" + TEST_2: "${CODEBASE}pepatac/examples/test_data/{sample_name}_r2.fastq.gz" implied_columns: organism: @@ -31,4 +31,4 @@ implied_columns: mouse: genome: mm10 macs_genome_size: "mm" - prealignments: null \ No newline at end of file + prealignments: null diff --git a/examples/test_project/test_annotation.csv b/examples/test_project/test_annotation.csv index 5e458587..03847313 100644 --- a/examples/test_project/test_annotation.csv +++ b/examples/test_project/test_annotation.csv @@ -1,2 +1,2 @@ -sample_name,library,organism,read1,read2,read_length,read_type +sample_name,protocol,organism,read1,read2,read_length,read_type test1,ATAC,human,test_data_R1,test_data_R2,75,paired diff --git a/examples/test_project/test_config.yaml b/examples/test_project/test_config.yaml index 66d87e33..a03c3f0f 100644 --- a/examples/test_project/test_config.yaml +++ b/examples/test_project/test_config.yaml @@ -1,11 +1,11 @@ # This project config file describes your project. See looper docs for details. 
+name: test_project # The name that summary files will be prefaced with metadata: # relative paths are relative to this config file sample_annotation: test_annotation.csv # sheet listing all samples in the project - output_dir: $HOME/atac_out # ABSOLUTE PATH to the parent, shared space where project results go - pipelines_dir: "../../pipeline_interface.yaml" # ABSOLUTE PATH the directory where looper will find the pipeline repository - merge_table: null # (optional) input for samples with more than one input file + output_dir: "$HOME/atac_out" # ABSOLUTE PATH to the parent, shared space where project results go + pipeline_interfaces: "$CODEBASE/pepatac/pipeline_interface.yaml" # ABSOLUTE PATH to the pipeline interface file # in your sample_annotation, columns with these names will be populated as described # in the data_sources section below @@ -18,8 +18,8 @@ data_sources: # This section describes paths to your data # Variable syntax: {column_name}. For example, use {sample_name} to populate # the file name with the value in the sample_name column for each sample. # example_data_source: "/path/to/data/{sample_name}_R1.fastq.gz" - test_data_R1: "examples/test_data/{sample_name}_r1.fastq.gz" - test_data_R2: "examples/test_data/{sample_name}_r2.fastq.gz" + test_data_R1: "$CODEBASE/pepatac/examples/test_data/{sample_name}_r1.fastq.gz" + test_data_R2: "$CODEBASE/pepatac/examples/test_data/{sample_name}_r2.fastq.gz" implied_columns: organism: diff --git a/pipeline_interface.yaml b/pipeline_interface.yaml index 2e11c4f5..aa323e70 100644 --- a/pipeline_interface.yaml +++ b/pipeline_interface.yaml @@ -1,11 +1,11 @@ protocol_mapping: - ATAC: ATACseq.py - ATAC-SEQ: ATACseq.py + ATAC: pepatac.py + ATAC-SEQ: pepatac.py pipelines: - ATACseq.py: - name: ATACseq - path: pipelines/ATACseq.py + pepatac.py: + name: PEPATAC + path: pipelines/pepatac.py looper_args: True required_input_files: [read1] all_input_files: [read1, read2] @@ -20,6 +20,27 @@ pipelines: "--frip-ref-peaks": FRIP_ref "--prealignments": prealignments "--genome-size": macs_genome_size + compute: + singularity_image: ${SIMAGES}pepatac + docker_image: databio/pepatac + summarizers: + - tools/PEPATAC_summarizer.R + summary_results: + - alignment_percent_file: + caption: "Alignment percent file" + description: "Plots percent of total alignment to all pre-alignments and primary genome." + thumbnail_path: "summary/{name}_alignmentPercent.png" + path: "summary/{name}_alignmentPercent.pdf" + - alignment_raw_file: + caption: "Alignment raw file" + description: "Plots raw alignment rates to all pre-alignments and primary genome." + thumbnail_path: "summary/{name}_alignmentRaw.png" + path: "summary/{name}_alignmentRaw.pdf" + - tss_file: + caption: "TSS enrichment file" + description: "Plots TSS scores for each sample."
+ thumbnail_path: "summary/{name}_TSSEnrichment.png" + path: "summary/{name}_TSSEnrichment.pdf" resources: default: file_size: "0" @@ -51,3 +72,5 @@ pipelines: cores: "16" mem: "64000" time: "03-00:00:00" + + diff --git a/pipelines/ATACseq.py b/pipelines/ATACseq.py deleted file mode 100755 index 0cf8eaba..00000000 --- a/pipelines/ATACseq.py +++ /dev/null @@ -1,698 +0,0 @@ -#!/usr/bin/env python -""" -ATACseq pipeline -""" - -__author__ = ["Jin Xu", "Nathan Sheffield"] -__email__ = "xujin937@gmail.com" -__version__ = "0.6.1" - - - -from argparse import ArgumentParser -import os -import sys -import tempfile -import pypiper -from pypiper import build_command - - -TOOLS_FOLDER = "tools" -PEAK_CALLERS = ["fseq", "macs2"] -TRIMMERS = ["trimmomatic", "pyadapt", "skewer"] - - - -def parse_arguments(): - """ - Parse command-line arguments passed to the pipeline. - """ - # Argument Parsing from yaml file - # ####################################################################################### - parser = ArgumentParser(description='Pipeline') - parser = pypiper.add_pypiper_args(parser, all_args = True) - - # Pipeline-specific arguments - parser.add_argument("-gs", "--genome-size", default="hs", type=str, - help="genome size for MACS2") - - parser.add_argument("--frip-ref-peaks", default=None, - dest="frip_ref_peaks", type=str, - help="Reference peak set for calculating FRIP") - - parser.add_argument("--peak-caller", dest="peak_caller", - default="macs2", choices=PEAK_CALLERS, - help="Name of peak caller; if 'fseq' fails to create " - "a peaks file and the log file indicates an " - "ArrayIndexOutOfBoundsException, this is likely " - "due to a low read count and can probably be " - "overcome by specifying fragment size with " - "fseq's -f option.") - - parser.add_argument("--trimmer", dest="trimmer", - default="skewer", choices=TRIMMERS, - help="Name of read trimming program") - - parser.add_argument("--prealignments", default=[], type=str, nargs="+", - help="Space-delimited list of reference genomes to align to before primary alignment.") - - parser.add_argument("-V", "--version", action="version", - version="%(prog)s {v}".format(v=__version__)) - - args = parser.parse_args() - - # TODO: determine if it's safe to handle this requirement with argparse. - # It may be that communication between pypiper and a pipeline via - # the pipeline interface (and/or) looper, and how the partial argument - # parsing is handled, that makes this more favorable. - if not args.input: - parser.print_help() - raise SystemExit - - return args - - - -def calc_frip(bamfile, peakfile, frip_func, pipeline_manager, aligned_reads_key="Aligned_reads"): - """ - Calculate the fraction of reads in peaks (FRIP). - - Use the given function and data from an aligned reads file and a called - peaks file, along with a PipelineManager, to calculate FRIP. - - :param str peakfile: path to called peaks file - :param callable frip_func: how to calculate the fraction of reads in peaks; - this must accept the path to the aligned reads file and the path to - the called peaks file as arguments. 
- :param str bamfile: path to aligned reads file - :param pypiper.PipelineManager pipeline_manager: the PipelineManager in use - for the pipeline calling this function - :param str aligned_reads_key: name of the key from a stats (key-value) file - to use to fetch the count of aligned reads - :return float: fraction of reads in peaks - """ - frip_cmd = frip_func(bamfile, peakfile) - num_peak_reads = pipeline_manager.checkprint(frip_cmd) - num_aligned_reads = pipeline_manager.get_stat(aligned_reads_key) - print(num_aligned_reads, num_peak_reads) - return float(num_peak_reads) / float(num_aligned_reads) - - - -def _align_with_bt2( - args, tools, unmap_fq1, unmap_fq2, - assembly_identifier, assembly_bt2, - outfolder, aligndir=None, bt2_opts_txt=None): - """ - A helper function to run alignments in series, so you can run one alignment followed - by another; this is useful for successive decoy alignments. - - :param argparse.Namespace args: binding between option name and argument, - e.g. from parsing command-line options - :param looper.models.AttributeDict tools: binding between tool name and - value, e.g. for tools/resources used by the pipeline - :param str unmap_fq1: path to unmapped read1 FASTQ file - :param str unmap_fq2: path to unmapped read2 FASTQ file - :param str assembly_identifier: text identifying a genome assembly for the - pipeline - :param str assembly_bt2: assembly-specific bowtie2 folder (index, etc.) - :param str outfolder: path to output directory for the pipeline - :param str aligndir: name of folder for temporary output - :param str bt2_opts_txt: command-line text for bowtie2 options - :return (str, str): pair (R1, R2) of paths to FASTQ files - """ - if os.path.exists(os.path.dirname(assembly_bt2)): - pm.timestamp("### Map to " + assembly_identifier) - if not aligndir: - align_subdir = "aligned_{}_{}".format(args.genome_assembly, assembly_identifier) - sub_outdir = os.path.join(outfolder, align_subdir) - else: - sub_outdir = os.path.join(outfolder, aligndir) - - ngstk.make_dir(sub_outdir) - bamname = "{}_{}.bam".format(args.sample_name, assembly_identifier) - mapped_bam = os.path.join(sub_outdir, bamname) - out_fastq_pre = os.path.join( - sub_outdir, args.sample_name + "_unmap_" + assembly_identifier) - out_fastq_bt2 = out_fastq_pre + '_R%.fq.gz' # bowtie2 unmapped filename format - - if not bt2_opts_txt: - # Default options - bt2_opts_txt = " -k 1" # Return only 1 alignment - bt2_opts_txt += " -D 20 -R 3 -N 1 -L 20 -i S,1,0.50" - bt2_opts_txt += " -X 2000" - - # samtools sort needs a temporary directory - tempdir = tempfile.mkdtemp(dir=sub_outdir) - pm.clean_add(tempdir) - - # Build bowtie2 command - cmd = tools.bowtie2 + " -p " + str(pm.cores) - cmd += bt2_opts_txt - cmd += " -x " + assembly_bt2 - cmd += " --rg-id " + args.sample_name - if args.paired_end: - cmd += " -1 " + unmap_fq1 + " -2 " + unmap_fq2 - else: - cmd += " -U " + unmap_fq1 - cmd += " --un-conc-gz " + out_fastq_bt2 - cmd += " | " + tools.samtools + " view -bS - -@ 1" # convert to bam - cmd += " | " + tools.samtools + " sort - -@ 1" # sort output - cmd += " -T " + tempdir - cmd += " -o " + mapped_bam - - # In this samtools sort command we print to stdout and then use > to - # redirect instead of `+ " -o " + mapped_bam` because then samtools - # uses a random temp file, so it won't choke if the job gets - # interrupted and restarted at this step. 
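To make the string concatenation above concrete, here is roughly the command it assembles for a paired-end sample against an `rCRSd` prealignment index (sample name, paths, and temp directory are placeholders); the same construction is retained in the new `pepatac.py`:

```python
# Illustrative assembled prealignment command (placeholder names throughout):
cmd = ("bowtie2 -p 4 -k 1 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 -X 2000"
       " -x /genomes/rCRSd/indexed_bowtie2/rCRSd --rg-id sample1"
       " -1 sample1_R1.trim.fastq -2 sample1_R2.trim.fastq"
       " --un-conc-gz sample1_unmap_rCRSd_R%.fq.gz"
       " | samtools view -bS - -@ 1"
       " | samtools sort - -@ 1 -T <tempdir> -o sample1_rCRSd.bam")
```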
- - pm.run(cmd, mapped_bam, follow=lambda: _count_alignment( - assembly_identifier, mapped_bam, args.paired_end)) - - # filter genome reads not mapped - # unmapped_bam = os.path.join(sub_outdir, args.sample_name + "_unmap_" + assembly_identifier + ".bam") - # cmd = tools.samtools + " view -b -@ " + str(pm.cores) + " -f 12 " - # cmd += mapped_bam + " > " + unmapped_bam - - # cmd2, unmap_fq1, unmap_fq2 = ngstk.bam_to_fastq_awk(unmapped_bam, out_fastq_pre, args.paired_end) - # pm.run([cmd,cmd2], unmap_fq2) - unmap_fq1 = out_fastq_pre + "_R1.fq.gz" - unmap_fq2 = out_fastq_pre + "_R2.fq.gz" - return unmap_fq1, unmap_fq2 - else: - msg = "No {} index found in {}; skipping.".format( - assembly_identifier, os.path.dirname(assembly_bt2)) - print(msg) - return unmap_fq1, unmap_fq2 - - - -def _count_alignment(assembly_identifier, aligned_bam, paired_end): - """ - This function counts the aligned reads and alignment rate and reports statistics. You - must have previously reported a "Trimmed_reads" result to get alignment rates. It is useful - as a follow function after any alignment step to quantify and report the number of reads - aligning, and the alignment rate to that reference. - - :param str aligned_bam: Path to the aligned bam file. - :param str assembly_identifier: String identifying the reference to which you aligned (can be anything) - :param bool paired_end: Whether the sequencing employed a paired-end strategy. - """ - ar = ngstk.count_mapped_reads(aligned_bam, paired_end) - pm.report_result("Aligned_reads_" + assembly_identifier, ar) - try: - # wrapped in try block in case Trimmed_reads is not reported in this pipeline. - tr = float(pm.get_stat("Trimmed_reads")) - except: - print("Trimmed reads is not reported.") - else: - res_key = "Alignment_rate_" + assembly_identifier - pm.report_result(res_key, round(float(ar) * 100 / float(tr), 2)) - - - -def _get_bowtie2_index(genomes_folder, genome_assembly): - """ - Create path to genome assembly folder with refgenie structure. - - Convenience function that returns the bowtie2 index prefix (to be passed - to bowtie2) for a genome assembly that follows the folder structure - produced by the RefGenie reference builder. - - :param str genomes_folder: path to central genomes directory, i.e. the - root for multiple assembly subdirectories - :param str genome_assembly: name of the specific assembly of interest, - e.g. 'mm10' - :return str: path to bowtie2 index subfolder within central assemblies - home, for assembly indicated - """ - return os.path.join(genomes_folder, - genome_assembly, "indexed_bowtie2", genome_assembly) - - - -def tool_path(tool_name): - """ - Return the path to a tool used by this pipeline. - - :param str tool_name: name of the tool (e.g., a script filename) - :return str: real, absolute path to tool (expansion and symlink resolution) - """ - - return os.path.join(os.path.dirname(os.path.dirname(__file__)), TOOLS_FOLDER, tool_name) - - -def main(): - """ - Main pipeline process. - """ - - args = parse_arguments() - - args.paired_end = args.single_or_paired == "paired" - # TODO: for now, paired end sequencing input is required. - #args.paired_end = True - - # Initialize, creating global PipelineManager and NGSTk instance for - # access in ancillary functions outside of main(). 
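`_get_bowtie2_index` (kept verbatim in `pepatac.py`) is a plain path join; the key detail is that bowtie2 expects an index *prefix*, not a file. A quick sketch:

```python
import os

def bt2_index_prefix(genomes_folder, assembly):
    # refgenie layout: <genomes>/<assembly>/indexed_bowtie2/<assembly>.*.bt2
    return os.path.join(genomes_folder, assembly, "indexed_bowtie2", assembly)

print(bt2_index_prefix("/genomes", "hg19"))
# /genomes/hg19/indexed_bowtie2/hg19  <- a prefix for `bowtie2 -x`, not a file
```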
- outfolder = os.path.abspath(os.path.join(args.output_parent, args.sample_name)) - global pm - pm = pypiper.PipelineManager( - name="ATACseq", outfolder=outfolder, args=args, version=__version__) - global ngstk - ngstk = pypiper.NGSTk(pm=pm) - - # Convenience alias - tools = pm.config.tools - param = pm.config.parameters - res = pm.config.resources - - # Set up reference resource according to genome prefix. - gfolder = os.path.join(res.genomes, args.genome_assembly) - res.chrom_sizes = os.path.join(gfolder, args.genome_assembly + ".chromSizes") - #res.TSS_file = os.path.join(gfolder, args.genome_assembly + ".refseq.TSS.txt") - res.TSS_file = os.path.join(gfolder, args.genome_assembly + "_TSS.tsv") - res.blacklist = os.path.join(gfolder, args.genome_assembly + ".blacklist.bed") - - # Get bowtie2 indexes - res.bt2_genome = _get_bowtie2_index(res.genomes, args.genome_assembly) - - # Adapter file can be set in the config; but if left null, we use a default. - res.adapters = res.adapters or tool_path("NexteraPE-PE.fa") - - param.outfolder = outfolder - - ################################################################################ - - print("Local input file: " + args.input[0]) - if args.input2: - print("Local input file: " + args.input2[0]) - - pm.report_result("File_mb", ngstk.get_file_size([x for x in [args.input, args.input2] if x is not None])) - pm.report_result("read_type", args.single_or_paired) - pm.report_result("Genome", args.genome_assembly) - - # ATACseq pipeline - # Each (major) step should have its own subfolder - - raw_folder = os.path.join(param.outfolder, "raw") - fastq_folder = os.path.join(param.outfolder, "fastq") - - pm.timestamp("### Merge/link and fastq conversion: ") - # This command will merge multiple inputs so you can use multiple sequencing lanes - # in a single pipeline run. - local_input_files = ngstk.merge_or_link( - [args.input, args.input2], raw_folder, args.sample_name) - cmd, out_fastq_pre, unaligned_fastq = ngstk.input_to_fastq( - local_input_files, args.sample_name, args.paired_end, fastq_folder) - pm.run(cmd, unaligned_fastq, - follow=ngstk.check_fastq(local_input_files, unaligned_fastq, args.paired_end)) - pm.clean_add(out_fastq_pre + "*.fastq", conditional=True) - print(local_input_files) - untrimmed_fastq1 = out_fastq_pre + "_R1.fastq" - untrimmed_fastq2 = out_fastq_pre + "_R2.fastq" if args.paired_end else None - - - ######################## - # Begin adapter trimming - ######################## - pm.timestamp("### Adapter trimming: ") - - # Create names for trimmed FASTQ files. - if args.trimmer == "trimmomatic": - trimming_prefix = os.path.join(fastq_folder, args.sample_name) - else: - trimming_prefix = out_fastq_pre - trimmed_fastq = trimming_prefix + "_R1.trim.fastq" - trimmed_fastq_R2 = trimming_prefix + "_R2.trim.fastq" - - # Create trimming command(s). - if args.trimmer == "pyadapt": - if not args.paried_end: - raise NotImplementedError("pyadapt trimming requires paired-end reads.") - # TODO: make pyadapt give options for output file name. - trim_cmd_chunks = [ - tool_path("pyadapter_trim.py"), - ("-a", local_input_files[0]), - ("-b", local_input_files[1]), - ("-o", out_fastq_pre), - "-u" - ] - cmd = build_command(trim_cmd_chunks) - - elif args.trimmer == "skewer": - # Create the primary skewer command. 
- trim_cmd_chunks = [ - tools.skewer, # + " --quiet" - ("-f", "sanger"), - ("-t", str(args.cores)), - ("-m", "pe" if args.paired_end else "any"), - ("-x", res.adapters), - "--quiet", - ("-o", out_fastq_pre), - untrimmed_fastq1, - untrimmed_fastq2 if args.paired_end else None - ] - trimming_command = build_command(trim_cmd_chunks) - - # Create the skewer file renaming commands. - if args.paired_end: - skewer_filename_pairs = [("{}-trimmed-pair1.fastq".format(out_fastq_pre), trimmed_fastq)] - skewer_filename_pairs.append(("{}-trimmed-pair2.fastq".format(out_fastq_pre), trimmed_fastq_R2)) - else: - skewer_filename_pairs = [("{}-trimmed.fastq".format(out_fastq_pre), trimmed_fastq)] - - trimming_renaming_commands = [build_command(["mv", old, new]) for old, new in skewer_filename_pairs] - # Rename the logfile. - #skewer_filename_pairs.append(("{}-trimmed.log".format(out_fastq_pre), trimLog)) - - # Pypiper submits the commands serially. - cmd = [trimming_command] + trimming_renaming_commands - - else: - # Default to trimmomatic. - - trim_cmd_chunks = [ - "{java} -Xmx{mem} -jar {trim} {PE} -threads {cores}".format( - java=tools.java, mem=pm.mem, trim=tools.trimmo, PE="PE" if args.paired_end else "", - cores=pm.cores), - local_input_files[0], - local_input_files[1], - trimmed_fastq, - trimming_prefix + "_R1_unpaired.fq", - trimmed_fastq_R2 if args.paired_end else "", - trimming_prefix + "_R2_unpaired.fq" if args.paired_end else "", - "ILLUMINACLIP:" + res.adapters + ":2:30:10" - ] - cmd = build_command(trim_cmd_chunks) - - pm.run(cmd, trimmed_fastq, - follow=ngstk.check_trim(trimmed_fastq, args.paired_end, trimmed_fastq_R2, - fastqc_folder=os.path.join(param.outfolder, "fastqc"))) - - pm.clean_add(os.path.join(fastq_folder, "*.fq"), conditional=True) - pm.clean_add(os.path.join(fastq_folder, "*.log"), conditional=True) - ######################### - # End adapter trimming - ######################### - - # Prepare variables for alignment step - unmap_fq1 = trimmed_fastq - unmap_fq2 = trimmed_fastq_R2 - - # Map to any requested prealignments - # We recommend mapping to chrM first for ATAC-seq data - pm.timestamp("### Prealignments") - if len(args.prealignments) == 0: - print("You may use `--prealignments` to align to references before the genome alignment step. 
See docs.") - else: - print("Prealignment assemblies: " + str(args.prealignments)) - # Loop through any prealignment references and map to them sequentially - for reference in args.prealignments: - unmap_fq1, unmap_fq2 = _align_with_bt2( - args, tools, unmap_fq1, unmap_fq2, reference, - assembly_bt2=_get_bowtie2_index(res.genomes, reference), - outfolder=param.outfolder, aligndir="prealignments") - - pm.timestamp("### Map to genome") - map_genome_folder = os.path.join(param.outfolder, "aligned_" + args.genome_assembly) - ngstk.make_dir(map_genome_folder) - - mapping_genome_bam = os.path.join(map_genome_folder, args.sample_name + ".pe.q10.sort.bam") - mapping_genome_bam_temp = os.path.join(map_genome_folder, args.sample_name + ".temp.bam") - unmap_genome_bam = os.path.join(map_genome_folder, args.sample_name + "_unmap.bam") - - bt2_options = " --very-sensitive" - bt2_options += " -X 2000" - - # samtools sort needs a temporary directory - tempdir = tempfile.mkdtemp(dir=map_genome_folder) - pm.clean_add(tempdir) - - cmd = tools.bowtie2 + " -p " + str(pm.cores) - cmd += bt2_options - cmd += " --rg-id " + args.sample_name - cmd += " -x " + res.bt2_genome - if args.paired_end: - cmd += " -1 " + unmap_fq1 + " -2 " + unmap_fq2 - else: - cmd += " -U " + unmap_fq1 - cmd += " | " + tools.samtools + " view -bS - -@ 1 " - #cmd += " -f 2 -q 10" # quality and pairing filter - cmd += " | " + tools.samtools + " sort - -@ 1" - cmd += " -T " + tempdir - cmd += " -o " + mapping_genome_bam_temp - - # Split genome mapping result bamfile into two: high-quality aligned reads (keepers) - # and unmapped reads (in case we want to analyze the altogether unmapped reads) - cmd2 = "samtools view -q 10 -b -@ " + str(pm.cores) + " " - if args.paired_end: - # add a step to accept only reads mapped in proper pair - cmd2 += "-f 2 " - - cmd2 += mapping_genome_bam_temp + " > " + mapping_genome_bam - - def check_alignment_genome(): - ar = ngstk.count_mapped_reads(mapping_genome_bam, args.paired_end) - pm.report_result("Aligned_reads", ar) - rr = float(pm.get_stat("Raw_reads")) - tr = float(pm.get_stat("Trimmed_reads")) - pm.report_result("Alignment_rate", round(float(ar) * 100 / float(tr), 2)) - pm.report_result("Total_efficiency", round(float(ar) * 100 / float(rr), 2)) - - pm.run([cmd, cmd2], mapping_genome_bam, follow=check_alignment_genome) - - # Now produce the unmapped file - cmd = "samtools view -b -@ " + str(pm.cores) - if args.paired_end: - # require both read and mate unmapped - cmd += " -f 12 " - else: - # require only read unmapped - cmd += " -f 4 " - - cmd += " " + mapping_genome_bam_temp + " > " + unmap_genome_bam - pm.run(cmd, unmap_genome_bam) - - pm.timestamp("### Remove dupes, build bigwig and bedgraph files") - - def estimate_lib_size(picard_log): - # In millions of reads; contributed by Ryan - cmd = "awk -F'\t' -f " + tool_path("extract_picard_lib.awk") + " " + picard_log - picard_est_lib_size = pm.checkprint(cmd) - pm.report_result("Picard_est_lib_size", picard_est_lib_size) - - rmdup_bam = os.path.join(map_genome_folder, args.sample_name + ".pe.q10.sort.rmdup.bam") - metrics_file = os.path.join(map_genome_folder, args.sample_name + "_picard_metrics_bam.txt") - picard_log = os.path.join(map_genome_folder, args.sample_name + "_picard_metrics_log.txt") - cmd3 = tools.java + " -Xmx" + str(pm.javamem) + " -jar " + tools.picard + " MarkDuplicates" - cmd3 += " INPUT=" + mapping_genome_bam - cmd3 += " OUTPUT=" + rmdup_bam - cmd3 += " METRICS_FILE=" + metrics_file - cmd3 += " VALIDATION_STRINGENCY=LENIENT" - cmd3 += 
" ASSUME_SORTED=true REMOVE_DUPLICATES=true > " + picard_log - cmd4 = tools.samtools + " index " + rmdup_bam - - pm.run([cmd3,cmd4], rmdup_bam, follow = lambda: estimate_lib_size(metrics_file)) - - # shift bam file and make bigwig file - # this script is only compatible with paired-end at the moment - if args.paired_end: - shift_bed = os.path.join(map_genome_folder, args.sample_name + ".pe.q10.sort.rmdup.bed") - cmd = tool_path("bam2bed_shift.pl") + " " + rmdup_bam - pm.run(cmd,shift_bed) - bedGraph = os.path.join( map_genome_folder , args.sample_name + ".pe.q10.sort.rmdup.bedGraph") - cmd = tools.bedtools + " genomecov -bg -split" - cmd += " -i " + shift_bed + " -g " + res.chrom_sizes + " > " + bedGraph - norm_bedGraph = os.path.join(map_genome_folder , args.sample_name + ".pe.q10.sort.rmdup.norm.bedGraph") - sort_bedGraph = os.path.join(map_genome_folder , args.sample_name + ".pe.q10.sort.rmdup.norm.sort.bedGraph") - cmd2 = "{} {} {}".format(tool_path("norm_bedGraph.pl"), bedGraph, norm_bedGraph) - bw_file = os.path.join(map_genome_folder , args.sample_name + ".pe.q10.rmdup.norm.bw") - - # bedGraphToBigWig requires lexicographical sort, which puts chr10 before chr2, for example - cmd3 = "LC_COLLATE=C sort -k1,1 -k2,2n " + norm_bedGraph + " > " + sort_bedGraph - cmd4 = tools.bedGraphToBigWig + " " + sort_bedGraph + " " + res.chrom_sizes + " " + bw_file - pm.run([cmd, cmd2, cmd3, cmd4], bw_file, nofail=True) - - - # "Exact cuts" are what I call nucleotide-resolution tracks of exact bases where the - # transposition (or DNAse cut) happened; - # In the past I used wigToBigWig on a combined wig file, but this ends up using a boatload - # of memory (more than 32GB); in contrast, running the wig -> bw conversion on each chrom - # and then combining them with bigWigCat requires much less memory. This was a memory bottleneck - # in the pipeline. - pm.timestamp("### Computing exact sites") - exact_folder = os.path.join(map_genome_folder + "_exact") - temp_exact_folder = os.path.join(exact_folder, "temp") - ngstk.make_dir(exact_folder) - ngstk.make_dir(temp_exact_folder) - temp_target = os.path.join(temp_exact_folder, "flag_completed") - exact_target = os.path.join(exact_folder, args.sample_name + "_exact.bw") - - # this is the old way to do it (to be removed) - # cmd = tool_path("bedToExactWig.pl") - # cmd += " " + shift_bed - # cmd += " " + res.chrom_sizes - # cmd += " " + temp_exact_folder - # cmd2 = "touch " + temp_target - # pm.run([cmd, cmd2], temp_target) - - # # Aside: since this bigWigCat command uses shell expansion, pypiper cannot profile memory. - # # we could use glob.glob if we want to preserve memory. like so: glob.glob(os.path.join(...)) - # # But I don't because bigWigCat uses little memory (<10GB). - # # This also sets us up nicely to process chromosomes in parallel. 
- # cmd = tools.bigWigCat + " " + exact_target + " " + os.path.join(temp_exact_folder, "*.bw") - # pm.run(cmd, exact_target) - # pm.clean_add(os.path.join(temp_exact_folder, "*.bw")) - - cmd = tool_path("bamSitesToWig.py") - cmd += " -b" # request bed output - cmd += " -i " + rmdup_bam - cmd += " -c " + res.chrom_sizes - cmd += " -o " + exact_target - cmd += " -p " + str(max(1, int(pm.cores) * 2/3)) - cmd2 = "touch " + temp_target - pm.run([cmd, cmd2], temp_target) - - if not args.paired_end: - # TODO, make this always (not just single-end) - shift_bed = exact_target + ".bed" - - # TSS enrichment - if not os.path.exists(res.TSS_file): - print("Skipping TSS -- TSS enrichment requires TSS annotation file: {}".format(res.TSS_file)) - else: - pm.timestamp("### Calculate TSS enrichment") - QC_folder = os.path.join(param.outfolder, "QC_" + args.genome_assembly) - ngstk.make_dir(QC_folder) - - Tss_enrich = os.path.join(QC_folder, args.sample_name + ".TssEnrichment") - cmd = tool_path("pyTssEnrichment.py") - cmd += " -a " + rmdup_bam + " -b " + res.TSS_file + " -p ends" - cmd += " -c " + str(pm.cores) - cmd += " -e 2000 -u -v -s 4 -o " + Tss_enrich - pm.run(cmd, Tss_enrich, nofail=True) - - #Call Rscript to plot TSS Enrichment - Tss_plot = os.path.join(QC_folder, args.sample_name + ".TssEnrichment.pdf") - cmd = "Rscript " + tool_path("ATAC_Rscript_TSSenrichmentPlot_pyPiper.R") - cmd += " " + Tss_enrich + " pdf" - pm.run(cmd, Tss_plot, nofail=True) - - # Always plot strand specific TSS enrichment. - # added by Ryan 2/10/17 to calculate TSS score as numeric and to include in summary stats - # This could be done in prettier ways which I'm open to. Just adding for the idea - with open(Tss_enrich) as f: - floats = map(float,f) - Tss_score = (sum(floats[1950:2050])/100)/(sum(floats[1:200])/200) - pm.report_result("TSS_Score", Tss_score) - try: - # Just wrapping this in a try temporarily so that old versions of - # pypiper will work. v0.6 release of pypiper adds this function - pm.report_figure("TSS enrichment", Tss_plot) - except: - pass - - # fragment distribution - fragL = os.path.join(QC_folder , args.sample_name + ".fragLen.txt") - frag_dist_tool = tool_path("fragment_length_dist.pl") - cmd = build_command([tools.perl, frag_dist_tool, rmdup_bam, fragL]) - frag_length_counts_file = args.sample_name + ".frag_count.txt" - fragL_count = os.path.join(QC_folder, frag_length_counts_file) - cmd1 = "sort -n " + fragL + " | uniq -c > " + fragL_count - fragL_dis1 = os.path.join(QC_folder, args.sample_name + ".fragL.distribution.pdf") - fragL_dis2 = os.path.join(QC_folder, args.sample_name + ".fragL.distribution.txt") - cmd2 = build_command( - [tools.Rscript, tool_path("fragment_length_dist.R"), - fragL, fragL_count, fragL_dis1, fragL_dis2]) - pm.run([cmd, cmd1, cmd2], fragL_dis1, nofail=True) - - # Peak calling - - pm.timestamp("### Call peaks") - - def report_peak_count(): - num_peaksfile_lines = int(ngstk.count_lines(peak_output_file).strip()) - num_peaks = max(0, num_peaksfile_lines - 1) - pm.report_result("Peak_count", num_peaks) - - peak_folder = os.path.join(param.outfolder, "peak_calling_" + args.genome_assembly) - ngstk.make_dir(peak_folder) - peak_output_file = os.path.join(peak_folder, args.sample_name + "_peaks.narrowPeak") - peak_input_file = shift_bed - - if args.peak_caller == "fseq": - fseq_cmd_chunks = [tools.fseq, ("-o", peak_folder)] - - # Parse only a subset of fseq options. 
- for fseq_opt in ["of", "l", "t", "s"]: - fseq_value = param.fseq[fseq_opt] - # TODO: use more natural try/except once PipelineManager parameters AD is strict. - if fseq_value == fseq_opt: - # Non-strict pipeline parameters AttributeDict returns key itself if missing. - continue - # We're building a command, so even non-text values need no special handling. - fseq_optval = ("-{}".format(fseq_opt), fseq_value) - fseq_cmd_chunks.append(fseq_optval) - - # Create the peak calling command - fseq_cmd = build_command(fseq_cmd_chunks) - - # Create the file merge/delete commands. - chrom_peak_files = os.path.join(peak_folder, "*.npf") - merge_chrom_peaks_files = "cat {peakfiles} > {combined_peak_file}".format( - peakfiles=chrom_peak_files, combined_peak_file=peak_output_file) - delete_chrom_peaks_files = "rm {}".format(chrom_peak_files) - - # Pypiper serially executes the commands. - cmd = [fseq_cmd, merge_chrom_peaks_files, delete_chrom_peaks_files] - - else: - # MACS2 - macs_cmd_chunks = [ - "{} callpeak".format(tools.macs2), - ("-t", peak_input_file), - "-f BED", - ("-g", args.genome_size), - ("--outdir", peak_folder), - ("-n", args.sample_name), - ("-q", param.macs2.q), - ("--shift", param.macs2.shift), - "--nomodel" - ] - # Note: required input file is non-positional ("treatment" file -t) - cmd = build_command(macs_cmd_chunks) - - # Call peaks and report peak count. - pm.run(cmd, peak_output_file, follow=report_peak_count) - - - # Filter peaks in blacklist. - if os.path.exists(res.blacklist): - filter_peak = os.path.join(peak_folder, args.sample_name + "_peaks.narrowPeak.rmBlacklist") - cmd = tools.bedtools + " intersect " + " -a " + peak_output_file + " -b " + res.blacklist + " -v >" + filter_peak - - pm.run(cmd, filter_peak) - - pm.timestamp("### # Calculate fraction of reads in peaks (FRIP)") - - frip = calc_frip(rmdup_bam, peak_output_file, frip_func=ngstk.simple_frip, pipeline_manager=pm) - pm.report_result("FRIP", frip) - - if args.frip_ref_peaks and os.path.exists(args.frip_ref_peaks): - # Use an external reference set of peaks instead of the peaks called from this run - frip_ref = calc_frip(rmdup_bam, args.frip_ref_peaks, frip_func=ngstk.simple_frip, pipeline_manager=pm) - pm.report_result("FRIP_ref", frip_ref) - - pm.stop_pipeline() - - - -if __name__ == '__main__': - pm = None - ngstk = None # TODO: remove once ngstk become less instance-y, more function-y. - try: - sys.exit(main()) - except KeyboardInterrupt: - print("Pipeline aborted.") - sys.exit(1) diff --git a/pipelines/pepatac.py b/pipelines/pepatac.py new file mode 100755 index 00000000..4fc2386e --- /dev/null +++ b/pipelines/pepatac.py @@ -0,0 +1,924 @@ +#!/usr/bin/env python +""" +PEPATAC - ATACseq pipeline +""" + +__author__ = ["Jin Xu", "Nathan Sheffield", "Jason Smith"] +__email__ = "xujin937@gmail.com" +__version__ = "0.7.0" + + +from argparse import ArgumentParser +import os +import sys +import tempfile +import pypiper +from pypiper import build_command + +TOOLS_FOLDER = "tools" +PEAK_CALLERS = ["fseq", "macs2"] +TRIMMERS = ["trimmomatic", "pyadapt", "skewer"] + + +def parse_arguments(): + """ + Parse command-line arguments passed to the pipeline. 
+ """ + # Argument Parsing from yaml file + ########################################################################### + parser = ArgumentParser(description='PEPATAC version ' + __version__) + parser = pypiper.add_pypiper_args(parser, groups= + ['pypiper', 'looper', 'ngs'], required=["input", "genome", "sample-name"]) + + # Pipeline-specific arguments + parser.add_argument("-gs", "--genome-size", default="hs", type=str, + help="genome size for MACS2") + + parser.add_argument("--frip-ref-peaks", default=None, + dest="frip_ref_peaks", type=str, + help="Reference peak set for calculating FRIP") + + parser.add_argument("--peak-caller", dest="peak_caller", + default="macs2", choices=PEAK_CALLERS, + help="Name of peak caller") + + parser.add_argument("--trimmer", dest="trimmer", + default="skewer", choices=TRIMMERS, + help="Name of read trimming program") + + parser.add_argument("--prealignments", default=[], type=str, nargs="+", + help="Space-delimited list of reference genomes to " + "align to before primary alignment.") + + parser.add_argument("-V", "--version", action="version", + version="%(prog)s {v}".format(v=__version__)) + + args = parser.parse_args() + + # TODO: determine if it's safe to handle this requirement with argparse. + # It may be that communication between pypiper and a pipeline via + # the pipeline interface (and/or) looper, and how the partial argument + # parsing is handled, that makes this more favorable. + if not args.input: + parser.print_help() + raise SystemExit + + return args + + +def calc_frip(bamfile, peakfile, frip_func, pipeline_manager, + aligned_reads_key="Aligned_reads"): + """ + Calculate the fraction of reads in peaks (FRIP). + + Use the given function and data from an aligned reads file and a called + peaks file, along with a PipelineManager, to calculate FRIP. + + :param str peakfile: path to called peaks file + :param callable frip_func: how to calculate the fraction of reads in peaks; + this must accept the path to the aligned reads file and the path to + the called peaks file as arguments. + :param str bamfile: path to aligned reads file + :param pypiper.PipelineManager pipeline_manager: the PipelineManager in use + for the pipeline calling this function + :param str aligned_reads_key: name of the key from a stats (key-value) file + to use to fetch the count of aligned reads + :return float: fraction of reads in peaks + """ + frip_cmd = frip_func(bamfile, peakfile) + num_peak_reads = pipeline_manager.checkprint(frip_cmd) + num_aligned_reads = pipeline_manager.get_stat(aligned_reads_key) + print(num_aligned_reads, num_peak_reads) + return float(num_peak_reads) / float(num_aligned_reads) + + +def _align_with_bt2(args, tools, unmap_fq1, unmap_fq2, assembly_identifier, + assembly_bt2, outfolder, aligndir=None, bt2_opts_txt=None): + """ + A helper function to run alignments in series, so you can run one alignment + followed by another; this is useful for successive decoy alignments. + + :param argparse.Namespace args: binding between option name and argument, + e.g. from parsing command-line options + :param looper.models.AttributeDict tools: binding between tool name and + value, e.g. for tools/resources used by the pipeline + :param str unmap_fq1: path to unmapped read1 FASTQ file + :param str unmap_fq2: path to unmapped read2 FASTQ file + :param str assembly_identifier: text identifying a genome assembly for the + pipeline + :param str assembly_bt2: assembly-specific bowtie2 folder (index, etc.) 
+ :param str outfolder: path to output directory for the pipeline + :param str aligndir: name of folder for temporary output + :param str bt2_opts_txt: command-line text for bowtie2 options + :return (str, str): pair (R1, R2) of paths to FASTQ files + """ + if os.path.exists(os.path.dirname(assembly_bt2)): + pm.timestamp("### Map to " + assembly_identifier) + if not aligndir: + align_subdir = "aligned_{}_{}".format(args.genome_assembly, + assembly_identifier) + sub_outdir = os.path.join(outfolder, align_subdir) + else: + sub_outdir = os.path.join(outfolder, aligndir) + + ngstk.make_dir(sub_outdir) + bamname = "{}_{}.bam".format(args.sample_name, assembly_identifier) + mapped_bam = os.path.join(sub_outdir, bamname) + out_fastq_pre = os.path.join( + sub_outdir, args.sample_name + "_unmap_" + assembly_identifier) + # bowtie2 unmapped filename format + out_fastq_bt2 = out_fastq_pre + '_R%.fq.gz' + + if not bt2_opts_txt: + # Default options + bt2_opts_txt = " -k 1" # Return only 1 alignment + bt2_opts_txt += " -D 20 -R 3 -N 1 -L 20 -i S,1,0.50" + bt2_opts_txt += " -X 2000" + + # samtools sort needs a temporary directory + tempdir = tempfile.mkdtemp(dir=sub_outdir) + pm.clean_add(tempdir) + + # Build bowtie2 command + cmd = tools.bowtie2 + " -p " + str(pm.cores) + cmd += bt2_opts_txt + cmd += " -x " + assembly_bt2 + cmd += " --rg-id " + args.sample_name + if args.paired_end: + cmd += " -1 " + unmap_fq1 + " -2 " + unmap_fq2 + else: + cmd += " -U " + unmap_fq1 + cmd += " --un-conc-gz " + out_fastq_bt2 + # TODO: Pipes break singularity exec command... use shell? + cmd += " | " + tools.samtools + " view -bS - -@ 1" # convert to bam + cmd += " | " + tools.samtools + " sort - -@ 1" # sort output + cmd += " -T " + tempdir + cmd += " -o " + mapped_bam + + # In this samtools sort command we print to stdout and then use > to + # redirect instead of `+ " -o " + mapped_bam` because then samtools + # uses a random temp file, so it won't choke if the job gets + # interrupted and restarted at this step. + + # TODO: this follow command leads to a samtools not found error... + # likely need to modify _count_alignment function... + pm.run(cmd, mapped_bam, follow=lambda: _count_alignment( + assembly_identifier, mapped_bam, args.paired_end), + container=pm.container) + + # filter genome reads not mapped + # unmapped_bam = os.path.join(sub_outdir, args.sample_name + + # "_unmap_" + assembly_identifier + ".bam") + # cmd = tools.samtools + " view -b -@ " + str(pm.cores) + " -f 12 " + # cmd += mapped_bam + " > " + unmapped_bam + + # cmd2, unmap_fq1, unmap_fq2 = \ + # ngstk.bam_to_fastq_awk(unmapped_bam, out_fastq_pre, args.paired_end) + # pm.run([cmd,cmd2], unmap_fq2, container=pm.container) + unmap_fq1 = out_fastq_pre + "_R1.fq.gz" + unmap_fq2 = out_fastq_pre + "_R2.fq.gz" + return unmap_fq1, unmap_fq2 + else: + msg = "No {} index found in {}; skipping.".format( + assembly_identifier, os.path.dirname(assembly_bt2)) + print(msg) + return unmap_fq1, unmap_fq2 + + +def _count_alignment(assembly_identifier, aligned_bam, paired_end): + """ + This function counts the aligned reads and alignment rate and reports + statistics. You must have previously reported a "Trimmed_reads" result to + get alignment rates. It is useful as a follow function after any alignment + step to quantify and report the number of reads aligning, and the alignment + rate to that reference. + + :param str aligned_bam: Path to the aligned bam file. 
+ :param str assembly_identifier: String identifying the reference to which + you aligned (can be anything) + :param bool paired_end: Whether the sequencing employed a paired-end + strategy. + """ + ar = ngstk.count_mapped_reads(aligned_bam, paired_end) + pm.report_result("Aligned_reads_" + assembly_identifier, ar) + try: + # wrapped in try block in case Trimmed_reads is not reported in this + # pipeline. + tr = float(pm.get_stat("Trimmed_reads")) + except: + print("Trimmed reads is not reported.") + else: + res_key = "Alignment_rate_" + assembly_identifier + pm.report_result(res_key, round(float(ar) * 100 / float(tr), 2)) + + +def _get_bowtie2_index(genomes_folder, genome_assembly): + """ + Create path to genome assembly folder with refgenie structure. + + Convenience function that returns the bowtie2 index prefix (to be passed + to bowtie2) for a genome assembly that follows the folder structure + produced by the RefGenie reference builder. + + :param str genomes_folder: path to central genomes directory, i.e. the + root for multiple assembly subdirectories + :param str genome_assembly: name of the specific assembly of interest, + e.g. 'mm10' + :return str: path to bowtie2 index subfolder within central assemblies + home, for assembly indicated + """ + return os.path.join(genomes_folder, genome_assembly, + "indexed_bowtie2", genome_assembly) + + +def _check_bowtie2_index(genomes_folder, genome_assembly): + """ + Confirm bowtie2 index is present. + + Checks by simple file count whether the bowtie2 index for a genome + assembly (as produced by the RefGenie reference builder) contains the + correct number of non-empty files. + + :param str genomes_folder: path to central genomes directory, i.e. the + root for multiple assembly subdirectories + :param str genome_assembly: name of the specific assembly of interest, + e.g. 'mm10' + """ + bt2_path = os.path.join(genomes_folder, genome_assembly, "indexed_bowtie2") + + if os.path.isdir(bt2_path): + if not os.listdir(bt2_path): + err_msg = "There are no {} bowtie2 index files" + pm.fail_pipeline(Exception(err_msg.format(genome_assembly))) + else: + path, dirs, files = next(os.walk(bt2_path)) + else: + err_msg = "There is no bowtie2 index directory for {}" + pm.fail_pipeline(Exception(err_msg.format(genome_assembly))) + # check for bowtie small index + if [bt for bt in files if bt.endswith('bt2')]: + bt = ['.1.bt2', '.2.bt2', '.3.bt2', '.4.bt2', + '.rev.1.bt2', '.rev.2.bt2'] + # check for bowtie large index + elif [bt for bt in files if bt.endswith('bt2l')]: + bt = ['.1.bt2l', '.2.bt2l', '.3.bt2l', '.4.bt2l', + '.rev.1.bt2l', '.rev.2.bt2l'] + # if neither file type present, fail + else: + err_msg = "There are no bowtie2 index files for {} in {}" + pm.fail_pipeline(Exception( + err_msg.format(genome_assembly, genomes_folder))) + + bt_expected = [genome_assembly + s for s in bt] + bt_present = [bt for bt in files if any(s in bt for s in bt_expected)] + bt_missing = list(set(bt_expected) - set(bt_present)) + # if there are any missing files (bowtie2 file naming is constant), fail + if bt_missing: + err_msg = "The {} bowtie2 index is missing the following file(s): {}" + pm.fail_pipeline(IOError( + err_msg.format(genome_assembly, + ', '.join(str(s) for s in bt_missing)))) + else: + for f in bt_present: + # If any bowtie2 files are empty, fail + if os.stat(os.path.join(bt2_path, f)).st_size == 0: + err_msg = "The bowtie2 index file, {}, is empty." 
+ pm.fail_pipeline(IOError(err_msg.format(f))) + + genome_file = genome_assembly + ".fa" + fa_files = [fa for fa in files if genome_file in fa] + if not fa_files: + # The fasta file does not exist + pm.fail_pipeline(IOError("{}.fa does not exist.".format(genome_assembly))) + for f in fa_files: + if os.stat(os.path.join(bt2_path, f)).st_size == 0: + pm.fail_pipeline(IOError("{} is an empty file.".format(f))) + +def tool_path(tool_name): + """ + Return the path to a tool used by this pipeline. + + :param str tool_name: name of the tool (e.g., a script filename) + :return str: real, absolute path to tool (expansion and symlink resolution) + """ + + return os.path.join(os.path.dirname(os.path.dirname(__file__)), + TOOLS_FOLDER, tool_name) + + +def main(): + """ + Main pipeline process. + """ + + args = parse_arguments() + + args.paired_end = args.single_or_paired == "paired" + + # Initialize, creating global PipelineManager and NGSTk instance for + # access in ancillary functions outside of main(). + outfolder = os.path.abspath( + os.path.join(args.output_parent, args.sample_name)) + global pm + pm = pypiper.PipelineManager( + name="PEPATAC", outfolder=outfolder, args=args, version=__version__) + global ngstk + ngstk = pypiper.NGSTk(pm=pm) + + # Convenience alias + tools = pm.config.tools + param = pm.config.parameters + res = pm.config.resources + + # Set up reference resource according to genome prefix. + gfolder = os.path.join(res.genomes, args.genome_assembly) + res.chrom_sizes = os.path.join( + gfolder, args.genome_assembly + ".chromSizes") + # res.TSS_file = os.path.join(gfolder, args.genome_assembly + + # ".refseq.TSS.txt") + res.TSS_file = os.path.join(gfolder, args.genome_assembly + "_TSS.tsv") + res.blacklist = os.path.join( + gfolder, args.genome_assembly + ".blacklist.bed") + + # Get bowtie2 indexes + res.bt2_genome = _get_bowtie2_index(res.genomes, args.genome_assembly) + _check_bowtie2_index(res.genomes, args.genome_assembly) + for reference in args.prealignments: + _check_bowtie2_index(res.genomes, reference) + + # Adapter file can be set in the config; if left null, we use a default. + res.adapters = res.adapters or tool_path("NexteraPE-PE.fa") + + param.outfolder = outfolder + + print("Local input file: " + args.input[0]) + if args.input2: + print("Local input file: " + args.input2[0]) + + container = None + + ########################################################################### + + pm.report_result( + "File_mb", + ngstk.get_file_size( + [x for x in [args.input, args.input2] if x is not None])) + pm.report_result("read_type", args.single_or_paired) + pm.report_result("Genome", args.genome_assembly) + + # ATACseq pipeline + # Each (major) step should have its own subfolder + + raw_folder = os.path.join(param.outfolder, "raw") + fastq_folder = os.path.join(param.outfolder, "fastq") + + pm.timestamp("### Merge/link and fastq conversion: ") + # This command will merge multiple inputs so you can use multiple + # sequencing lanes in a single pipeline run. 
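The comment above refers to `ngstk.merge_or_link`, which the next lines call; a minimal sketch of that merge-or-link behavior, assuming the real pypiper implementation differs in its details:

```python
import os

def merge_or_link_sketch(input_paths, raw_folder, sample_name):
    """Link a single input into raw/, or concatenate multiple lanes (sketch)."""
    input_paths = [p for p in input_paths if p]  # drop missing read2, etc.
    target = os.path.join(raw_folder, sample_name + ".fastq.gz")
    if len(input_paths) == 1:
        os.symlink(os.path.abspath(input_paths[0]), target)  # no copy needed
    else:
        # concatenated gzip members still form a valid gzip stream
        with open(target, "wb") as out:
            for p in input_paths:
                with open(p, "rb") as f:
                    out.write(f.read())
    return target
```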
+    local_input_files = ngstk.merge_or_link(
+        [args.input, args.input2], raw_folder, args.sample_name)
+    cmd, out_fastq_pre, unaligned_fastq = ngstk.input_to_fastq(
+        local_input_files, args.sample_name, args.paired_end, fastq_folder)
+    pm.run(cmd, unaligned_fastq,
+           follow=ngstk.check_fastq(
+               local_input_files, unaligned_fastq, args.paired_end),
+           container=pm.container)
+    pm.clean_add(out_fastq_pre + "*.fastq", conditional=True)
+    print(local_input_files)
+    untrimmed_fastq1 = out_fastq_pre + "_R1.fastq"
+    untrimmed_fastq2 = out_fastq_pre + "_R2.fastq" if args.paired_end else None
+
+    ########################
+    # Begin adapter trimming
+    ########################
+    pm.timestamp("### Adapter trimming: ")
+
+    # Create names for trimmed FASTQ files.
+    if args.trimmer == "trimmomatic":
+        trimming_prefix = os.path.join(fastq_folder, args.sample_name)
+    else:
+        trimming_prefix = out_fastq_pre
+    trimmed_fastq = trimming_prefix + "_R1.trim.fastq"
+    trimmed_fastq_R2 = trimming_prefix + "_R2.trim.fastq"
+
+    # Create trimming command(s).
+    if args.trimmer == "pyadapt":
+        if not args.paired_end:
+            raise NotImplementedError(
+                "pyadapt trimming requires paired-end reads.")
+        # TODO: make pyadapt give options for output file name.
+        trim_cmd_chunks = [
+            tool_path("pyadapter_trim.py"),
+            ("-a", local_input_files[0]),
+            ("-b", local_input_files[1]),
+            ("-o", out_fastq_pre),
+            "-u"
+        ]
+        cmd = build_command(trim_cmd_chunks)
+
+    elif args.trimmer == "skewer":
+        # Create the primary skewer command.
+        trim_cmd_chunks = [
+            tools.skewer,  # + " --quiet"
+            ("-f", "sanger"),
+            ("-t", str(args.cores)),
+            ("-m", "pe" if args.paired_end else "any"),
+            ("-x", res.adapters),
+            "--quiet",
+            ("-o", out_fastq_pre),
+            untrimmed_fastq1,
+            untrimmed_fastq2 if args.paired_end else None
+        ]
+        trimming_command = build_command(trim_cmd_chunks)
+
+        # Create the skewer file renaming commands.
+        if args.paired_end:
+            skewer_filename_pairs = \
+                [("{}-trimmed-pair1.fastq".format(out_fastq_pre),
+                  trimmed_fastq)]
+            skewer_filename_pairs.append(
+                ("{}-trimmed-pair2.fastq".format(out_fastq_pre),
+                 trimmed_fastq_R2))
+        else:
+            skewer_filename_pairs = \
+                [("{}-trimmed.fastq".format(out_fastq_pre), trimmed_fastq)]
+
+        trimming_renaming_commands = [build_command(["mv", old, new])
+                                      for old, new in skewer_filename_pairs]
+        # Rename the logfile.
+        # skewer_filename_pairs.append(
+        #     ("{}-trimmed.log".format(out_fastq_pre), trimLog))
+
+        # Pypiper submits the commands serially.
+        cmd = [trimming_command] + trimming_renaming_commands
+
+    else:
+        # Default to trimmomatic.
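All three trimmer branches assemble their command from a chunk list via pypiper's `build_command`. Its behavior, as inferred from the calls in this file (this is not the library source), is roughly:

```python
def build_command_sketch(chunks):
    """Inferred behavior: drop None, join (flag, value) tuples, keep strings."""
    parts = []
    for chunk in chunks:
        if chunk is None:
            continue
        if isinstance(chunk, tuple):
            flag, value = chunk
            parts.append("{} {}".format(flag, value))
        else:
            parts.append(str(chunk))
    return " ".join(parts)

print(build_command_sketch([("-f", "sanger"), "--quiet", None]))
# -f sanger --quiet
```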
+        # Inside the container, the ${TRIMMOMATIC} path in tools.trimmo is
+        # not expanded, so point at the jar bundled in the image instead.
+        if pm.container is not None:
+            trim_jar = "/home/src/Trimmomatic-0.36/trimmomatic-0.36.jar"
+        else:
+            trim_jar = tools.trimmo
+        trim_cmd_chunks = [
+            "{java} -Xmx{mem} -jar {trim} {PE} -threads {cores}".format(
+                java=tools.java, mem=pm.mem,
+                trim=trim_jar,
+                PE="PE" if args.paired_end else "",
+                cores=pm.cores),
+            local_input_files[0],
+            local_input_files[1],
+            trimmed_fastq,
+            trimming_prefix + "_R1_unpaired.fq",
+            trimmed_fastq_R2 if args.paired_end else "",
+            trimming_prefix + "_R2_unpaired.fq" if args.paired_end else "",
+            "ILLUMINACLIP:" + res.adapters + ":2:30:10"
+        ]
+        cmd = build_command(trim_cmd_chunks)
+
+    pm.run(cmd, trimmed_fastq,
+           follow=ngstk.check_trim(
+               trimmed_fastq, args.paired_end, trimmed_fastq_R2,
+               fastqc_folder=os.path.join(param.outfolder, "fastqc")),
+           container=pm.container)
+
+    # Create names for FastQC results and report them; read2 output only
+    # exists for paired-end runs.
+    fastqc_r1 = os.path.join(param.outfolder, "fastqc",
+                             args.sample_name + "_R1.trim_fastqc.html")
+    pm.report_object("FastQC report r1", fastqc_r1, anchor_image=fastqc_r1)
+    if args.paired_end:
+        fastqc_r2 = os.path.join(param.outfolder, "fastqc",
+                                 args.sample_name + "_R2.trim_fastqc.html")
+        pm.report_object("FastQC report r2", fastqc_r2, anchor_image=fastqc_r2)
+
+    pm.clean_add(os.path.join(fastq_folder, "*.fq"), conditional=True)
+    pm.clean_add(os.path.join(fastq_folder, "*.log"), conditional=True)
+    #########################
+    # End adapter trimming
+    #########################
+
+    # Prepare variables for alignment step
+    unmap_fq1 = trimmed_fastq
+    unmap_fq2 = trimmed_fastq_R2
+
+    # Map to any requested prealignments
+    # We recommend mapping to chrM first for ATAC-seq data
+    pm.timestamp("### Prealignments")
+    if len(args.prealignments) == 0:
+        print("You may use `--prealignments` to align to references before "
+              "the genome alignment step. 
See docs.") + else: + print("Prealignment assemblies: " + str(args.prealignments)) + # Loop through any prealignment references and map to them sequentially + for reference in args.prealignments: + unmap_fq1, unmap_fq2 = _align_with_bt2( + args, tools, unmap_fq1, unmap_fq2, reference, + assembly_bt2=_get_bowtie2_index(res.genomes, reference), + outfolder=param.outfolder, aligndir="prealignments") + + pm.timestamp("### Map to genome") + map_genome_folder = os.path.join( + param.outfolder, "aligned_" + args.genome_assembly) + ngstk.make_dir(map_genome_folder) + + mapping_genome_bam = os.path.join( + map_genome_folder, args.sample_name + ".pe.q10.sort.bam") + mapping_genome_bam_temp = os.path.join( + map_genome_folder, args.sample_name + ".temp.bam") + unmap_genome_bam = os.path.join( + map_genome_folder, args.sample_name + "_unmap.bam") + + bt2_options = " --very-sensitive" + bt2_options += " -X 2000" + + # samtools sort needs a temporary directory + tempdir = tempfile.mkdtemp(dir=map_genome_folder) + pm.clean_add(tempdir) + + cmd = tools.bowtie2 + " -p " + str(pm.cores) + cmd += bt2_options + cmd += " --rg-id " + args.sample_name + cmd += " -x " + res.bt2_genome + if args.paired_end: + cmd += " -1 " + unmap_fq1 + " -2 " + unmap_fq2 + else: + cmd += " -U " + unmap_fq1 + cmd += " | " + tools.samtools + " view -bS - -@ 1 " + # cmd += " -f 2 -q 10" # quality and pairing filter + cmd += " | " + tools.samtools + " sort - -@ 1" + cmd += " -T " + tempdir + cmd += " -o " + mapping_genome_bam_temp + + # Split genome mapping result bamfile into two: high-quality aligned + # reads (keepers) and unmapped reads (in case we want to analyze the + # altogether unmapped reads) + cmd2 = "samtools view -q 10 -b -@ " + str(pm.cores) + " " + if args.paired_end: + # add a step to accept only reads mapped in proper pair + cmd2 += "-f 2 " + + cmd2 += mapping_genome_bam_temp + " > " + mapping_genome_bam + + def check_alignment_genome(): + ar = ngstk.count_mapped_reads(mapping_genome_bam, args.paired_end) + pm.report_result("Aligned_reads", ar) + rr = float(pm.get_stat("Raw_reads")) + tr = float(pm.get_stat("Trimmed_reads")) + pm.report_result("Alignment_rate", round(float(ar) * 100 / + float(tr), 2)) + pm.report_result("Total_efficiency", round(float(ar) * 100 / + float(rr), 2)) + + pm.run([cmd, cmd2], mapping_genome_bam, follow=check_alignment_genome, + container=pm.container) + + # Now produce the unmapped file + unmap_cmd = "samtools view -b -@ " + str(pm.cores) + if args.paired_end: + # require both read and mate unmapped + unmap_cmd += " -f 12 " + else: + # require only read unmapped + unmap_cmd += " -f 4 " + + unmap_cmd += " " + mapping_genome_bam_temp + " > " + unmap_genome_bam + pm.run(unmap_cmd, unmap_genome_bam, container=pm.container) + # Remove temporary bam file from unmapped file production + pm.clean_add(mapping_genome_bam_temp) + + pm.timestamp("### Remove dupes, build bigwig and bedgraph files") + + def estimate_lib_size(picard_log): + # In millions of reads; contributed by Ryan + # NOTE: from Picard manual: without optical duplicate counts, + # library size estimation will be inaccurate. 
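A worked example of the statistics reported by `check_alignment_genome` above, using made-up read counts:

```python
# Hypothetical counts, just to make the arithmetic concrete.
rr = 1000000.0  # Raw_reads
tr = 950000.0   # Trimmed_reads
ar = 800000.0   # Aligned_reads

alignment_rate = round(ar * 100 / tr, 2)    # -> 84.21
total_efficiency = round(ar * 100 / rr, 2)  # -> 80.0
```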
+        cmd = ("awk -F'\t' -f " + tool_path("extract_picard_lib.awk") +
+               " " + picard_log)
+        picard_est_lib_size = pm.checkprint(cmd)
+        pm.report_result("Picard_est_lib_size", picard_est_lib_size)
+
+    def post_dup_aligned_reads(picard_log):
+        # Number of aligned reads post tools.picard REMOVE_DUPLICATES
+        cmd = ("awk -F'\t' -f " +
+               tool_path("extract_post_dup_aligned_reads.awk") + " " +
+               picard_log)
+        pdar = pm.checkprint(cmd)
+        ar = float(pm.get_stat("Aligned_reads"))
+        rr = float(pm.get_stat("Raw_reads"))
+        tr = float(pm.get_stat("Trimmed_reads"))
+        if pdar == "":
+            pdar = ar
+
+        dr = float(ar) - float(pdar)
+        dar = round(float(pdar) * 100 / float(tr), 2)
+        dte = round(float(pdar) * 100 / float(rr), 2)
+        pm.report_result("Duplicate_reads", dr)
+        pm.report_result("Dedup_aligned_reads", pdar)
+        pm.report_result("Dedup_alignment_rate", dar)
+        pm.report_result("Dedup_total_efficiency", dte)
+
+    rmdup_bam = os.path.join(
+        map_genome_folder, args.sample_name + ".pe.q10.sort.rmdup.bam")
+    metrics_file = os.path.join(
+        map_genome_folder, args.sample_name + "_picard_metrics_bam.txt")
+    picard_log = os.path.join(
+        map_genome_folder, args.sample_name + "_picard_metrics_log.txt")
+    # NOTE: the tools.picard command is generated outside the container, so
+    # the ${PICARD} variable is never expanded in there; running with a
+    # docker shell would expand it, but `exec` does not.
+    # One workaround (functional, but messy):
+    # if pm.container is not None:
+    #     # target is a file, not output
+    #     picard_temp = os.path.join(map_genome_folder, "picard.txt")
+    #     cmd = "printenv PICARD >" + picard_temp
+    #     pm.run(cmd, picard_temp, container=pm.container, clean=True)
+    #     cmd = "cat " + picard_temp
+    #     picard = pm.checkprint(cmd).rstrip()
+    #     cmd3 = (tools.java + " -Xmx" + str(pm.javamem) + " -jar " +
+    #             picard + " MarkDuplicates")
+    # The hard-coded path used below also works, but is fragile.
+    # TODO: Alternative thought: in pypiper, check if the command uses shell,
+    # then check if the command contains pipes; if it does, split
+    # on those pipes and add a singularity exec instance:// before
+    # each command, and let the host shell do the piping. 
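A sketch of the TODO above: split a piped command on `|` and prefix each stage with a container exec call, letting the host shell do the piping. The helper and the instance name are hypothetical, not pypiper API:

```python
def containerize_pipes(cmd, prefix="singularity exec instance://pepatac"):
    # naive split; a real version would need to respect quoted pipe characters
    stages = [s.strip() for s in cmd.split("|")]
    return " | ".join("{} {}".format(prefix, s) for s in stages)

print(containerize_pipes("bowtie2 -x idx -U in.fq | samtools view -bS -"))
# singularity exec instance://pepatac bowtie2 -x idx -U in.fq | \
#     singularity exec instance://pepatac samtools view -bS -
```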
+ if pm.container is not None: + cmd3 = (tools.java + " -Xmx" + str(pm.javamem) + " -jar " + + "/home/tools/bin/picard.jar" + " MarkDuplicates") + else: + cmd3 = (tools.java + " -Xmx" + str(pm.javamem) + " -jar " + + tools.picard + " MarkDuplicates") + cmd3 += " INPUT=" + mapping_genome_bam + cmd3 += " OUTPUT=" + rmdup_bam + cmd3 += " METRICS_FILE=" + metrics_file + cmd3 += " VALIDATION_STRINGENCY=LENIENT" + cmd3 += " ASSUME_SORTED=true REMOVE_DUPLICATES=true > " + picard_log + cmd4 = tools.samtools + " index " + rmdup_bam + + pm.run([cmd3, cmd4], rmdup_bam, + follow=lambda: post_dup_aligned_reads(metrics_file), + container=pm.container) + + # shift bam file and make bigwig file + # this script is only compatible with paired-end at the moment + if args.paired_end: + shift_bed = os.path.join( + map_genome_folder, args.sample_name + ".pe.q10.sort.rmdup.bed") + cmd = tool_path("bam2bed_shift.pl") + " " + rmdup_bam + pm.run(cmd, shift_bed, container=pm.container) + bedGraph = os.path.join(map_genome_folder, args.sample_name + + ".pe.q10.sort.rmdup.bedGraph") + cmd = tools.bedtools + " genomecov -bg -split" + cmd += " -i " + shift_bed + " -g " + res.chrom_sizes + " > " + bedGraph + norm_bedGraph = os.path.join( + map_genome_folder, args.sample_name + + ".pe.q10.sort.rmdup.norm.bedGraph") + sort_bedGraph = os.path.join( + map_genome_folder, args.sample_name + + ".pe.q10.sort.rmdup.norm.sort.bedGraph") + cmd2 = "{} {} {}".format(tool_path("norm_bedGraph.pl"), + bedGraph, norm_bedGraph) + bw_file = os.path.join( + map_genome_folder, args.sample_name + ".pe.q10.rmdup.norm.bw") + + # bedGraphToBigWig requires lexicographical sort, which puts chr10 + # before chr2, for example + # NOTE: original cmd3 is NOT container friendly...use bedSort instead + #cmd3 = ("LC_COLLATE=C sort -k1,1 -k2,2n " + norm_bedGraph + " > " + + # sort_bedGraph) + cmd3 = tools.bedSort + " " + norm_bedGraph + " " + sort_bedGraph + cmd4 = (tools.bedGraphToBigWig + " " + sort_bedGraph + " " + + res.chrom_sizes + " " + bw_file) + pm.run([cmd, cmd2, cmd3, cmd4], bw_file, nofail=True, + container=pm.container) + + # "Exact cuts" are what I call nucleotide-resolution tracks of exact bases + # where the transposition (or DNAse cut) happened; + # In the past I used wigToBigWig on a combined wig file, but this ends up + # using a boatload of memory (more than 32GB); in contrast, running the + # wig -> bw conversion on each chrom and then combining them with bigWigCat + # requires much less memory. This was a memory bottleneck in the pipeline. + pm.timestamp("### Computing exact sites") + exact_folder = os.path.join(map_genome_folder + "_exact") + temp_exact_folder = os.path.join(exact_folder, "temp") + ngstk.make_dir(exact_folder) + ngstk.make_dir(temp_exact_folder) + temp_target = os.path.join(temp_exact_folder, "flag_completed") + exact_target = os.path.join(exact_folder, args.sample_name + "_exact.bw") + + # this is the old way to do it (to be removed) + # cmd = tool_path("bedToExactWig.pl") + # cmd += " " + shift_bed + # cmd += " " + res.chrom_sizes + # cmd += " " + temp_exact_folder + # cmd2 = "touch " + temp_target + # pm.run([cmd, cmd2], temp_target, container=pm.container) + + # # Aside: since this bigWigCat command uses shell expansion, pypiper + # # cannot profile memory. We could use glob.glob if we want to preserve + # # memory; like so: glob.glob(os.path.join(...)). But I don't because + # # bigWigCat uses little memory (<10GB). + # # This also sets us up nicely to process chromosomes in parallel. 
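With placeholder file names, the four-step signal-track build above reduces to this command chain (the `norm_bedGraph.pl` usage follows the call shown above):

```python
# Placeholder paths; chrom_sizes comes from the refgenie assembly folder.
chrom_sizes = "hg19.chromSizes"
cmds = [
    "bedtools genomecov -bg -split -i sample.bed -g {} > sample.bedGraph".format(chrom_sizes),
    "norm_bedGraph.pl sample.bedGraph sample.norm.bedGraph",
    "bedSort sample.norm.bedGraph sample.norm.sort.bedGraph",  # container-safe sort
    "bedGraphToBigWig sample.norm.sort.bedGraph {} sample.bw".format(chrom_sizes),
]
```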
+ # cmd = (tools.bigWigCat + " " + exact_target + " " + + # os.path.join(temp_exact_folder, "*.bw")) + # pm.run(cmd, exact_target, container=pm.container) + # pm.clean_add(os.path.join(temp_exact_folder, "*.bw")) + + cmd = tool_path("bamSitesToWig.py") + cmd += " -b" # request bed output + cmd += " -i " + rmdup_bam + cmd += " -c " + res.chrom_sizes + cmd += " -o " + exact_target + cmd += " -p " + str(max(1, int(pm.cores) * 2/3)) + cmd2 = "touch " + temp_target + pm.run([cmd, cmd2], temp_target, container=pm.container) + + if not args.paired_end: + # TODO, make this always (not just single-end) + shift_bed = exact_target + ".bed" + + # TSS enrichment + if not os.path.exists(res.TSS_file): + print("Skipping TSS -- TSS enrichment requires TSS annotation file: {}" + .format(res.TSS_file)) + else: + pm.timestamp("### Calculate TSS enrichment") + QC_folder = os.path.join(param.outfolder, "QC_" + args.genome_assembly) + ngstk.make_dir(QC_folder) + + Tss_enrich = os.path.join(QC_folder, args.sample_name + + ".TssEnrichment") + cmd = tool_path("pyTssEnrichment.py") + cmd += " -a " + rmdup_bam + " -b " + res.TSS_file + " -p ends" + cmd += " -c " + str(pm.cores) + cmd += " -e 2000 -u -v -s 4 -o " + Tss_enrich + pm.run(cmd, Tss_enrich, nofail=True, container=pm.container) + + # Call Rscript to plot TSS Enrichment + Tss_plot = os.path.join(QC_folder, args.sample_name + + ".TssEnrichment.pdf") + cmd = ("Rscript " + + tool_path("PEPATAC_TSSenrichmentPlot.R")) + cmd += " " + Tss_enrich + " pdf" + pm.run(cmd, Tss_plot, nofail=True, container=pm.container) + + # Always plot strand specific TSS enrichment. + # added by Ryan 2/10/17 to calculate TSS score as numeric and to + # include in summary stats. This could be done in prettier ways which + # I'm open to. Just adding for the idea. + with open(Tss_enrich) as f: + floats = map(float, f) + try: + # If the TSS enrichment is 0, don't report + Tss_score = ((sum(floats[1950:2050]) / 100) / + (sum(floats[1:200]) / 200)) + pm.report_result("TSS_Score", Tss_score) + except ZeroDivisionError: + #print("ZeroDivisionError: {0}".format(err)) + pass + try: + # Just wrapping this in a try temporarily so that old versions of + # pypiper will work. 
v0.6 release of pypiper adds this function
+            Tss_png = os.path.join(QC_folder, args.sample_name +
+                                   ".TssEnrichment.png")
+            pm.report_object("TSS enrichment", Tss_plot, anchor_image=Tss_png)
+        except:
+            pass
+
+        # fragment distribution
+        fragL = os.path.join(QC_folder, args.sample_name + ".fragLen.txt")
+        frag_dist_tool = tool_path("fragment_length_dist.pl")
+        cmd = build_command([tools.perl, frag_dist_tool, rmdup_bam, fragL])
+        frag_length_counts_file = args.sample_name + ".frag_count.txt"
+        fragL_count = os.path.join(QC_folder, frag_length_counts_file)
+        cmd1 = "sort -n " + fragL + " | uniq -c > " + fragL_count
+        fragL_dis1 = os.path.join(QC_folder, args.sample_name +
+                                  ".fragL.distribution.pdf")
+        fragL_dis2 = os.path.join(QC_folder, args.sample_name +
+                                  ".fragL.distribution.txt")
+        cmd2 = build_command(
+            [tools.Rscript, tool_path("fragment_length_dist.R"),
+             fragL, fragL_count, fragL_dis1, fragL_dis2])
+        pm.run([cmd, cmd1, cmd2], fragL_dis1, nofail=True,
+               container=pm.container)
+        try:
+            fragL_png = os.path.join(QC_folder, args.sample_name +
+                                     ".fragL.distribution.png")
+            pm.report_object("Fragment distribution", fragL_dis1,
+                             anchor_image=fragL_png)
+        except:
+            pass
+
+    # Peak calling
+
+    pm.timestamp("### Call peaks")
+
+    def report_peak_count():
+        num_peaksfile_lines = int(ngstk.count_lines(peak_output_file).strip())
+        num_peaks = max(0, num_peaksfile_lines - 1)
+        pm.report_result("Peak_count", num_peaks)
+
+    peak_folder = os.path.join(param.outfolder, "peak_calling_" +
+                               args.genome_assembly)
+    ngstk.make_dir(peak_folder)
+    peak_output_file = os.path.join(peak_folder, args.sample_name +
+                                    "_peaks.narrowPeak")
+    peak_input_file = shift_bed
+
+    if args.peak_caller == "fseq":
+        fseq_cmd_chunks = [tools.fseq, ("-o", peak_folder)]
+
+        # Parse only a subset of fseq options.
+        for fseq_opt in ["of", "l", "t", "s"]:
+            fseq_value = param.fseq[fseq_opt]
+            # TODO: use more natural try/except once PipelineManager parameters
+            # AD is strict.
+            if fseq_value == fseq_opt:
+                # Non-strict pipeline parameters AttributeDict returns key
+                # itself if missing.
+                continue
+            # We're building a command, so even non-text values need no special
+            # handling.
+            fseq_optval = ("-{}".format(fseq_opt), fseq_value)
+            fseq_cmd_chunks.append(fseq_optval)
+
+        # Create the peak calling command
+        fseq_cmd = build_command(fseq_cmd_chunks)
+
+        # Create the file merge command; the per-chromosome peak files are
+        # registered for cleanup rather than removed with a separate command.
+        chrom_peak_files = os.path.join(peak_folder, "*.npf")
+        merge_chrom_peaks_files = (
+            "cat {peakfiles} > {combined_peak_file}"
+            .format(peakfiles=chrom_peak_files,
+                    combined_peak_file=peak_output_file))
+        pm.clean_add(chrom_peak_files)
+
+        # Pypiper serially executes the commands.
+        cmd = [fseq_cmd, merge_chrom_peaks_files]
+
+    else:
+        # MACS2
+        macs_cmd_chunks = [
+            "{} callpeak".format(tools.macs2),
+            ("-t", peak_input_file),
+            "-f BED",
+            ("-g", args.genome_size),
+            ("--outdir", peak_folder),
+            ("-n", args.sample_name),
+            ("-q", param.macs2.q),
+            ("--shift", param.macs2.shift),
+            "--nomodel"
+        ]
+        # Note: required input file is non-positional ("treatment" file -t)
+        cmd = build_command(macs_cmd_chunks)
+
+    # Call peaks and report peak count.
+    pm.run(cmd, peak_output_file, follow=report_peak_count,
+           container=pm.container)
+
+    # Filter peaks in blacklist. 
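For reference, the MACS2 chunks above assemble into a single command along these lines; the `-q` and `--shift` values are read from the pipeline config, so the ones shown here are placeholders:

```python
# Placeholder sample name, genome, and parameter values.
macs2_cmd = ("macs2 callpeak -t sample.pe.q10.sort.rmdup.bed -f BED -g hs"
             " --outdir peak_calling_hg19 -n sample"
             " -q 0.01 --shift 0 --nomodel")
```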
+    # Filter peaks in blacklist.
+    if os.path.exists(res.blacklist):
+        filter_peak = os.path.join(peak_folder, args.sample_name +
+                                   "_peaks.narrowPeak.rmBlacklist")
+        cmd = (tools.bedtools + " intersect -a " + peak_output_file +
+               " -b " + res.blacklist + " -v > " + filter_peak)
+
+        pm.run(cmd, filter_peak, container=pm.container)
+
+    pm.timestamp("### Calculate fraction of reads in peaks (FRIP)")
+
+    frip = calc_frip(rmdup_bam, peak_output_file, frip_func=ngstk.simple_frip,
+                     pipeline_manager=pm)
+    pm.report_result("FRIP", frip)
+
+    if args.frip_ref_peaks and os.path.exists(args.frip_ref_peaks):
+        # Use an external reference set of peaks instead of the peaks
+        # called from this run.
+        frip_ref = calc_frip(rmdup_bam, args.frip_ref_peaks,
+                             frip_func=ngstk.simple_frip, pipeline_manager=pm)
+        pm.report_result("FRIP_ref", frip_ref)
+
+    pm.stop_pipeline()
+
+
+if __name__ == '__main__':
+    pm = None
+    # TODO: remove once ngstk becomes less instance-y, more function-y.
+    ngstk = None
+    try:
+        sys.exit(main())
+    except KeyboardInterrupt:
+        print("Pipeline aborted.")
+        sys.exit(1)
diff --git a/pipelines/ATACseq.yaml b/pipelines/pepatac.yaml
similarity index 79%
rename from pipelines/ATACseq.yaml
rename to pipelines/pepatac.yaml
index 211c175a..ba2fa6fb 100644
--- a/pipelines/ATACseq.yaml
+++ b/pipelines/pepatac.yaml
@@ -1,4 +1,4 @@
-# Configuration file for ATACseq pipeline based on pypiper
+# PEPATAC configuration file for an ATACseq pipeline based on pypiper

 # basic tools
 # public tools
@@ -10,15 +10,18 @@ tools:  # absolute paths to required tools
   bowtie2: bowtie2
   fastqc: fastqc
   macs2: macs2
-  fseq: fseq
   picard: ${PICARD}
-  trimmo: ${TRIMMOMATIC}
+  skewer: skewer
+  perl: perl
   # ucsc tools
   bedGraphToBigWig: bedGraphToBigWig
   wigToBigWig: wigToBigWig
   bigWigCat: bigWigCat
-  Rscript: Rscript
-  perl: perl
+  bedSort: bedSort
+  # optional tools
+  fseq: fseq
+  trimmo: ${TRIMMOMATIC}
+  Rscript: Rscript

 # user configure
 resources:
@@ -36,4 +39,4 @@ parameters:  # parameters passed to bioinformatic tools, subclassed by tool
     of: npf  # narrowPeak as output format
     l: 600  # feature length
     t: 4.0  # "threshold" (standard deviations)
-    s: 1  # wiggle track step
+    s: 1  # wiggle track step
\ No newline at end of file
diff --git a/pipelines/ATACseq_backup.yaml b/pipelines/pepatac_backup.yaml
similarity index 96%
rename from pipelines/ATACseq_backup.yaml
rename to pipelines/pepatac_backup.yaml
index c5bb83d3..3eda53c3 100644
--- a/pipelines/ATACseq_backup.yaml
+++ b/pipelines/pepatac_backup.yaml
@@ -1,4 +1,4 @@
-# configure file for ATACseq pipeline based on pypiper
+# PEPATAC configuration file for an ATACseq pipeline based on pypiper

 #system configure
 tools:
diff --git a/tests/test_build_command.py b/tests/test_build_command.py
index fb5dd26c..e5a44b08 100644
--- a/tests/test_build_command.py
+++ b/tests/test_build_command.py
@@ -9,7 +9,7 @@ TEST_PATH = os.path.dirname(__file__)
 ATAC_PATH = os.path.join(TEST_PATH, "..", "pipelines")
 sys.path.append(ATAC_PATH)

-from ATACseq import build_command
+from pepatac import build_command

 __author__ = "Vince Reuter"
diff --git a/tools/ATAC_Looper_Summary_plot.R b/tools/ATAC_Looper_Summary_plot.R
deleted file mode 100644
index e5901796..00000000
--- a/tools/ATAC_Looper_Summary_plot.R
+++ /dev/null
@@ -1,270 +0,0 @@
-################################################################################################################
-#5/18/17
-#Last Updated 6/22/17
-#Ryan Corces
-#ATAC_Looper_Summary_plot.R
-#
-#This program is meant to plot multiple summary graphs from the summary table made by the Looper
summarize command -#NOTES: -#usage: Rscript /path/to/Rscript/ATAC_Looper_Summary_plot.R /path/to/ATACseq_stats_summary.tsv -# -#requirements: ggplot2 -# -################################################################################################ -## Load dependencies -suppressPackageStartupMessages(require(ggplot2)) -suppressPackageStartupMessages(require(gplots)) -suppressPackageStartupMessages(require(grid)) -suppressPackageStartupMessages(require(reshape2)) -################################################################################################ -#Global Variables - -#Minimum TSS score to yield "passing"- This would need to be changed if you used a highly different TSS file -TSS_CUTOFF <- 6 - -#theme for all plots -alignTheme<-theme( - plot.background = element_blank(), - panel.grid.major = element_blank(), - panel.grid.minor = element_blank(), - panel.border = element_rect(colour = "black", fill=NA, size=0.5), - panel.background = element_blank(), - axis.line = element_blank(), - axis.text.x = element_text(face = "plain", color = "black", size = 20, hjust = 0.5), - axis.text.y = element_text(face = "plain", color = "black", size = 20, hjust = 1), - axis.title.x = element_text(face = "plain", color = "black", size = 22, hjust = 0.5, vjust=0.5), - axis.title.y = element_text(face = "plain", color = "black", size = 22, hjust = 0.5), - plot.title = element_text(face="bold", color = "black", size=12, hjust=0.5), - axis.ticks = element_line(size = 0.5), - axis.ticks.length = unit(2, "mm"), - legend.background = element_rect(fill="transparent"), - legend.text = element_text(size=16), - legend.title = element_blank() -) - -#the colors used for the alignment plots. These colors get altered by the code if certain pre-alignments werent performed -alignColors <- data.frame(Unaligned="gray15",AlphaSat="gray45",Alu="gray60",rDNA="gray75",Repeats="grey90",Mito="#F9655D",Other="royalblue1", stringsAsFactors = FALSE) -################################################################################################ -#Functions - -#Taken from https://stackoverflow.com/questions/6461209/how-to-round-up-to-the-nearest-10-or-100-or-x -roundUpNice <- function(x, nice=c(1,2,3,4,5,6,7,8,9,10)) { - if(length(x) != 1) stop("'x' must be of length 1") - 10^floor(log10(x)) * nice[[which(x <= 10^floor(log10(x)) * nice)[[1]]]] -} - -#Taken from https://github.com/baptiste/egg/blob/master/R/set_panel_size.r -set_panel_size <- function(p=NULL, g=ggplotGrob(p), file=NULL, - margin = unit(1,"in"), - width=unit(4, "in"), - height=unit(4, "in")){ - - panels <- grep("panel", g$layout$name) - panel_index_w<- unique(g$layout$l[panels]) - panel_index_h<- unique(g$layout$t[panels]) - nw <- length(panel_index_w) - nh <- length(panel_index_h) - - if(getRversion() < "3.3.0"){ - - # the following conversion is necessary - # because there is no `[<-`.unit method - # so promoting to unit.list allows standard list indexing - g$widths <- grid:::unit.list(g$widths) - g$heights <- grid:::unit.list(g$heights) - - g$widths[panel_index_w] <- rep(list(width), nw) - g$heights[panel_index_h] <- rep(list(height), nh) - - } else { - - g$widths[panel_index_w] <- rep(width, nw) - g$heights[panel_index_h] <- rep(height, nh) - - } - - if(!is.null(file)) - ggsave(file, g, limitsize = FALSE, - width = convertWidth(sum(g$widths) + margin, - unitTo = "in", valueOnly = TRUE), - height = convertHeight(sum(g$heights) + margin, - unitTo = "in", valueOnly = TRUE)) - - invisible(g) -} - 
-################################################################################################ -args <- commandArgs(TRUE) -summaryFile <- args[1] - -setwd(dirname(summaryFile)) - -if (is.null(summaryFile)) { - # print usage information if summary.tsv file is not provided - cat("\n Usage: Rscript /path/to/ATAC_Looper_Summary_plot.R /path/to/ATACseq_stats_summary.tsv \n") -} - -write(paste("\nGenerating plots in pdf format with ", summaryFile, sep=""), stdout()) - -#read in stats summary file -stats <- read.delim(summaryFile, header=TRUE, row.names=NULL, as.is=TRUE, check.names=FALSE, sep="\t", quote="") - -#Fix absent values in table to zero -stats[is.na(stats)] <- 0 -stats[is.null(stats)] <- 0 -stats[stats==""] <- 0 -stats$Picard_est_lib_size[stats$Picard_est_lib_size=="Unknown"] <- 0 -#if pre-alignments were not performed, add columns containing 0 to avoid missing values -#AlphaSat -if (is.null(stats$Aligned_reads_human_alphasat)) -{ - stats <- cbind.data.frame(stats,Aligned_reads_human_alphasat=rep(0,nrow(stats))) - alignColors$AlphaSat <- "white" -} -if (is.null(stats$Alignment_rate_human_alphasat)) -{ - stats <- cbind.data.frame(stats,Alignment_rate_human_alphasat=rep(0,nrow(stats))) -} -#rCDRS -if (is.null(stats$Aligned_reads_rCRSd)) -{ - stats <- cbind.data.frame(stats,Aligned_reads_rCRSd=rep(0,nrow(stats))) - alignColors$Mito <- "white" -} -if (is.null(stats$Alignment_rate_rCRSd)) -{ - stats <- cbind.data.frame(stats,Alignment_rate_rCRSd=rep(0,nrow(stats))) -} -#Alu -if (is.null(stats$Aligned_reads_human_alu)) -{ - stats <- cbind.data.frame(stats,Aligned_reads_human_alu=rep(0,nrow(stats))) - alignColors$Alu <- "white" -} -if (is.null(stats$Alignment_rate_human_alu)) -{ - stats <- cbind.data.frame(stats,Alignment_rate_human_alu=rep(0,nrow(stats))) -} -#rDNA -if (is.null(stats$Aligned_reads_human_rDNA)) -{ - stats <- cbind.data.frame(stats,Aligned_reads_human_rDNA=rep(0,nrow(stats))) - alignColors$rDNA <- "white" -} -if (is.null(stats$Alignment_rate_human_rDNA)) -{ - stats <- cbind.data.frame(stats,Alignment_rate_human_rDNA=rep(0,nrow(stats))) -} -#Repeats -if (is.null(stats$Aligned_reads_human_repeats)) -{ - stats <- cbind.data.frame(stats,Aligned_reads_human_repeats=rep(0,nrow(stats))) - alignColors$Repeats <- "white" -} -if (is.null(stats$Alignment_rate_human_repeats)) -{ - stats <- cbind.data.frame(stats,Alignment_rate_human_repeats=rep(0,nrow(stats))) -} - -################################################################################################ -#ALIGN RAW PLOT -alignRaw <- cbind.data.frame(sample=stats$sample_name, - Unaligned=(stats$Fastq_reads - stats$Aligned_reads_human_alphasat - stats$Aligned_reads_human_alu - stats$Aligned_reads_human_rDNA - stats$Aligned_reads_human_repeats - stats$Aligned_reads_rCRSd - stats$Aligned_reads), - AlphaSat=stats$Aligned_reads_human_alphasat, - Alu=stats$Aligned_reads_human_alu, - rDNA=stats$Aligned_reads_human_rDNA, - Repeats=stats$Aligned_reads_human_repeats, - Mito=stats$Aligned_reads_rCRSd, - Other=stats$Aligned_reads) - -alignRaw$sample <- factor(alignRaw$sample, levels = alignRaw$sample) - -meltAlignRaw <- melt (alignRaw, id.vars = "sample") -maxReads <- max(rowSums(alignRaw[,2:ncol(alignRaw)])) -upperLimit <- roundUpNice(maxReads/1000000) -chartHeight = ( length(unique(alignRaw$sample))) * 0.75 - -alignRawPlot <- ggplot(meltAlignRaw, aes(x = sample, y = as.numeric(value)/1000000)) + - geom_col(aes(fill = variable), colour="black", size = 0.25, width=0.8) + labs(x = "", y = "Number of reads (M)") + - scale_x_discrete(limits = 
rev(levels(meltAlignRaw$sample))) + - scale_y_continuous(limits = c(0,upperLimit), expand=c(0,0)) + - scale_fill_manual(values=c(Unaligned=alignColors$Unaligned,AlphaSat=alignColors$AlphaSat,Alu=alignColors$Alu,rDNA=alignColors$rDNA,Repeats=alignColors$Repeats,Mito=alignColors$Mito,Other=alignColors$Other)) + - # scale_fill_manual(values=c(AlphaSat="gray45",Alu="gray60",rDNA="gray75",Repeats="grey90",Mito="#F9655D",Other="royalblue1")) + - # scale_fill_manual(values=c(AlphaSat="#02AD74",Alu="#8027AF",rDNA="#F4912F",Repeats="grey78",Mito="#F9655D",Other="royalblue1")) + - coord_flip() + alignTheme - -set_panel_size(alignRawPlot, file=gsub(pattern=".tsv", replacement=".alignmentRaw.pdf", x=summaryFile), width=unit(8,"inches"), height=unit(chartHeight,"inches")) -################################################################################################ -#ALIGN PERCENT PLOT -alignPercent <- cbind.data.frame(sample=stats$sample_name, - Unaligned=(100 - stats$Alignment_rate_human_alphasat - stats$Alignment_rate_human_alu - stats$Alignment_rate_human_rDNA - stats$Alignment_rate_human_repeats - stats$Alignment_rate_rCRSd - stats$Alignment_rate), - AlphaSat=stats$Alignment_rate_human_alphasat, - Alu=stats$Alignment_rate_human_alu, - rDNA=stats$Alignment_rate_human_rDNA, - Repeats=stats$Alignment_rate_human_repeats, - Mito=stats$Alignment_rate_rCRSd, - Other=stats$Alignment_rate) -# alignPercent <- cbind.data.frame(Unaligned=(100 - rowSums(alignPercent[,2:ncol(alignPercent)])), alignPercent) -alignPercent$sample <- factor(alignPercent$sample, levels = alignPercent$sample) - -meltAlignPercent <- melt (alignPercent, id.vars = "sample") -upperLimit <- 103 -chartHeight = ( length(unique(alignPercent$sample))) * 0.75 - -alignPercentPlot <- ggplot(meltAlignPercent, aes(x = sample, y = as.numeric(value))) + - geom_col(aes(fill = variable), colour="black", size = 0.25, width=0.8) + labs(x = "", y = "Percent of reads") + - scale_x_discrete(limits = rev(levels(meltAlignPercent$sample))) + - scale_y_continuous(limits = c(0,upperLimit), expand=c(0,0)) + - scale_fill_manual(values=c(Unaligned=alignColors$Unaligned,AlphaSat=alignColors$AlphaSat,Alu=alignColors$Alu,rDNA=alignColors$rDNA,Repeats=alignColors$Repeats,Mito=alignColors$Mito,Other=alignColors$Other)) + - # scale_fill_manual(values=c(Unaligned="gray30",AlphaSat="gray45",Alu="gray60",rDNA="gray75",Repeats="grey90",Mito="#F9655D",Other="royalblue1")) + - # scale_fill_manual(values=c(Unaligned="gray28",AlphaSat="#02AD74",Alu="#8027AF",rDNA="#F4912F",Repeats="grey78",Mito="#F9655D",Other="royalblue1")) + - coord_flip() + alignTheme - -set_panel_size(alignPercentPlot, file=gsub(pattern=".tsv", replacement=".alignmentPercent.pdf", x=summaryFile), width=unit(8,"inches"), height=unit(chartHeight,"inches")) -################################################################################################ -#TSS PLOT -#establish red/green color scheme -redMin = 0 -redMax = TSS_CUTOFF-0.01 -redBreaks = seq(redMin,redMax,0.01) -redColors = colorpanel(length(redBreaks), "#AF0000","#E40E00","#FF7A6A") -greenMin = TSS_CUTOFF -greenMax = 30 -greenBreaks = seq(greenMin,greenMax,0.01) -greenColors = colorpanel(length(greenBreaks)-1, "#B4E896","#009405","#003B00") -TSScolors <- c(redColors,greenColors) - -#Organize data for plotting -TSSscore <- cbind.data.frame(sample=stats$sample_name, TSS=round(stats$TSS_Score, digits = 2), QCcolor=(TSScolors[round(stats$TSS_Score+0.01, digits = 2)*100])) -maxTSS <- max(stats$TSS_Score, na.rm=TRUE) -upperLimit <- roundUpNice(maxTSS) 
-chartHeight = ( length(unique(TSSscore$sample))) * 0.75 - -TSSscore$sample <- factor(TSSscore$sample, levels = TSSscore$sample) - -TSSPlot <- ggplot(TSSscore, aes(x = sample, y = as.numeric(TSS))) + - geom_bar(colour="black", size = 0.25, width=0.7, stat="identity", fill = rev(TSSscore$QCcolor)) + - geom_hline(yintercept = 6, linetype = 2, color = "grey", size = 0.25) + - labs(x = "", y = "TSS Enrichment Score") + - scale_x_discrete(limits = rev(TSSscore$sample)) + - scale_y_continuous(limits = c(0,upperLimit), expand=c(0,0)) + - coord_flip() + alignTheme - -set_panel_size(TSSPlot, file=gsub(pattern=".tsv", replacement=".TSS_Enrichment.pdf", x=summaryFile), width=unit(8,"inches"), height=unit(chartHeight,"inches")) -################################################################################################ -#LIBRARY SIZE PLOT -PicardLibSize <- cbind.data.frame(sample=stats$sample_name, LibSize=(as.numeric(stats$Picard_est_lib_size)/1000000)) -maxSize <- max(PicardLibSize$LibSize) -upperLimit <- roundUpNice(maxSize) -chartHeight = ( length(unique(PicardLibSize$sample))) * 0.75 - -PicardLibSize$sample <- factor(PicardLibSize$sample, levels = PicardLibSize$sample) - -LibSizePlot <- ggplot(PicardLibSize, aes(x = sample, y = as.numeric(LibSize))) + - geom_col(colour="black", size = 0.25, width=0.8, fill = "royalblue1") + labs(x = "", y = "Picard Library Size (M)") + - scale_x_discrete(limits = rev(levels(PicardLibSize$sample))) + - scale_y_continuous(limits = c(0,upperLimit), expand=c(0,0)) + - coord_flip() + alignTheme - -set_panel_size(LibSizePlot, file=gsub(pattern=".tsv", replacement=".LibSize.pdf", x=summaryFile), width=unit(8,"inches"), height=unit(chartHeight,"inches")) -################################################################################################ diff --git a/tools/ATAC_Rscript_TSSenrichmentPlot_pyPiper.R b/tools/ATAC_Rscript_TSSenrichmentPlot_pyPiper.R deleted file mode 100644 index ff4e1a85..00000000 --- a/tools/ATAC_Rscript_TSSenrichmentPlot_pyPiper.R +++ /dev/null @@ -1,96 +0,0 @@ -################################################################################################################ -#5/18/17 -#Last Updated 6/22/17 -#Ryan Corces -#ATAC_Rscript_TSSenrichmentPlot_pyPiper.R -# -#This program is meant to plot multiple TSS enrichments on the same graph in R -#NOTES: -#usage: Rscript /path/to/Rscript/ATAC_Rscript_TSSenrichmentPlot.R --TSSfile /path/to/file.TssEnrichment --outputType < pdf (default) / png > -# -#requirements: ggplot2 -# -################################################################################################ -#Global Variables -TSS_CUTOFF <- 6 # Minimum TSS score to yield "passing"- This would need to be changed if you used a highly different TSS file -################################################################################################ -## Load dependencies -# suppressPackageStartupMessages(require(optparse)) -suppressPackageStartupMessages(require(ggplot2)) -################################################################################################ -## uses optparse package to read in command line arguments -# option_list <- list( -# make_option(c("--TSSfile"), action="store", default=NULL), -# make_option(c("--outputType"), action="store", default="pdf") -# ) - -# opt = parse_args(OptionParser(option_list=option_list)) - -args <- commandArgs(TRUE) - -TSSfile <- args[1] -outputType <- args[2] - -if (is.null(TSSfile)) { - # print usage information if .TssEnrichment file is not provided - cat("\n Usage: Rscript 
/path/to/ATAC_Rscript_TSSenrichmentPlot.R /path/to/file.TssEnrichment outputType \n - where outputType is either pdf or png - .TssEnrichment file is required \n - output Type is optional (default pdf) \n\n") -} - -write(paste("\nGenerating TSS plot in ",outputType," format with ", TSSfile, sep=""), stdout()) - -t1<-theme( - plot.background = element_blank(), - panel.grid.major = element_blank(), - panel.grid.minor = element_blank(), - panel.border = element_rect(colour = "black", fill=NA, size=0.5), - panel.background = element_blank(), - axis.line = element_blank(), - axis.text.x = element_text(face = "plain", color = "black", size = 20, hjust = 0.5), - axis.text.y = element_text(face = "plain", color = "black", size = 20, hjust = 0.5), - axis.title.x = element_text(face = "plain", color = "black", size = 22, hjust = 0.5, vjust=0.5), - axis.title.y = element_text(face = "plain", color = "black", size = 22, hjust = 0.5), - plot.title = element_text(face="bold", color = "black", size=12, hjust=0.5), - legend.position="none", - axis.ticks.length = unit(2, "mm") -) - -insertionsMat <- read.table(TSSfile, header=FALSE, row.names=NULL, as.is=TRUE, check.names=FALSE) -normTSS <- insertionsMat / mean(insertionsMat[1:200,]) -colnames(normTSS) <- c("score") -TSSscore <- round(mean(normTSS[1950:2050,]),1) - -lineColor <- "red2" -if (TSSscore > TSS_CUTOFF) -{ - lineColor <- "springgreen4" -} - -if (outputType == "png") { - png(filename = gsub(pattern=".TssEnrichment", replacement=".TssEnrichment.png", x=TSSfile), width = 480, height = 480) - pre <- ggplot(normTSS, aes(x=(as.numeric(rownames(normTSS))-2000), y=score, group=1, colour="black")) + - geom_hline(yintercept = 6, linetype = 2, color = "grey", size = 0.25) + - geom_smooth(method="loess", span=0.02, se=FALSE, colour=lineColor) + labs(x = "Distance from TSS (bp)", y = "TSS Enrichment Score") - pre + t1 + scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + - coord_cartesian(xlim=c(-2300,2300), ylim=c(0,32)) + #ylim(0,30) + xlim(-2100,2100) + - annotate("rect", xmin=1200, xmax=2300, ymin=27, ymax=32, fill="gray95", size = 0.5) + - annotate("text", x=1750, y=31, label="TSS Score", fontface = 1, size=6, hjust=0.5) + - annotate("text", x=1750, y=29, label=TSSscore, fontface = 2, size=10, hjust=0.5) -} else { - pdf(file = gsub(pattern=".TssEnrichment", replacement=".TssEnrichment.pdf", x=TSSfile), width= 7, height = 7, useDingbats=F) - pre <- ggplot(normTSS, aes(x=(as.numeric(rownames(normTSS))-2000), y=score, group=1, colour="black")) + - geom_hline(yintercept = 6, linetype = 2, color = "grey", size = 0.25) + - geom_smooth(method="loess", span=0.02, se=FALSE, colour=lineColor) + labs(x = "Distance from TSS (bp)", y = "TSS Enrichment Score") - pre + t1 + scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + - coord_cartesian(xlim=c(-2300,2300), ylim=c(0,32)) + #ylim(0,30) + xlim(-2100,2100) + - annotate("rect", xmin=1200, xmax=2300, ymin=27, ymax=32, fill="gray95", size = 0.5) + - annotate("text", x=1750, y=31, label="TSS Score", fontface = 1, size=6, hjust=0.5) + - annotate("text", x=1750, y=29, label=TSSscore, fontface = 2, size=10, hjust=0.5) -} -invisible(dev.off()) - -write("Completed!\n", stdout()) - -################################################################################################ \ No newline at end of file diff --git a/tools/PEPATAC_TSSenrichmentPlot.R b/tools/PEPATAC_TSSenrichmentPlot.R new file mode 100644 index 00000000..af7513e9 --- /dev/null +++ b/tools/PEPATAC_TSSenrichmentPlot.R @@ 
-0,0 +1,135 @@ +############################################################################### +#5/18/17 +#Last Updated 06/25/18 +#Original Author: Ryan Corces +#Updated by: Jason Smith +#PEPATAC_TSSenrichmentPlot.R +# +#This program is meant to plot multiple TSS enrichments on the same graph in R +#Produces both pdf and png for easy html embedding +# +#NOTES: +#usage: Rscript /path/to/Rscript/PEPATAC_TSSenrichmentPlot.R --TSSfile +# /path/to/file.TssEnrichment +# +#requirements: ggplot2 +# +############################################################################### +#Global Variables +# Minimum TSS score to yield "passing" -- +# This would need to be changed if you used a highly different TSS file +TSS_CUTOFF <- 6 +############################################################################### +## Load dependencies +# suppressPackageStartupMessages(require(optparse)) +warnSetting <- getOption("warn") +# Briefly ignore any package warnings +options(warn = -1) +suppressPackageStartupMessages(require(ggplot2)) +options(warn = warnSetting) +############################################################################### +## uses optparse package to read in command line arguments +# option_list <- list( +# make_option(c("--TSSfile"), action="store", default=NULL), +# make_option(c("--outputType"), action="store", default="pdf") +# ) + +# opt = parse_args(OptionParser(option_list=option_list)) + +args <- commandArgs(TRUE) + +TSSfile <- args[1] +#outputType <- args[2] + +if (is.null(TSSfile)) { + # print usage information if .TssEnrichment file is not provided + cat("\n Usage: Rscript /path/to/PEPATAC_TSSenrichmentPlot.R + /path/to/file.TssEnrichment \n + .TssEnrichment file is required \n\n") +} + +#write(paste("\nGenerating TSS plot in ",outputType," format with ", +# TSSfile, sep=""), stdout()) +write(paste("\nGenerating TSS plot with ", TSSfile, sep=""), stdout()) + +t1<-theme( + plot.background = element_blank(), + panel.grid.major = element_blank(), + panel.grid.minor = element_blank(), + panel.border = element_rect(colour = "black", fill=NA, size=0.5), + panel.background = element_blank(), + axis.line = element_blank(), + axis.text.x = element_text(face = "plain", color = "black", + size = 20, hjust = 0.5), + axis.text.y = element_text(face = "plain", color = "black", + size = 20, hjust = 0.5), + axis.title.x = element_text(face = "plain", color = "black", size = 22, + hjust = 0.5, vjust=0.5), + axis.title.y = element_text(face = "plain", color = "black", size = 22, + hjust = 0.5), + plot.title = element_text(face="bold", color = "black", size=12, + hjust=0.5), + legend.position="none", + axis.ticks.length = unit(2, "mm") +) + +insertionsMat <- read.table(TSSfile, header=FALSE, row.names=NULL, + as.is=TRUE, check.names=FALSE) +normTSS <- insertionsMat / mean(insertionsMat[1:200,]) +colnames(normTSS) <- c("score") +TSSscore <- round(mean(normTSS[1950:2050,]),1) +if (is.nan(TSSscore)) { + message(paste("\nNaN produced. 
Check ", TSSfile, "\n", sep="")) + quit() +} +lineColor <- "red2" +if (TSSscore > TSS_CUTOFF) +{ + lineColor <- "springgreen4" +} + +png(filename = gsub(pattern=".TssEnrichment", replacement=".TssEnrichment.png", + x=TSSfile), width = 480, height = 480) +pre <- ggplot(normTSS, aes(x=(as.numeric(rownames(normTSS))-2000), y=score, + group=1, colour="black")) + + geom_hline(yintercept = 6, linetype = 2, + color = "grey", size = 0.25) + + geom_smooth(method="loess", span=0.02, + se=FALSE, colour=lineColor) + + labs(x = "Distance from TSS (bp)", y = "TSS Enrichment Score") +pre + t1 + scale_x_continuous(expand=c(0,0)) + + scale_y_continuous(expand=c(0,0)) + + coord_cartesian(xlim=c(-2300,2300), ylim=c(0,32)) + + #ylim(0,30) + xlim(-2100,2100) + + annotate("rect", xmin=1200, xmax=2300, ymin=27, ymax=32, + fill="gray95", size = 0.5) + + annotate("text", x=1750, y=31, label="TSS Score", fontface = 1, + size=6, hjust=0.5) + + annotate("text", x=1750, y=29, label=TSSscore, fontface = 2, + size=10, hjust=0.5) + +pdf(file = gsub(pattern=".TssEnrichment", replacement=".TssEnrichment.pdf", + x=TSSfile), width= 7, height = 7, useDingbats=F) +pre <- ggplot(normTSS, aes(x=(as.numeric(rownames(normTSS))-2000), y=score, + group=1, colour="black")) + + geom_hline(yintercept = 6, linetype = 2, + color = "grey", size = 0.25) + + geom_smooth(method="loess", span=0.02, + se=FALSE, colour=lineColor) + + labs(x = "Distance from TSS (bp)", y = "TSS Enrichment Score") +pre + t1 + scale_x_continuous(expand=c(0,0)) + + scale_y_continuous(expand=c(0,0)) + + coord_cartesian(xlim=c(-2300,2300), ylim=c(0,32)) + + #ylim(0,30) + xlim(-2100,2100) + + annotate("rect", xmin=1200, xmax=2300, ymin=27, ymax=32, + fill="gray95", size = 0.5) + + annotate("text", x=1750, y=31, label="TSS Score", fontface = 1, + size=6, hjust=0.5) + + annotate("text", x=1750, y=29, label=TSSscore, fontface = 2, + size=10, hjust=0.5) + +invisible(dev.off()) + +write("Completed!\n", stdout()) + +############################################################################### diff --git a/tools/PEPATAC_summarizer.R b/tools/PEPATAC_summarizer.R new file mode 100755 index 00000000..94b61ad0 --- /dev/null +++ b/tools/PEPATAC_summarizer.R @@ -0,0 +1,510 @@ +#! 
/usr/bin/env Rscript +############################################################################### +#5/18/17 +#Last Updated 06/21/2018 +#Original Author: Ryan Corces +#Updated by: Jason Smith +#PEPATAC_summarizer.R +# +#This program is meant to plot multiple summary graphs from the summary table +#made by the Looper summarize command +# +#NOTES: +#usage: Rscript /path/to/Rscript/PEPATAC_summarizer.R +# /path/to/project_config.yaml +# +#requirements: argparser, ggplot2, gplots, grid, data.table, pepr +# +############################################################################### + +##### LOAD ARGUMENTPARSER ##### +loadLibrary <- tryCatch ( + { + suppressWarnings(suppressPackageStartupMessages(library(argparser))) + }, + error=function(e) { + message("Error: Install the \"argparser\"", + " library before proceeding.") + return(NULL) + }, + warning=function(e) { + message(e) + return(TRUE) + } +) +if (length(loadLibrary)!=0) { + suppressWarnings(library(argparser)) +} else { + quit() +} + +# Create a parser +p <- arg_parser("Produce ATACseq Summary Plots") + +# Add command line arguments +p <- add_argument(p, "config", + help="PEPATAC project_config.yaml") + +# Parse the command line arguments +argv <- parse_args(p) + +############################################################################### + +##### LOAD DEPENDENCIES ##### +required_libraries <- c("ggplot2", "gplots", "grid", "data.table", "pepr") +for (i in required_libraries) { + loadLibrary <- tryCatch ( + { + suppressPackageStartupMessages( + suppressWarnings(library(i, character.only=TRUE))) + }, + error=function(e) { + message("Error: Install the \"", i, + "\" library before proceeding.") + return(NULL) + }, + warning=function(e) { + message(e) + return(1) + } + ) + if (length(loadLibrary)!=0) { + suppressWarnings(library(i, character.only=TRUE)) + } else { + quit() + } +} + +############################################################################### +##### Global Variables ##### + +# Minimum TSS score to yield "passing"- This would need to be changed if +# you used a highly different TSS file. 
+TSS_CUTOFF <- 6 + +# Theme for all plots +alignTheme<-theme( + plot.background = element_blank(), + panel.grid.major = element_blank(), + panel.grid.minor = element_blank(), + panel.border = element_rect(colour = "black", fill=NA, size=0.5), + panel.background = element_blank(), + axis.line = element_blank(), + axis.text.x = element_text(face = "plain", color = "black", + size = 20, hjust = 0.5), + axis.text.y = element_text(face = "plain", color = "black", + size = 20, hjust = 1), + axis.title.x = element_text(face = "plain", color = "black", + size = 22, hjust = 0.5, vjust=0.5), + axis.title.y = element_text(face = "plain", color = "black", + size = 22, hjust = 0.5), + plot.title = element_text(face="bold", color = "black", + size=12, hjust=0.5), + axis.ticks = element_line(size = 0.5), + axis.ticks.length = unit(2, "mm"), + legend.background = element_rect(fill="transparent"), + legend.text = element_text(size=16), + legend.title = element_blank() +) + +############################################################################### +##### Functions ##### + +# Taken from https://stackoverflow.com/questions/6461209/how-to-round-up-to-the-nearest-10-or-100-or-x +roundUpNice <- function(x, nice=c(1,2,3,4,5,6,7,8,9,10)) { + if(length(x) != 1) stop("'x' must be of length 1") + 10^floor(log10(x)) * nice[[which(x <= 10^floor(log10(x)) * nice)[[1]]]] +} + +# Taken from https://github.com/baptiste/egg/blob/master/R/set_panel_size.r +set_panel_size <- function(p=NULL, g=ggplotGrob(p), file=NULL, + margin=unit(1, "in"), + width=unit(4, "in"), + height=unit(4, "in")){ + + panels <- grep("panel", g$layout$name) + panel_index_w <- unique(g$layout$l[panels]) + panel_index_h <- unique(g$layout$t[panels]) + nw <- length(panel_index_w) + nh <- length(panel_index_h) + + if(getRversion() < "3.3.0"){ + + # the following conversion is necessary + # because there is no `[<-`.unit method + # so promoting to unit.list allows standard list indexing + g$widths <- grid:::unit.list(g$widths) + g$heights <- grid:::unit.list(g$heights) + + g$widths[panel_index_w] <- rep(list(width), nw) + g$heights[panel_index_h] <- rep(list(height), nh) + + } else { + + g$widths[panel_index_w] <- rep(width, nw) + g$heights[panel_index_h] <- rep(height, nh) + + } + + if(!is.null(file)) + ggsave(file, g, limitsize = FALSE, + width=convertWidth(sum(g$widths) + margin, + unitTo="in", valueOnly=TRUE), + height=convertHeight(sum(g$heights) + margin, + unitTo="in", valueOnly=TRUE)) + + invisible(g) +} + +# Helper function to build a file path to the correct output folder using a +# specified suffix +buildFilePath = function(suffix, pep=prj) { + file.path(config(pep)$metadata$output_dir, "summary", + paste0(config(pep)$name, suffix)) +} + +############################################################################### +##### Main ##### + +# Identify the project configuration file +configFile <- argv$config +prj = Project(configFile) + +# Build the stats summary file path +summaryFile <- file.path(config(prj)$metadata$output_dir, + paste0(config(prj)$name, "_stats_summary.tsv")) + +# Produce output directory +dir.create( + suppressMessages( + file.path(config(prj)$metadata$output_dir, "summary")), + showWarnings = FALSE) + +# read in stats summary file +if (file.exists(summaryFile)) { + stats <- suppressWarnings(fread(summaryFile, header=TRUE, + check.names=FALSE, sep="\t", quote="")) +} else { + message("PEPATAC_summarizer.R was unable to find the summary file.") + quit() +} + +# Check for prealignments +fileName <- 
config(prj)$metadata$sample_annotation +if (file.exists(fileName)) { + annotation <- fread(fileName, header=TRUE, check.names=FALSE, + sep=",", quote="") + organisms <- unique(annotation$organism) + prealignments <- list() + # Get prealignments for each organism + for (i in 1:length(organisms)) { + pre <- config(prj)$implied_columns$organism[[organisms[i]]] + prealignments <- c(prealignments, strsplit(pre$prealignments, "\\s+")) + } +} + +# confirm the prealignments exist in stats_summary.tsv +for (i in 1:length(unlist(prealignments))) { + # get number of prealignments in stats_summary.tsv file + if(length(grep("Aligned_reads_.*", colnames(stats))) != length(unlist(prealignments))){ + errorMessage <- paste("PEPATAC summarizer found additional ", + "prealignments in ", + paste(config(prj)$name, "_stats_summary.tsv", sep=""), + "\nConfirm the prealignments you performed.", + sep="", collapse="\n") + stop(errorMessage) + } else if (!(paste("Aligned_reads", unlist(prealignments)[i], sep="_") + %in% colnames(stats))) { + errorMessage <- paste(unlist(prealignments)[i], + " is not present in ", + paste(config(prj)$name, "_stats_summary.tsv", sep=""), + "\nConfirm the prealignments you performed.", + sep="", collapse="\n") + stop(errorMessage) + } +} + +message("Generating plots in both png and pdf format with ", summaryFile) + +# Set absent values in table to zero +stats[is.na(stats)] <- 0 +stats[is.null(stats)] <- 0 +stats[stats==""] <- 0 +stats$Picard_est_lib_size[stats$Picard_est_lib_size=="Unknown"] <- 0 + +############################################################################### +##### ALIGN RAW PLOT ##### +Unaligned <- stats$Fastq_reads - stats$Aligned_reads +for (i in 1:length(unlist(prealignments))) { + Unaligned <- Unaligned - + stats[, (paste("Aligned_reads", unlist(prealignments)[i], sep="_")), + with=FALSE][[1]] +} + +alignRaw <- tryCatch( + { + data.table(sample = stats$sample_name, + Unaligned = as.integer(Unaligned), + Duplicates = as.integer(stats$Duplicate_reads)) + }, + error=function(e) { + message("The summary file values for Unaligned or Duplicate reads", + " are missing.") + message("Here's the original error message: ", e) + return(NULL) + }, + warning=function(e) { + message("The summary file contains missing values for the number", + " of Unaligned or Duplicate_reads.") + message("Here's the original error message: ", e) + return(NULL) + } +) + +if (is.null(alignRaw)) { + quit() +} + +# Split counts based on genome name +genomeNames <- unique(stats$Genome) +for (i in 1:length(genomeNames)) { + rowPos <- grep(genomeNames[i], stats$Genome) + readCount <- rep(0,nrow(stats)) + reads <- stats$Aligned_reads[stats$Genome==genomeNames[i]] + for (j in 1:length(reads)) { + readCount[rowPos[j]] <- reads[j] + } + alignRaw[, (genomeNames[i]) := as.integer(readCount)] +} + +for (i in 1:length(unlist(prealignments))) { + alignRaw[, unlist(prealignments)[i] := + stats[, (paste("Aligned_reads", unlist(prealignments)[i], sep="_")), + with=FALSE][[1]]] +} + +setcolorder(alignRaw, c("sample", "Unaligned", paste(unlist(prealignments)), + "Duplicates", paste(unique(stats$Genome)))) + +alignRaw$sample <- factor(alignRaw$sample, levels = alignRaw$sample) + +meltAlignRaw <- melt(alignRaw, id.vars = "sample") +maxReads <- max(rowSums(alignRaw[,2:ncol(alignRaw)])) +upperLimit <- roundUpNice(maxReads/1000000) +chartHeight <- (length(unique(alignRaw$sample))) * 0.75 + +plotColors <- data.table(Unaligned="gray15") +moreColors <- colorpanel(length(unlist(prealignments)), + low="#FFE595", 
mid="#F6CAA6", high="#F6F2A6") +for (i in 1:length(unlist(prealignments))) { + plotColors[, unlist(prealignments)[i] := moreColors[i]] +} +plotColors[, Duplicates := "#FC1E25"] +moreColors <- colorpanel(length(genomeNames), + low="#4876FF", mid="#94D9CE", high="#7648FF") +for (i in 1:length(genomeNames)) { + plotColors[, (genomeNames[i]) := moreColors[i]] +} + +alignRawPlot <- ( + ggplot(meltAlignRaw, aes(x=sample, y=as.numeric(value)/1000000)) + + geom_col(aes(fill=variable), colour="black", size=0.25, width=0.8) + + guides(fill=guide_legend(reverse=TRUE)) + + labs(x="", y="Number of reads (M)") + + scale_x_discrete(limits=rev(levels(meltAlignRaw$sample))) + + scale_y_continuous(limits=c(0,upperLimit), expand=c(0,0)) + + scale_fill_manual(values=paste(plotColors)) + + coord_flip() + + alignTheme) + +# Produce both PDF and PNG +set_panel_size( + alignRawPlot, + file=buildFilePath("_alignmentRaw.pdf", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) +set_panel_size( + alignRawPlot, + file=buildFilePath("_alignmentRaw.png", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) + +############################################################################### +##### ALIGN PERCENT PLOT ##### +Unaligned <- 100 - stats$Alignment_rate +for (i in 1:length(unlist(prealignments))) { + Unaligned <- Unaligned - + stats[, (paste("Alignment_rate", unlist(prealignments)[i], sep="_")), + with=FALSE][[1]] +} +alignPercent <- data.table(sample=stats$sample_name, + Unaligned=Unaligned, + Duplicates=stats$Alignment_rate - + stats$Dedup_alignment_rate) + +# Split percents based on genome name +genomeNames <- unique(stats$Genome) +for (i in 1:length(genomeNames)) { + rowPos <- grep(genomeNames[i], stats$Genome) + readCount <- rep(0,nrow(stats)) + reads <- stats$Dedup_alignment_rate[stats$Genome==genomeNames[i]] + for (j in 1:length(reads)) { + readCount[rowPos[j]] <- reads[j] + } + alignPercent[, (genomeNames[i]) := as.numeric(readCount)] +} +for (i in 1:length(unlist(prealignments))) { + alignPercent[, unlist(prealignments)[i] := + stats[, (paste("Alignment_rate", unlist(prealignments)[i], sep="_")), + with=FALSE][[1]]] +} + +setcolorder(alignPercent, c("sample", "Unaligned", paste(unlist(prealignments)), + "Duplicates", paste(unique(stats$Genome)))) + +alignPercent$sample <- factor(alignPercent$sample, levels=alignPercent$sample) + +meltAlignPercent <- melt(alignPercent, id.vars="sample") +upperLimit <- 103 +chartHeight <- (length(unique(alignPercent$sample))) * 0.75 + +plotColors <- data.table(Unaligned="gray15") +moreColors <- colorpanel(length(unlist(prealignments)), + low="#FFE595", mid="#F6CAA6", high="#F6F2A6") +for (i in 1:length(unlist(prealignments))) { + plotColors[, unlist(prealignments)[i] := moreColors[i]] +} +plotColors[, Duplicates := "#FC1E25"] +moreColors <- colorpanel(length(genomeNames), + low="#4876FF", mid="#94D9CE", high="#7648FF") +for (i in 1:length(genomeNames)) { + plotColors[, (genomeNames[i]) := moreColors[i]] +} + +alignPercentPlot <- ( + ggplot(meltAlignPercent, aes(x=sample, y=as.numeric(value))) + + geom_col(aes(fill=variable), colour="black", size=0.25, width=0.8) + + guides(fill=guide_legend(reverse=TRUE)) + + labs(x="", y="Percent of reads") + + scale_x_discrete(limits=rev(levels(meltAlignPercent$sample))) + + scale_y_continuous(limits=c(0,upperLimit), expand=c(0,0)) + + scale_fill_manual(values=paste(plotColors)) + + coord_flip() + + alignTheme) + +# Produce both PDF and PNG +set_panel_size( + alignPercentPlot, + 
file=buildFilePath("_alignmentPercent.pdf", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) +set_panel_size( + alignPercentPlot, + file=buildFilePath("_alignmentPercent.png", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) + +############################################################################### +##### TSS PLOT ##### + +# Establish red/green color scheme +redMin <- 0 +redMax <- TSS_CUTOFF-0.01 +redBreaks <- seq(redMin,redMax,0.01) +redColors <- colorpanel(length(redBreaks), "#AF0000","#E40E00","#FF7A6A") +greenMin <- TSS_CUTOFF +greenMax <- 30 +greenBreaks <- seq(greenMin,greenMax,0.01) +greenColors <- colorpanel(length(greenBreaks)-1, "#B4E896","#009405","#003B00") +TSScolors <- c(redColors,greenColors) + +# Organize data for plotting +TSSscore <- tryCatch( + { + cbind.data.frame(sample=stats$sample_name, + TSS=round(stats$TSS_Score, digits=2), + QCcolor=(TSScolors[round(stats$TSS_Score+0.01, digits=2)*100])) + }, + error=function(e) { + message("The summary file value(s) for the TSS score(s)", + " is/are missing.") + message("Here's the original error message: ", e) + return(NULL) + }, + warning=function(e) { + message("The summary file value(s) for the TSS score(s)", + " is/are incomplete.") + message("Here's the original warning message: ", e) + return(NULL) + } +) + +if (is.null(TSSscore)) { + quit() +} + +maxTSS <- max(stats$TSS_Score, na.rm=TRUE) +upperLimit <- roundUpNice(maxTSS) +chartHeight <- (length(unique(TSSscore$sample))) * 0.75 + +TSSscore$sample <- factor(TSSscore$sample, levels=TSSscore$sample) + +TSSPlot <- ggplot( + TSSscore, aes(x=sample, y=as.numeric(TSS))) + + geom_bar(colour="black", size=0.25, width=0.7, stat="identity", + fill=rev(TSSscore$QCcolor)) + + geom_hline(yintercept=6, linetype=2, color="grey", size=0.25) + + labs(x="", y="TSS Enrichment Score") + + scale_x_discrete(limits=rev(TSSscore$sample)) + + scale_y_continuous(limits=c(0,upperLimit), expand=c(0,0)) + + coord_flip() + + alignTheme + +# Produce both PDF and PNG +set_panel_size( + TSSPlot, file=buildFilePath("_TSSEnrichment.pdf", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) +set_panel_size( + TSSPlot, file=buildFilePath("_TSSEnrichment.png", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) + +############################################################################### + +##### LIBRARY SIZE PLOT ##### +if (any(!is.na(stats$Picard_est_lib_size))) { + PicardLibSize <- cbind.data.frame( + sample=stats$sample_name, + LibSize=(as.numeric(stats$Picard_est_lib_size)/1000000)) + maxSize <- max(PicardLibSize$LibSize) + upperLimit <- roundUpNice(maxSize) + chartHeight <- (length(unique(PicardLibSize$sample))) * 0.75 + + PicardLibSize$sample <- factor(PicardLibSize$sample, + levels = PicardLibSize$sample) + + LibSizePlot <- ggplot(PicardLibSize, + aes(x = sample, y = as.numeric(LibSize))) + + geom_col(colour="black", size = 0.25, width=0.8, + fill = "royalblue1") + + labs(x = "", y = "Picard Library Size (M)") + + scale_x_discrete(limits = rev(levels(PicardLibSize$sample))) + + scale_y_continuous(limits = c(0,upperLimit), expand=c(0,0)) + + coord_flip() + + alignTheme + + # Produce both PDF and PNG + set_panel_size(LibSizePlot, + file=buildFilePath("_LibSize.pdf", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) + set_panel_size(LibSizePlot, + file=buildFilePath("_LibSize.png", prj), + width=unit(8,"inches"), + height=unit(chartHeight,"inches")) +} else { + quit() +} + 
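+# All figures above are written to the summary/ folder under the project's
+# output_dir as paired pdf/png files: <project>_alignmentRaw,
+# <project>_alignmentPercent, <project>_TSSEnrichment, and (when Picard
+# library size estimates are available) <project>_LibSize.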
+############################################################################### \ No newline at end of file diff --git a/tools/bamSitesToWig.py b/tools/bamSitesToWig.py index 2caf2363..1037e9bd 100755 --- a/tools/bamSitesToWig.py +++ b/tools/bamSitesToWig.py @@ -2,7 +2,7 @@ __author__ = "Nathan C. Sheffield" __credits__ = [] -__license__ = "GPL3" +__license__ = "BSD2" __version__ = "0.1" __email__ = "nathan@code.databio.org" diff --git a/tools/extract_post_dup_aligned_reads.awk b/tools/extract_post_dup_aligned_reads.awk new file mode 100644 index 00000000..a8456565 --- /dev/null +++ b/tools/extract_post_dup_aligned_reads.awk @@ -0,0 +1,28 @@ +# This script reads a picard metrics file and calculates +# the number of aligned reads AFTER duplicate removal. run it like: +# awk -F'\t' -f extract_post_dup_aligned_reads.awk metrics_file.txt + +/## METRICS CLASS/ { + for(i=1; i<=2; i++) { + getline; + c=-1; + d=-1; + # Get the columns of interest + for(j=1; j<=NF; j++) { + if ($j == "READ_PAIRS_EXAMINED") c=j + if ($j == "READ_PAIR_DUPLICATES") d=j + } + + if (c != -1 && d != -1) { + while(getline && $0 != "") { + if ($c == -1 || $d == -1) { + print "Unknown" + } else { + # Calculate total number of reads remaining after duplicate removal + t = ($c-$d)*2 + } + } + } + } + print t +} diff --git a/tools/fragment_length_dist.R b/tools/fragment_length_dist.R index 25f17c22..f5f004d0 100755 --- a/tools/fragment_length_dist.R +++ b/tools/fragment_length_dist.R @@ -1,19 +1,58 @@ -args <- commandArgs(TRUE) +############################################################################### +#Last Updated 06/05/18 +#Original Author: ? +#Updated by: Jason Smith +#fragment_length_dist.R +# +#This program is meant to plot the fragment length distribution for a sample +#Produces both pdf and png for easy html embedding +# +#NOTES: +# usage: Rscript /path/to/Rscript/fragment_length_distribution.R +# fragL fragL_count fragL_dis1 fragL_dis2 +# +# param fragL: infile containing single column of fragment lengths +# type fragL: str +# param fragL_count: counts of each fragment length identified +# type fragL_count: str +# param fragL_dis1: pdf filename +# type fragL_dis1: str +# param fragL_dis2: fragment length distribution stats file +# type fragL_dis2: str +# +#requirements: none +# +############################################################################### -infile <- args[1] +args <- commandArgs(TRUE) +infile <- args[1] infile_sort <- args[2] outfile_pdf <- args[3] +outfile_png <- gsub('pdf', 'png', outfile_pdf) outfile_txt <- args[4] -pdf(outfile_pdf) -dat <- read.table(infile_sort) +dat <- read.table(infile_sort) dat1 <- dat[dat$V2<=600,] -tmp <- seq(1:(dat1[1,2]-1)) +tmp <- seq(1:(dat1[1,2]-1)) dat0 <- data.frame(V1=rep(0,length(tmp)),V2=tmp) dat2 <- rbind(dat0, dat1) -plot(dat1$V2, dat1$V1, type="l",col=2,lwd=2, xlab="Read length", ylab="Read counts", main="Insert size distribution") + +# Save plot to pdf file +pdf(outfile_pdf) +plot(dat1$V2, dat1$V1, type="l",col=2,lwd=2, xlab="Read length", + ylab="Read counts", main="Insert size distribution") + +# Save plot to png file +png(outfile_png) +plot(dat1$V2, dat1$V1, type="l",col=2,lwd=2, xlab="Read length", + ylab="Read counts", main="Insert size distribution") + invisible(dev.off()) -dat <- read.table(infile) -summ <- data.frame(Min=min(dat$V1), Max=max(dat$V1), Median=median(dat$V1), Mean=mean(dat$V1), Stdev=sd(dat$V1)) +dat <- read.table(infile) +summ <- data.frame(Min=min(dat$V1), Max=max(dat$V1), Median=median(dat$V1), + Mean=mean(dat$V1), 
Stdev=sd(dat$V1))
+
+# Write summary table to stats file
 write.table(summ, file=outfile_txt, row.names=F, quote=F, sep="\t")
+
+###############################################################################
diff --git a/tools/pyTssEnrichment.py b/tools/pyTssEnrichment.py
index 0e3c7748..bca0b885 100755
--- a/tools/pyTssEnrichment.py
+++ b/tools/pyTssEnrichment.py
@@ -4,7 +4,7 @@
 # Modified from pyMakeVplot.py: Jason Buenrostro
 # Last updated 7/16/17: Vince Reuter
 #
-# Dependencies: Script requires ATAC_Rscript_TSSenrichmentPlot_pyPiper.R to be in the same directory
+# Dependencies: Script requires PEPATAC_TSSenrichmentPlot.R to be in the same directory
 # For pyPiper, these two scripts would be in the tools directory
 #
 # Function: Script takes as input a BAM file and a bed file of single base positions and plots the enrichment of signal around those regions
diff --git a/update_usage_docs.sh b/update_usage_docs.sh
index 6f46b02b..3cb70498 100755
--- a/update_usage_docs.sh
+++ b/update_usage_docs.sh
@@ -1,3 +1,3 @@
 # this script will auto-generate the usage docs.
-pipelines/ATACseq.py --help > usage.md 2>&1
\ No newline at end of file
+pipelines/pepatac.py --help > usage.txt 2>&1
\ No newline at end of file
diff --git a/usage.txt b/usage.txt
index db648fbe..45cb1404 100644
--- a/usage.txt
+++ b/usage.txt
@@ -1,48 +1,51 @@
-usage: ATACseq.py [-h] [-N] [-I2 INPUT_FILES2 [INPUT_FILES2 ...]]
-                  [-M MEMORY_LIMIT] [-Q SINGLE_OR_PAIRED] [-S SAMPLE_NAME]
-                  [-P NUMBER_OF_CORES] [-D] [-I INPUT_FILES [INPUT_FILES ...]]
-                  [-F] [-R] [-C CONFIG_FILE] [-O PARENT_OUTPUT_FOLDER]
-                  [-G GENOME_ASSEMBLY] [-gs GENOME_SIZE]
-                  [--frip-ref-peaks FRIP_REF_PEAKS] [--pyadapt] [--skewer]
+usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
+                  [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
+                  [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
+                  [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
+                  GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
+                  [--frip-ref-peaks FRIP_REF_PEAKS]
+                  [--peak-caller {fseq,macs2}]
+                  [--trimmer {trimmomatic,pyadapt,skewer}]
                   [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

-Pipeline
+PEPATAC version 0.7.0

 optional arguments:
   -h, --help            show this help message and exit
-  -N, --new-start       Fresh start mode, overwrite all
-  -I2 INPUT_FILES2 [INPUT_FILES2 ...], --input2 INPUT_FILES2 [INPUT_FILES2 ...]
-                        One or more secondary input files (if they exists);
-                        for example, second read in pair.
-  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
-                        Memory string for processes that accept memory limits
-                        (like java)
-  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
-                        single or paired end? default: single
-  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
-                        unique name for output subfolder and files (required)
-  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
-                        number of cores to use for parallel processes
-  -D, --dirty           Make all cleanups manual
-  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
-                        One or more primary input files (required)
-  -F, --follow          Run all follow commands, even if command is not run
-  -R, --recover         Recover mode, overwrite locks
+  -R, --recover         Overwrite locks to recover from previous failed run
+  -N, --new-start       Overwrite all results to start a fresh run
+  -D, --dirty           Don't auto-delete intermediate files
+  -F, --force-follow    Always run 'follow' commands
   -C CONFIG_FILE, --config CONFIG_FILE
-                        pipeline config file in YAML format; relative paths
-                        are considered relative to the pipeline script.
- defaults to ATACseq.yaml + Pipeline configuration file (YAML). Relative paths are + with respect to the pipeline script. -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER - parent output directory of the project (required). - -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY - identifier for genome assempbly (required) + Parent output directory of project + -M MEMORY_LIMIT, --mem MEMORY_LIMIT + Memory limit (in Mb) for processes accepting such + -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES + Number of cores for parallelized processes + -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]] + Secondary input files, such as read2 + -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED + Single- or paired-end sequencing protocol -gs GENOME_SIZE, --genome-size GENOME_SIZE genome size for MACS2 --frip-ref-peaks FRIP_REF_PEAKS Reference peak set for calculating FRIP - --pyadapt Use pyadapter_trim for trimming? [Default: False] - --skewer Use skewer for trimming? [Default: False] + --peak-caller {fseq,macs2} + Name of peak caller + --trimmer {trimmomatic,pyadapt,skewer} + Name of read trimming program --prealignments PREALIGNMENTS [PREALIGNMENTS ...] Space-delimited list of reference genomes to align to before primary alignment. -V, --version show program's version number and exit + +required named arguments: + -S SAMPLE_NAME, --sample-name SAMPLE_NAME + Name for sample to run + -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...] + One or more primary input files + -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY + Identifier for genome assembly