# Processing ATAC-seq data for maxATAC
There are many ATAC-seq data processing pipelines available. For example:
* [`TOBIAS`](https://github.com/loosolab/TOBIAS)
* [`PEPATAC`](http://pepatac.databio.org/en/latest/)
* [`ATACseqQC`](https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html)
* [`HINT-ATAC`](https://reg-gen.readthedocs.io/en/latest/hint/introduction.html)
Additionally, an interactive tutorial on how to process ATAC-seq data is available from the [Galaxy Project Github page](https://galaxyproject.github.io/training-material/topics/epigenetics/tutorials/atac-seq/tutorial.html). All the methods mentioned use about the same approach for processing ATAC-seq, but do not include all the steps to generate an ATAC-seq signal track that is normalized for maxATAC.
This walkthrough details the steps that are necessary for retrieving and processing ATAC-seq data for use with maxATAC.
The overall workflow is implemented in several ways:
1) The [`maxatac prepare`](https://github.com/MiraldiLab/maxATAC/blob/main/docs/readme/prepare.md#Prepare) function to process BAM files to min-max normalized bigwigs.
2) [snakeATAC](https://github.com/tacazares/snakeATAC), a snakemake pipeline for ATAC-seq to process data from SRA.
3) A [common workflow language](https://github.com/Barski-lab/workflows/blob/master/devel/miraldi/atacseq-pe-bowtie2.cwl) pipeline for ATAC-seq.
4) A [simple](https://github.com/MiraldiLab/maxATAC_data/blob/main/scripts/ATAC/ATAC_bowtie2_pipeline.sh) shell script.
## Data Requirements
### Paired-end sequencing vs single-end sequencing for ATAC-seq
The ATAC-seq experiments used for training maxATAC models are generated using the Illumina paired-end library format. The Illumina PE technology allows the sequencing of both DNA strands enabling placement of reads more accurately. This approach uses the same amount of DNA, but sequence both ends of the DNA fragment. PE sequencing is currently the standard for ATAC-seq and is the data type that this write-up focuses on.
#### Paired-end sequencing increases the mapping rate for the same amount of input
ATAC-seq data analysis benefits from PE sequencing because open regions of chromatin can fall in repetitive regions or regions with structural variations. Using SE reads would increase the total sequencing depth required to achieve the same coverage as PE sequencing. One major contributor to this problem is that SE reads (35-100 bp) are harder to place with modern aligners, especially in regions containing repetitive and structural variants (Figure 1). PE sequencing helps in these cases by providing information about fragment size that can help alignment algorithms more accurately place the reads.
![PE](https://github.com/MiraldiLab/maxATAC/wiki/figures/Paired_end_sequencing.png)
Fig.1 - Paired-end sequencing can resolve multi-mapped reads due to repeat elements in the genome. (Top) Single-end reads only sequence one end (red squares) of the DNA fragment (dashed yellow line). This ~35-100 bp sequence can map multiple locations if they fall within repeat elements. (Bottom) Paired-end sequencing sequences both ends of the DNA fragment(red and blue squares). The size of the fragment (solid yellow line) can help alignment algorithms resolve hard to place reads.
#### Paired-end sequencing can be used to identify nucleosome positions
Additionally, PE data provides us information about sequencing fragment size distributions that can be used to infer nucleosome positioning[@Buenrostro2013]. The method for performing this type of analysis is covered later in [assess fragment length distribution](#quality-control-of-read-alignment)
___
## From `.sra` to `.fastq`: Retrieving Public ATAC-seq Data
Millions of next generation experiments have been performed across the world and deposited in public databases for free use by other scientists. Often, you will find yourself wanting to retrieve experimental data that might have been performed by another group to compare or supplement your own work. The quick access to sequencing data for use across the world is a hallmark of the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)) databases such the short read archive ([SRA](https://www.ncbi.nlm.nih.gov/sra)) and gene expression omnibus ([GEO](https://www.ncbi.nlm.nih.gov/geo/)) [@Wheeler2008].
GEO is a database that catalogues experimental data such as sequencing data, data analysis, and expression data related to the study of gene expression grouped by project[@Wheeler2008]. The GEO database helps organize sequencing data into coherent experiments and projects that are labeled by an accession that starts with the prefix `GSM` or `GSE`.
Most GEO entries will link to an entry in the SRA database of raw sequencing data that was used in the experiments. We use the GEO database to identify the raw sequencing data in the form of a `.sra` file that are stored in the SRA database.
The `.sra` files represent the raw data associated with a `SRR` ID record. We consider these to be technical replicates. The `SRX` ID is used to organize the may `SRR` IDs that can be associated with an experiment. We consider the `SRX` ID the biological replicate.
In addition to GEO, we will use data from the Encyclopedia of DNA Elements ([ENCODE](https://www.encodeproject.org/)) [@Abascal2020] to supplement annotations available from GEO. The ENCODE data are QC'd, annotated, and reproducibility processed compared to data randomly generated and dumped in GEO. There is also overlap between the two databases, but sometimes there is a lag in deposits and annotations.
**In this section, we will:**
1) Identify a maxATAC ATAC-seq sample that we are interested in.
2) We will then use GEO and ENCODE to find the publicly available OMNI ATAC-seq raw data for download.
3) Finally, we will download the sequencing data in `.sra` format from the SRA database and convert the reads to a `.fastq` file. A `.sra` file is an archival format of next generation sequencing data used by NCBI to store the output of sequencing machines.
### Curating ATAC-seq data from maxATAC metadata
We have curated the cell types with ATAC-seq and ChIP-seq (as of July 2021) from multiple databases and this data is available in [supplementary table 1](https://www.biorxiv.org/content/10.1101/2022.01.28.478235v2.supplementary-material#:~:text=%5Bsupplements/478235_file02.xlsx%5D) of the maxATAC publication. Additionally, the file [`maxATAC_v1_meta.tsv`](data/maxATAC_v1_meta.tsv) is an updated table is available with SRR IDs instead of ENCODE IDs.
For this example we will use MCF-7 from ENCODE as an example. The experimental meta information is available in the table below.
| ENCODE_ACCESSION | CELL_TYPE | ENCODE_BIO_REP | TECH_REP | BIO_REP | SOURCE |
|------------------|-----------|----------------|-------------|-------------|--------|
| ENCSR422SUG | MCF-7 | 1 | SRR14103347 | SRX10475183 | ENCODE |
| ENCSR422SUG | MCF-7 | 2 | SRR14103348 | SRX10475184 | ENCODE |
The next few steps are geared towards using curated data from the maxATAC publication to find and download raw `.fastq` files. You can still use the following approaches for your own experimental data as well, you will just need to find the accession in your publication of interest.
If you want to skip downloading, alignment, and filtering, you can download the processed data from the [maxATAC Zenodo page](https://zenodo.org/record/6761768#.Yt666OzML0o). The signal tracks are available for biological replicates.
### Download `.fastq` files from ENCODE
This section focuses on downloading the samples that were derived from ENCODE.
The `BIO_REP` column in the maxATAC metadata corresponds to the individual experimental biological replicate. ENCODE samples in our meta data are labeled by the experiment accession. For example, the entry `ENCSR422SUG_1` corresponds to the experiment [`ENCSR422SUG`](https://www.encodeproject.org/experiments/ENCSR422SUG/).
![ENCODE Experiment Summary for ENCSR422SUG](https://github.com/MiraldiLab/maxATAC/wiki/figures/ENCODE_experiment_summary.png)
Fig.2 - Example Experiment Summary for sample ENCSR422SUG. (Blue) The experiment for ENCSR422SUG. (Red)There are two isogenic replicates that have accession IDs that are found in the TECH_REP column of the maxATAC data table.
***NOTE***: The accession IDs found in the maxATAC metatable have the suffix `_1` or `_2` next to the ENCODE experimental ID to denote that we consider those entries to be biological replicates. The ENCODE definition of biological replicate is described below:
[ENCODE Data Standards](https://www.encodeproject.org/data-standards/terms/)
> Biological replication – Replication on two distinct biosamples on which the same experimental protocol was performed. For example, on different growths, two different knockdowns, etc.
>
> Isogenic replication – Biological replication. Two replicates from biosamples derived from the same human donor or model organism strain. These biosamples have been treated separately (i.e. two growths, two separate knockdowns, or two different excisions).
>
> Technical replication – Two replicates from the same biosample, treated identically for each replicate (e.g. same growth, same knockdown).
The maxATAC metadata breaks the single ENCODE bioreplicate into two bioreplicates based on the available isogenic replicates (Figure 2; red box). For example, with experiment `ENCSR422SUG`, the isogenic replicate `ENCBS105SMU` corresponds to `ENCSR422SUG_1`.
You can download the ENCODE data by adding it to your ENCODE cart. The cart for all maxATAC data used from ENCODE is [here](https://www.encodeproject.org/carts/b183bf07-e6fe-4636-b8b6-8d0aa45e777e/). The cart allows you to build a meta file that can be used to systematically download the fastq files. The meta file for all ATAC-seq data used by maxATAC is [here](./data/ENCODE_ATAC_fastq.txt).
The easiest way to download these files is to use `curl`.
Example to download all ATAC-seq data:
```bash
xargs -L 1 -P 8 curl -O -J -L < ENCODE_ATAC_fastq.txt
```
#### 07/26/2022 Update
The original ENCODE data curated by maxATAC was referencing ENCODE IDs. The updated maxATAC meta data for the ATAC-seq data references the individual `SRR` and `SRX` IDs. Therefore, this previous method is unnecessary.
### Download `.sra` files and convert them to `.fastq` files with SRA `prefetch` and `fasterq-dump` using SRR ID
NCBI has developed a toolkit, [SRAtoolkit](https://github.com/ncbi/sra-tools), for interacting with genomics data in their databases. This toolkit has commands like `prefetch` and `fastq-dump` that can be used to retrieve publicly available sequencing data. The `prefetch` to `fastq-dump` workflow allows for pre-downloading SRA files and converting those files locally as needed.
Steps for downloading `.sra` files with SRAtoolkit:
1) We use SRA `prefetch` to pre-download the SRA file to our system.
2) Then we use `fasterq-dump` to convert the SRA file to `.fastq` format.
#### Prefetch the SRA files
The input to the prefetch function is the `SRR` or `SRX` ID.
```bash
> prefetch SRR14103347
2022-07-26T16:36:20 prefetch.2.10.1: 1) Downloading 'SRR14103347'...
2022-07-26T16:36:20 prefetch.2.10.1: Downloading via https...
2022-07-26T16:44:13 prefetch.2.10.1: https download succeed
2022-07-26T16:44:13 prefetch.2.10.1: 1) 'SRR14103347' was downloaded successfully
2022-07-26T16:44:13 prefetch.2.10.1: 'SRR14103347' has 0 unresolved dependencies
```
#### Convert the SRA files to `.fastq` files
You do not need to prefetch the `.sra` file before using `fasterq-dump`, but prefetching helps pre-download files for faster dumping. You can change the `-e` flag to fit the number of cores available.
```bash
> fasterq-dump -e 2 SRR14103347
spots read : 45,575,539
reads read : 91,151,078
reads written : 91,151,078
```
#### Gzip files to save space
The `fasterq-dump` function does not gzip files as they are converted to `.fastq`. This results in pretty large plain-text `.fastq` files. For example, the `.fastq` files for `SRR14103347` total 24GB.
```bash
> du -h SRR14103347_*.fastq
12G SRR14103347_1.fastq
12G SRR14103347_2.fastq
```
If you gzip the files you can decrease the `.fastq` file size by more than half.
```bash
> pigz *.fastq
> du -h SRR14103347_*.fastq.gz
2.4G SRR14103347_1.fastq.gz
2.6G SRR14103347_2.fastq.gz
```
Most software can work with the gzipped files directly. This saves space and time. You should work with the gzipped files whenever possible and only unzip them as needed. Some `.fastq` files can be >30GB when unzipped.
If you are downloading a lot of SRA files and have limited space, make sure to use [`vdb-config -i`](https://github.com/ncbi/sra-tools/blob/master/README-vdb-config) to set the download path to a disk with enough space.
### Combining technical replicates
Technical replicates are concatenated into a single file per read. For all maxATAC samples, the biological replicate is the `SRX` ID and the technical replicates correspond to the `SRR` IDs.
The following code assumes that the technical replicates are in the same directory grouped by sample.
```bash
# cat read 1 technical replicates
> cat *_1.fastq.gz > ${srx}_1.fastq.gz
# cat read 2 technical replicates
> cat *_2.fastq.gz > ${srx}_2.fastq.gz
```
___
## Read QC and Trimming
After technical replicates are combined into a single biological replicate `.fastq` file, the reads can be QC'd. Quality control of reads is extremely important to generating high quality ATAC-seq signal tracks.
Reads produced from sequencing experiments use adapters to help index, prime, and keep track of reads in an experiment. [These sequencing primers need to be removed before downstream data analysis](https://support.illumina.com/bulletins/2016/04/adapter-trimming-why-are-adapter-sequences-trimmed-from-only-the--ends-of-reads.html).
The adapters contain the sequencing primer binding sites, the index sequences, and the sites that allow library fragments to attach to the flow cell lawn. Libraries prepared with Illumina library prep kits require adapter trimming only on the 3’ ends of reads, because adapter sequences are not found on the 5’ ends. Most of our experiments use paired-end Illumina sequencing technology, so we must process the `.fastq` file that is downloaded with that adapter trimming algorithm.
The steps for quality control of paired-end `.fastq` files:
1) The read quality is assessed with [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
2) Adapter contamination and low quality bases are trimmed from reads.
3) Assess read quality post trimming.
### Check read quality with `FastQC`
We use [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to assess the quality of the individual reads in each of the `.fastq` files. With ATAC-seq data, we are primarily interested:
1) The overall quality of the reads in the file based on PHRED[@Ewing01031998] score
2) The degree of sequence duplication
3) The degree of adapter contamination
4) The length of the reads
We will also use `FastQC` to assess read quality after adapter trimming. A wonderful guide to FastQC is available through the [HBC Training Github page](https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/02_assessing_quality.html). In this section I will not describe all the statistics, but will focus on assessing the quality of the example sample we have been using.
The command for fastqc is simple. The `-t 8` flag is used to set how many threads are available. Then the user provides a list of `.fastq` files. These files can be gzipped. Below is an example command and output.
```bash
> fastqc -t 8 *gz
Started analysis of SRR14103347_1.fastq.gz
Started analysis of SRR14103347_2.fastq.gz
Approx 5% complete for SRR14103347_1.fastq.gz
Approx 5% complete for SRR14103347_2.fastq.gz
...
Approx 95% complete for SRR14103347_1.fastq.gz
Approx 95% complete for SRR14103347_2.fastq.gz
Analysis complete for SRR14103347_1.fastq.gz
Analysis complete for SRR14103347_2.fastq.gz
```
The output of `FastQC` is a `.html` file and a `.zip` file with the data used in the analysis. The statistics from the `.html` file for SRR14103347_1.fastq.gz are shown in the examples below.
#### A note on PHRED scores
The [PHRED](https://en.wikipedia.org/wiki/Phred_quality_score) score describes the quality of the base call from the sequencing machine. This value corresponds to:
$$Q=-10\ log_{10}(P)$$
Where Q is the PHRED score and P is the base-calling error probability.
| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|:-------------------:|:----------------------------------:|:------------------:|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
| 50 | 1 in 100,000 | 99.999% |
| 60 | 1 in 1,000,000 | 99.9999% |
Most reads used for ATAC-seq are greater than 75 bp long so triming low quality bases from the ends will not effect the mappability as much as say a 35 bp read. If you are using ChIP-seq data the PHRED score should be lower, at around 20.
___
#### Read statistics
The top of the `FastQC` output shows overall read statistics for the `.fastq` files.
![Pre Trim Stats](https://github.com/MiraldiLab/maxATAC/wiki/figures/fastqc_pre_trim_stats.png)
Fig.3 - FastQC summary statistics for SRR14103347_1.fastq.gz.
Important information in this section include the total sequences in the `.fastq`, the lengths of the reads, and the %GC content of the reads.
The field `Sequences flagged as poor quality` indicates that this `.fastq` file does not have any concerning sequences that are "low quality", but the reads could need additional processing as noted in the next sections.
___
#### Per base sequence quality score
The per base sequence quality score plot shows the distribution of PHRED scores at each base in the read. In the next step of read processing, adapter trimming, we will remove low quality base calls from the end of the read. We use a cutoff of PHRED >= 30.
![Per base sequence quality score](figures/fastqc_per_base_sequence_quality_pre_trim.png)
Fig.5 - Quality scores across all bases.
This plot shows that if you line all the reads up, the high quality base calls are found at the start of the read (Figure 5;bases 0-34). They gradually fall off as you travel down the read, with many dropping below the 30 cutoff. This is normal for sequencing experiments, but the majority of bases should be above your desired quality threshold.
___
#### Per sequence quality score
The per sequence quality score metric shows the distribution of the mean PHRED score across all reads. That means for every read, the mean PHRED score was found. Then the distribution of those mean PHRED score for all of the reads in the `.fastq` were plotted.
![Per sequence quality score](figures/fastqc_per_sequence_quality_score.png)
Fig.5 - Quality score distribution over all sequences.
This plot shows that most reads have a mean PHRED score >37. This distribution is an example of high-quality reads in the `.fastq`. This mainly shows that the base-calling was accurate.
___
#### Per base sequence content
The per base sequence content plot shows the nucleotide composition at each base pair in the read. Each nucleotide is represented by a different colored line.
![Per base sequence content](figures/fastqc_per_base_sequence_content.png)
Fig.6 - Sequence content across all bases.
From this plot you can see there are nucleotide preferences at the 5' end of the read (left). This is common in ATAC-seq expriments due to the known Tn5 sequence bias. In other types of experiments, you would want a random distribution of nucleotides in the read. That is why this metric was originally flagged by `FastQC`.
___
#### Adapter content
The next step is to identify any adapter contamination. If the insert size is less than the sequencing length, then the adapters at the 3' end of the read are included in the sequence. This excerpt from [ecSeq Bioinformatics](https://www.ecseq.com/support/ngs/trimming-adapter-sequences-is-it-necessary) helps explain how adapters contaminate reads:
>In common short read sequencing, the DNA insert (original molecule to be sequenced) is downstream from the read primer, meaning that the 5' adapters will not appear in the sequenced read. But, if the fragment is shorter than the number of bases sequenced, one will sequence into the 3' adapter. To make it clear: In Illumina sequencing, adapter sequences will only occur at the 3' end of the read and only if the DNA insert is shorter than the number of sequencing cycles (see picture below)!
![flowcell image](https://www.ecseq.com/support/ngs/img/fragmentsize.png)
Fig.7 - Image from ecSes Bioinformatics showing the attachment of fragments to a flow cell.
The `FastQC` software will do a quick check to look for common adapters in a sample of the reads.
![Adapter contamination](figures/fastqc_adapter_content.png)
Fig.8 - Adapter content.
The adapter content plot shows adapter contamination from Nextera transposase sequences (Figure 8; black line). These adapters will be [trimmed from the read](#trimming-adapters-from-reads-using-the-trim-galore-package) to help increase mappability. Note that most adapter contamination occurs at the 3' end of the read, at mostly >36 bps.
___
#### Sequence duplication level
The final metric that is important to look at is the level of sequence duplication across the reads. This metric looks at how much each read is duplicated.
![Sequence duplication level](figures/fastqc_per_sequence_duplicates_levels.png)
Fig.9 - Sequence duplication levels.
Our sample has a warning that the sequence duplication levels are a bit high. About 30% of the reads are duplicates. These could be caused by biological causes such as falling within repeat elements and also due to technical biases such as PCR duplication. We will account for duplication during our [PCR deduplication](#remove-pcr-duplicates) step described later on.
___
### Trimming adapters from reads using the Trim Galore! package
We use [`Trim Galore!`](https://github.com/FelixKrueger/TrimGalore) to trim adapters and low quality bases from the 3' ends of reads. `Trim Galore!` uses the popular [`Cutadapt`](https://cutadapt.readthedocs.io/en/stable/) algorithm with `FastQC` to apply adapter trimming.
We use the following command to trim adapters from reads:
```bash
> trim_galore -q 30 -paired -j 4 SRR14103347_1.fastq.gz SRR14103347_2.fastq.gz
```
The `-q` flag keeps only reads with a [PHRED](#a-note-on-phred-scores) score 30 or greater.
The `-paired` flag is used to specify that the data is from paired-end sequencing experiments.
The `-j 4` flag is used to specify that the jobs uses 4 threads, but at least 16 cores are needed. See [documentation](https://github.com/FelixKrueger/TrimGalore).
The next sections break up the output and provide a brief overview of the process used by `Trim Galore!`.
___
#### Parameter description and setup
All of the parameters for the environment and run are listed on startup of `Trim Galore!`. The program will test for different version of cutadapt, whether pigz is installed, and what adapters are detected from sampling the reads.
```bash
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 1.18
Could not detect version of Python used by Cutadapt from the first line of Cutadapt (but found this: >>>#!/bin/sh<<<)
Letting the (modified) Cutadapt deal with the Python version instead
pigz 2.6
Parallel gzip (pigz) detected. Proceeding with multicore (de)compression using 4 cores
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
```
___
#### Adapter detection and trimming
`Trim Galore!` will start with auto-detecting adapters using a subsample of the reads. Any matches are reported.
```bash
AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> SRR14103347_1.fastq.gz <<)
Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Nextera 271028 CTGTCTCTTATA 1000000 27.10
smallRNA 0 TGGAATTCTCGG 1000000 0.00
Illumina 0 AGATCGGAAGAGC 1000000 0.00
Using Nextera adapter for trimming (count: 271028). Second best hit was smallRNA (count: 0)
Writing report to 'SRR14103347_1.fastq.gz_trimming_report.txt'
```
`Trim Galore!` detected Nextera adapters in 27% of the reads sampled with no other major contamination detected.
After adapters have been identified, use cutadapt to trim adapters from reads for the detected sequence:
```bash
SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR14103347_1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 1.18
Python version: could not detect
Number of cores used for trimming: 4
Quality Phred score cutoff: 30
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file(s) will be GZIP compressed
Cutadapt seems to be reasonably up-to-date. Setting -j 4
Writing final adapter and quality trimmed output to SRR14103347_1_trimmed.fq.gz
```
___
#### Performing quality filtering
This next step shows the results from trimming the reads for bases with PHRED <30 and removing adapter contamination. The sequence `CTGTCTCTTATA` is searched for in all of the 3' ends of the sequencing reads.
```bash
>>> Now performing quality (cutoff '-q 30') and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATA' from file SRR14103347_1.fastq.gz <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
This is cutadapt 1.18 with Python 3.7.12
Command line parameters: -j 4 -e 0.1 -q 30 -O 1 -a CTGTCTCTTATA SRR14103347_1.fastq.gz
Processing reads on 4 cores in single-end mode ...
Finished in 285.77 s (6 us/read; 9.57 M reads/minute).
=== Summary ===
Total reads processed: 45,575,539
Reads with adapters: 25,001,500 (54.9%)
Reads written (passing filters): 45,575,539 (100.0%)
Total basepairs processed: 4,603,129,439 bp
Quality-trimmed: 230,108,812 bp (5.0%)
Total written (filtered): 3,854,541,276 bp (83.7%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 25001500 times.
No. of allowed errors:
0-9 bp: 0; 10-12 bp: 1
Bases preceding removed adapters:
A: 13.5%
C: 40.6%
G: 27.0%
T: 18.8%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
1 9194054 11393884.8 0 9194054
2 2148639 2848471.2 0 2148639
3 946834 712117.8 0 946834
...
100 106 2.7 1 5 101
101 91 2.7 1 37 54
```
The above statistics show that about 54% of reads have adapters detected. There are also statistics on the number of base pairs that have been quality trimmed. The total reads written decreases because some of the reads are discarded if >30% of the read sequence is ambiguous.
The same statistics are generated for read2, but only the summary results are shown for brevity.
```bash
=== Summary ===
Total reads processed: 45,575,539
Reads with adapters: 24,609,548 (54.0%)
Reads written (passing filters): 45,575,539 (100.0%)
Total basepairs processed: 4,603,129,439 bp
Quality-trimmed: 420,442,643 bp (9.1%)
Total written (filtered): 3,662,980,126 bp (79.6%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 24609548 times.
```
___
#### Read validation
At the end of the process read pairs are validated for size according to the [`Trim Galore!` manual](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/trim_galore_User_Guide_v0.3.7.pdf). According to the manual:
>This step removes entire read pairs if at least one of the two sequences became shorter than a certain threshold.
```bash
>>>>> Now validing the length of the 2 paired-end infiles: SRR14103347_1_trimmed.fq.gz and SRR14103347_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to SRR14103347_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to SRR14103347_2_val_2.fq.gz
Total number of sequences analysed: 45575539
Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 1879987 (4.12%)
Deleting both intermediate output files SRR14103347_1_trimmed.fq.gz and SRR14103347_2_trimmed.fq.gz
====================================================================================================
```
The output of `Trim Galore!` that will be aligned to the genome will have the extension `_{read}_val_{read}_.fq.gz` like the file `SRR14103347_1_val_1.fq.gz`.
___
## Read Alignment with Bowtie2 and Quality Control
Reads are aligned to the reference genome using bowtie2[@Langmead2012]. We have tested other alignment algorithms (bwa-mem[@Li2009] and STAR[@Dobin2013]), but performance was similar across methods for maxATAC predictions.
After read alignment, the resulting alignments need to be quality controlled and explored with summary statistics such as fragment size distribution.
Additionally, low quality alignments, reads mapping to unwanted chromosomes, and PCR duplicates are excluded.
### A brief note on MAPQ scores
The metric we used to gauge the quality of read alignment is called the [MAPQ](http://samtools.github.io/hts-specs/SAMv1.pdf) score. This guide is usefule for understanding [why we use MAPQ score](http://www.acgt.me/blog/2014/12/16/understanding-mapq-scores-in-sam-files-does-37-42).
|Flag|Description |
|--- |--- |
|1 |read is mapped |
|2 |read is mapped as part of a pair |
|4 |read is unmapped |
|8 |mate is unmapped |
|16 |read reverse strand |
|32 |mate reverse strand |
|64 |first in pair |
|128 |second in pair |
|256 |not primary alignment |
|512 |read fails platform/vendor quality checks|
|1024|read is PCR or optical duplicate |
### Aligning reads with Bowtie2
We use the following command for aligning reads to the hg38 reference genome, make sure to use the output from `Trim Galore!`. The bowtie2 function can take some time to align your reads depending on the number of threads available and the size of the library. Therefore, bowtie2 alignment isn't included in the maxATAC package, but is included in [snakeATAC](https://github.com/tacazares/snakeATAC) and our shell scripts.
***After generating a `.bam` file using the code below, you can use the [`maxatac prepare`](https://github.com/MiraldiLab/maxATAC/blob/main/docs/readme/prepare.md#Prepare) function to perform the filtering and signal generation.***
```bash
> bowtie2 --very-sensitive --maxins 2000 -p 16 -x ./hg38/bowtie2_index/hg38 -1 SRR14103347_1_val_1.fq.gz -2 SRR14103347_2_val_2.fq.gz -S SRX10475183.sam
```
The `--very-sensitive` setting defines a set of preset parameters for the matching algorithm that prioritize [accuracy over speed](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).
The `--maxins 2000` flag defines that maximum fragment size to [2,000 bp](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#:~:text=%2DX-,/,-%2D%2Dmaxins%20%3Cint%3E).
The `-1` and `-2` flags correspond to the `.fastq` files for reads 1 and 2, respectively.
The `-S` flag creates a `.sam` file as the output. This might not be the most effective use of hard disk space. Instead you can also use `-b` to output a `.bam` file. The trade-off is the speed and time required to compress the data.
#### Alignment statistics
The output from `bowtie2` are simple summary statistics about alignment rate, number of discordant reads, and number of multi-mapped reads.
```bash
> bowtie2 --very-sensitive --maxins 2000 -p 16 -x ./hg38/bowtie2_index/hg38 -1 SRR14103347_1_val_1.fq.gz -2 SRR14103347_2_val_2.fq.gz -S SRR14103347.sam
43695552 reads; of these:
43695552 (100.00%) were paired; of these:
1391108 (3.18%) aligned concordantly 0 times
36099610 (82.62%) aligned concordantly exactly 1 time
6204834 (14.20%) aligned concordantly >1 times
----
1391108 pairs aligned concordantly 0 times; of these:
38497 (2.77%) aligned discordantly 1 time
----
1352611 pairs aligned 0 times concordantly or discordantly; of these:
2705222 mates make up the pairs; of these:
2307127 (85.28%) aligned 0 times
282955 (10.46%) aligned exactly 1 time
115140 (4.26%) aligned >1 times
97.36% overall alignment rate
```
Sample `SRR14103347` had an overall alignment rate of 97.36%. This is a very high alignment rate. The number of reads that mapped to more than one location was 14.20%. Depending on the ATAC-seq protocol, the degree of duplication can have a large range. In general, aim for <30%.
## Filtering Alignments
After the reads has been aligned to the genome, the aligned reads are filtered. Filtering alignments is important in order to remove PCR duplication events. The [`maxatac prepare`](https://github.com/MiraldiLab/maxATAC/blob/main/docs/readme/prepare.md#Prepare) function will perform the following steps given an input `.bam` or scATAC-seq fragments file. We want to remove reads that are the product of PCR duplicates. These are reads that have the exact same mapping position and barcode. This helps reduce the library specific bias caused by PCR amplification of random sequences.
### Removing PCR duplicates
We remove PCR duplicates using `samtools sort` to sort and then `samtools fixmate` to fix named read pairs. We then use [`samtools markdup`](http://www.htslib.org/doc/samtools-markdup.html) to remove the duplicated reads. Some protocols use picard tools and this approach is also acceptable.
#### Sort reads by name
The first step will be to filter the reads by their MAPQ score and then[sort the reads by name](http://www.htslib.org/doc/samtools-sort.html). We want to filter out low quality alignments before generating any signal tracks. To do this, we use the `-q 30` flag with `samtools view`.
```bash
# Sort by NAME http://www.htslib.org/doc/samtools-sort.html
> samtools view -@ 8 -b -u -q 30 SRR14103347.bam | \
samtools sort -@ 8 -n -o SRR14103347_namesorted_q30.bam -
```
#### Fixmates and then sort by position
After the reads are sorted by their names, [`samtools fixmate`](http://www.htslib.org/doc/samtools-fixmate.html) will fill in the information for their mates as stated in the documentation:
>Fill in mate coordinates, ISIZE and mate related flags from a name-sorted or name-collated alignment.
```bash
# Fixmate and sort by POSITION then index http://www.htslib.org/doc/samtools-fixmate.html
> samtools fixmate -@ 8 -m SRR14103347_namesorted.bam - | \
samtools sort -@ 8 -o SRR14103347_fixmates.bam -
# The file needs to be indexed after sorting by position
> samtools index -@ 8 SRR14103347_fixmates.bam
```
#### Mark duplicates and remove them
The final step requires using [`samtools markdup`](http://www.htslib.org/doc/samtools-markdup.html) to mark and remove duplicates. The output is position sorted and indexed.
```bash
# Mark duplicates, remove, sort, index
> samtools markdup -@ 8 -r -s SRR14103347_fixmates.bam - | \
samtools sort -@ 8 -o SRR14103347_deduplicated.bam -
> samtools index SRR14103347_deduplicated.bam
```
The `-r` flag with `samtools markdup` is used to remove the duplicates from the file.
The `-s` flag with `samtools markdup` prints some summary statistics shown below.
```bash
READ 87391104 WRITTEN 70733337
EXCLUDED 2307127 EXAMINED 85083977
PAIRED 84760956 SINGLE 323021
DULPICATE PAIR 16467008 DUPLICATE SINGLE 190759
DUPLICATE TOTAL 16657767
```
Out of `87391104` reads, `70733337` were written to output. In total there were `16657767` PCR duplicates.
### Removing specific chromosomes while keeping only properly paired and oriented reads without singletons
The mitochondrial genome is abundant in every cell and in Standard ATAC-seq experiments there is a lot of mitochondrial contamination. We remove all reads associated with the mitochondrial genome from our analysis. Additionally, we focus on the autosomal chromsomes that are found in the reference genome.
During the filtering process we also want to double check that no singleton reads are stored in the BAM file. We use samtools to do perform filtering using the `-f` flag for inclusion of reads and the `-F` flag for exclusions of reads. We specify the `-f 3` flag to filter for [properly paired and oriented reads](https://broadinstitute.github.io/picard/explain-flags.html). We also want to make sure that the BAM file does not contain any unmapped reads either, so we use the `samtools view` flag [`-F 4`](https://broadinstitute.github.io/picard/explain-flags.html) flag to exclude them.
We want to remove reads that are unmapped and singletons because we will use `bedtools bamtobed` to generate bed files for all entries. The pre-filtering ensures you only have the paired reads left when you convert your reads to a BED file, as each bed interval will be considered a cut-site in later processing.
This is an example:
```bash
> samtools view -@ 16 -f 3 -b -F 4 SRR14103347_deduplicated.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY | \
samtools sort -@ 16 -o SRR14103347_final.bam -
```
After the `.bam` file has filtered, we will next convert the read entries to BED intervals that represent the read as Tn5 cut sites.
## Generating Tn5 cut site signal tracks: Filtered BAM to Bigwig
The underlying enzyme that enables ATAC-seq is the Tn5 transposase. We want to leverage our knowledge of Tn5 biology and enzyme action to better resolve the exact location that the DNA was cut. We will generate cutsites using `bedtools bamtobed`, `bedtools slop`, and `awk`. For a more detailed description of Tn5 action during ATAC-seq, check out the original ATAC-seq publication from Jason Buenrostro[@Buenrostro2013], the [`HINT-ATAC`](https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-019-1642-2.pdf) paper, and the publication describing [Tn5 for high-throughput sequencing](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-12-r119).
The key steps to generating the ATAC-seq signal track are:
1) Shift reads +4 bp on the (+) strand and -5 bp on the (-) strand.
![Tn4 read shifting](figures/tn5_read_shift.png)
2) Slop cut sites to 40 bp to smooth the sparse cut site signal. Image adapted from [`HINT-ATAC`](https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-019-1642-2.pdf).
![Slop size](figures/tn5_slop_size.png)
### Tn5 5' read shifting and inferring Tn5 sites
The Tn5 transposase homodimer cuts DNA to insert sequencing adapters. During transposition, two single stranded DNA ends are left at the 5' end that are 9bp long. The figure below is adapted from the HINT-ATAC publication and shows the crystal structure of Tn5 compared to the DNA it is bound to (Figure 8). Tn5 action causes two single strand 5' ends that are 9 bp long that are ligated to adapters.
![Hint ATAC figure 1C](figures/HINT-ATAC_fig1C.png)
Fig.8 - HINT-ATAC Figure 1C. This figure from the HINT-ATAC publication shows the Tn5 dimer cutting DNA. Tn5 leaves two single stranded ends that are 9 bp long on the 5' end of DNA. Adapters are attached to the ends of the 9bp long sequence.
Due to the 9 bp overhang, the read end does not correspond to the Tn5 cut site location. The true cut site location is located at the +5 bp position from the read end (Figure 9).
![Hint ATAC figure 1C](figures/HINT-ATAC_fig1A.png)
Fig.9 - HINT-ATAC Figure 1A. This figure from the HINT-ATAC publication shows the true middle of the Tn5 cut site is +5 from the 5' end of the read.
Identifying the exact cut site of the Tn5 enzyme allows for inferring the position of the enzyme and its cleavage point. To find the true cut site, the reads are shifted + 4 bp on the (+) strand and a -5 bp on the (-) strand.
#### Converting a `.bam` to `.bed`
We will convert our reads in the format of a `.bam` file to a `.bed` file where each read corresponds to a row in the file. This method will result in both ends of the paired end fragments being considered during signal generation. We use [`bedtools bamtobed`](https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html) to perform this conversion.
Example command:
```bash
bedtools bamtobed -i SRR14103347_final.bam > SRR14103347_reads.bed
```
This is an example of the output from converting the `.bam` to `.bed`:
```bash
> head SRR14103347_reads.bed
chr1 9989 10087 SRR14103347.10889469/1 0 +
chr1 9989 10083 SRR14103347.30379212/1 6 +
chr1 9990 10057 SRR14103347.15135894/1 0 +
chr1 9991 10090 SRR14103347.11934073/1 2 +
chr1 9996 10047 SRR14103347.10094158/1 6 +
chr1 9996 10095 SRR14103347.18169633/1 0 +
chr1 9996 10095 SRR14103347.25344866/1 1 +
chr1 9996 10047 SRR14103347.41806681/2 2 +
chr1 9997 10098 SRR14103347.22439821/2 2 +
chr1 9997 10098 SRR14103347.31445403/1 1 +
```
These `.bed` files can be large as they are not compressed and are plain text. If you are keeping these files you should `gzip` them.
#### Shift read positions and find cut sites
In [BED intervals](https://angus.readthedocs.io/en/2013/_static/interval-datatypes.pdf) the 5' of the (+) strand is found as the chromosome start position (column 2) and the (-) strand is found as the chromosome end position (column 3).
Once the reads are in `.bed` format, shifting the reads is easy with `awk`. In addition to shifting the reads. We only care about the position of the cut site, so we create a 1 bp window around the start 5' of the BED interval.
Example:
```bash
> awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 4, $2 + 5, $4, $5, $6; else print $1, $3 - 5, $3 - 4, $4, $5, $6}' SRR14103347_reads.bed > SRR14103347_reads_shifted.bed
```
This is an example of shifted reads:
```bash
> head SRR14103347_reads_shifted.bed
chr1 9993 9994 SRR14103347.10889469/1 0 +
chr1 9993 9994 SRR14103347.30379212/1 6 +
chr1 9994 9995 SRR14103347.15135894/1 0 +
chr1 9995 9996 SRR14103347.11934073/1 2 +
chr1 10000 10001 SRR14103347.10094158/1 6 +
chr1 10000 10001 SRR14103347.18169633/1 0 +
chr1 10000 10001 SRR14103347.25344866/1 1 +
chr1 10000 10001 SRR14103347.41806681/2 2 +
chr1 10001 10002 SRR14103347.22439821/2 2 +
chr1 10001 10002 SRR14103347.31445403/1 1 +
```
#### Smooth signal using known Tn5 dimer width
The final step is to center the signal over the 5' end of the read that represents the Tn5 transposase binding event. This will provide us with regions of the genome that are occupied by Tn5 which requires a [minimal ~40bp accessible region to cleave DNA]([figures/tn5_biology.png](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-12-r119)) (figure 10).
![Tn5 Width](figures/tn5_biology.png)
Figure 10. Figure 4 from Adey 2010 et al. This figure shows that Tn5 has an approximate size of 38 bp between cut sites.
The window around the Tn5 cut size also helps to smooth the sparse cut site signal (figure 11).
![tn5 smoothing](figures/tn5_signal_smoothing.png)
Figure 11. Tn5 cut site smoothing using different window sizes of 1 bp, 20 bp, 50 bp, and 80 bp.
Example code:
```bash
> bedtools slop -i SRR14103347_reads_shifted.bed.gz \
-g hg38.chrom.sizes \
-b 20 > SRR14103347_Tn5_slop20.bed
```
This is an example of reads with a `-b 20`:
```bash
> head SRR14103347_Tn5_slop20.bed
chr1 9973 10014 SRR14103347.10889469/1 0 +
chr1 9973 10014 SRR14103347.30379212/1 6 +
chr1 9974 10015 SRR14103347.15135894/1 0 +
chr1 9975 10016 SRR14103347.11934073/1 2 +
chr1 9980 10021 SRR14103347.10094158/1 6 +
chr1 9980 10021 SRR14103347.18169633/1 0 +
chr1 9980 10021 SRR14103347.25344866/1 1 +
chr1 9980 10021 SRR14103347.41806681/2 2 +
chr1 9981 10022 SRR14103347.22439821/2 2 +
chr1 9981 10022 SRR14103347.31445403/1 1 +
```
#### Remove blacklisted regions
There are many lists of exclusion regions available online. The list we curated for maxATAC was developed for ATAC-seq data specifically. Below is a list of blacklists that have been curated for maxATAC and their coverage statistics of the hg38 genome. The importance of removing these blacklist regions is further explored in the importance of data normalization.
| file | Number of Intervals | Total bps covered | Percent hg38 covered |
|:-----------------------------:|:-------------------:|:-----------------:|----------------------|
| [ENCFF356LFX.bed](https://www.encodeproject.org/files/ENCFF356LFX/) | 910 | 71,570,285 | 2.17 |
| [hg38-blacklist.v2.bed](https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz) | 636 | 227,162,400 | 6.886 |
| [hg38_centromeres.bed](https://genome.ucsc.edu/cgi-bin/hgTables) | 109 | 59,546,786 | 1.805 |
| [hg38_gaps.bed](https://genome.ucsc.edu/cgi-bin/hgTables) | 827 | 161,348,343 | 4.891 |
| [hg38_maxatac_blacklist.bed](https://github.com/MiraldiLab/maxATAC_data/blob/main/hg38/hg38_maxatac_blacklist.bed) | 376 | 217,585,970 | 6.596 |
| hg38_maxatac_blacklist_V2.bed | 1,667 | 275,198,132 | 8.342 |
| [hg38_segmental_dups_chrM.bed](https://genome.ucsc.edu/cgi-bin/hgTables) | 12 | 36,418 | 0.001 |
Example code:
```bash
> bedtools intersect -a SRR14103347_Tn5_slop20.bed -b /data/miraldiLab/databank/genome_inf/hg38/hg38_maxatac_blacklist_V2.bed -v > SRR14103347_Tn5_slop20_blacklisted.bed
```
This example had `1,952,443` of `66,223,862` Tn5 sites overlapping blacklist regions.
```bash
> wc -l SRR14103347_Tn5_slop20_blacklisted.bed
64271419 SRR14103347_Tn5_slop20_blacklisted.bed
> wc -l SRR14103347_Tn5_slop20.bed
66223862 SRR14103347_Tn5_slop20.bed
```
#### One line to run them all
The above commands can be strung together into a single bash pipe for maximum efficiency!
```bash
> bedtools bamtobed -i SRR14103347.bam | \
awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 4, $2 + 4, $4, $5, $6; else print $1, $3 - 5, $3 - 5, $4, $5, $6}' | \
bedtools slop -i - -g hg38.chrom.sizes -b 20 | \
bedtools intersect -a - -b hg38_maxatac_blacklist_V2.bed -v | \
sort -k 1,1 > SRR14103347_Tn5_slop20_blacklisted.bed
```
## Generating genome-wide signal tracks of Tn5 counts
The final steps in processing your ATAC-seq data is to generate genome-wide signal tracks in `.bigwig` format that represent Tn5 counts at single-nucleotide resolution.
The overall steps are:
1) Convert Tn5 sites in `.bed` format to `.bedgraph` format where each position is a count of Tn5 intervals at each genomic location.
2) Convert Tn5 `.bedgraph` to `.bigwig` that is normalized by sequencing depth
### Convert Tn5 `.bed` to `.bedgraph` of read-depth normalized counts
The blacklisted Tn5 counts file is used as input to `bedtools genomecov` with the `-bg` and `-scale` flags.
The `-bg` flag will output the counts in bedgraph format where the 4th column in the file corresponds to the Tn5 count.
Briefly, `.bedgraph` files are like `.bed` files, but represent ranges of values.
For example:
```bash
> bedtools genomecov -i SRR14103347_Tn5_slop20_blacklisted.bed \
-g hg38.chrom.sizes \
-bg > SRR14103347_Tn5_slop20_blacklisted.bedgraph
> head SRR14103347_Tn5_slop20_blacklisted.bedgraph
chr1 792554 792595 1
chr1 792617 792658 1
chr1 792838 792879 1
chr1 792884 792966 1
chr1 792982 793023 1
chr1 793050 793075 1
chr1 793075 793091 2
chr1 793091 793116 1
chr1 793213 793220 1
chr1 793220 793243 2
```
The `-scale` flag will normalize the signal at each position. We want to normalize the count by the # of mapped reads.
#### Calculating scale factor
The below sections walk through how to calculate the scale factor for the `-scale` flag.
##### Count the number of mapped reads
You can get the number of mapped reads from the `.bam` file used as input to the [`bedtools bamtobed`](ATAC-seq-Data-Processing.md#convert-tn5-bed-to-bedgraph-of-read-depth-normalized-counts) function. The `-c` flag counts the reads. The [`-F 260`](https://broadinstitute.github.io/picard/explain-flags.html) flag is used to exclude unmapped and non-primary alignments.
```bash
> mapped_reads=$(samtools view -c -F 260 SRR14103347_final.bam)
> echo $mapped_reads
66223862
```
##### Calculate the scale factor
The scale factor is equal to each count divided by the mapped reads. Since `bedtools genomecov -scale` uses multiplication to apply to each value, we find the scale factor where,
$$scale\ factor\ =\ (\frac{1}{\#\ of\ mapped\ reads})*20,000,000$$
```bash
> reads_factor=$(bc -l <<< "1/${mapped_reads}")
> echo $reads_factor
.00000001510029723123
```
Scale each value by `20,000,000`. This is a value that was chosen for the maxATAC publication based ont he median sequencing depth of data available in 2021.
```bash
> rpm_factor=$(bc -l <<< "${reads_factor} * 20000000")
> echo ${rpm_factor}
.30200594462460000000
```
We use the value `.30200594462460000000` for `bedtools genomecov -bg -scale`.
#### Apply scale factor to `.bedgraph`
The final output should be a scaled count in `.bedgraph` format. These files must be sorted by position before being used with downstream software.
Example code:
```bash
> bedtools genomecov -i SRR14103347_Tn5_slop20_blacklisted.bed \
-g hg38.chrom.sizes \
-bg \
-scale .30200594462460000000 | \
LC_COLLATE=C sort -k1,1 -k2,2n > SRR14103347_Tn5_slop20_blacklisted_RP20M.bedgraph
```
Expected output:
```bash
> head SRR14103347_Tn5_slop20_blacklisted_RP20M.bedgraph
chr1 792554 792595 0.302006
chr1 792617 792658 0.302006
chr1 792838 792879 0.302006
chr1 792884 792966 0.302006
chr1 792982 793023 0.302006
chr1 793050 793075 0.302006
chr1 793075 793091 0.604012
chr1 793091 793116 0.302006
chr1 793213 793220 0.302006
chr1 793220 793243 0.604012
```
#### Convert `.bedgraph` to `.bigwig`
[Bigwig files are compressed wiggle files](https://genome.ucsc.edu/goldenPath/help/bigWig.html). These files represent spans of values that can be quickly accessed and visualized. We work primarily with the `.bigwig` signal tracks when using `maxatac`. We use a tool provided by UCSC called [`bedGraphToBigWig`](https://genome.ucsc.edu/goldenpath/help/bigWig.html).
The following code uses the read-depth normalized `.bedgraph` file of Tn5 count as input:
```bash
bedGraphToBigWig SRR14103347_Tn5_slop20_blacklisted_RP20M.bedgraph hg38.chrom.sizes SRR14103347_Tn5_slop20_blacklisted_RP20M.bw
```
This represents the same information in the `.bedgraph` file, but in a more compressed and disk friendly format.
```bash
> du -h SRR14103347_Tn5_slop20_blacklisted_RP20M.bedgraph
2.2G SRR14103347_Tn5_slop20_blacklisted_RP20M.bedgraph
> du -h SRR14103347_Tn5_slop20_blacklisted_RP20M.bw
472M SRR14103347_Tn5_slop20_blacklisted_RP20M.bw
```
#### Example script for converting `.bed` to read-depth normalized `.bigwig` files
This is an example workflow for read normalizing your inferred Tn5 bed files. This code will take the inferred Tn5 sites and build the genome wide coverage track using information in the bam header:
```bash
#!/bin/bash
########### RPM_normalize_Tn5_counts.sh ###########
# This script will take a BAM file and a BED file of insertion sites to create a bigwig file.
# This script requires that bedtools be installed and bedGraphToBigWig from UCSC
# INPUT1: BAM
# INPUT2: chromosome sizes file (hg38.chromsizes.txt)
# INPUT3: Millions factor (1000000 or 20000000)
# INPUT4: Name
# INPUT5: BED of Tn5 Sites
# OUTPUT: Sequencing depth normalized bigwig file scaled by # of reads of interest
# Set up Names
bedgraph=${4}.bg
bigwig=${4}.bw
mapped_reads=$(samtools view -c -F 260 ${1})
reads_factor=$(bc -l <<< "1/${mapped_reads}")
rpm_factor=$(bc -l <<< "${reads_factor} * ${3}")
echo "Scale factor: " ${rpm_factor}
# Use bedtools to obtain genome coverage
echo "Using Bedtools to convert BAM to bedgraph"
bedtools genomecov -i ${5} -g ${2} -bg -scale ${rpm_factor} | LC_COLLATE=C sort -k1,1 -k2,2n > ${bedgraph}
# Use bedGraphToBigWig to convert bedgraph to bigwig
echo "Using bedGraphToBigWig to convert bedgraph to bigwig"
bedGraphToBigWig ${bedgraph} ${2} ${bigwig}
rm ${bedgraph}
```
### Average biological replicate bigwig files
If your experiments include multiple biological replicates for the same condition, you can average the normalized signal using `maxatac average`. The detailed instructions about using `maxatac average` are found on the [average wiki page](https://github.com/MiraldiLab/maxATAC/wiki/average) and [readme](https://github.com/MiraldiLab/maxATAC/blob/main/docs/readme/average.md#Average).
Example command:
```bash
maxatac average --bigwigs SRX10474876_Tn5_slop20_blacklisted.bw SRX10474877_Tn5_slop20_blacklisted.bw --prefix IMR-90
```
___
## Min-max normalize signal tracks
During the initial development of maxATAC in 2020, we based our method on Leopard[@Li2021] which is a method developed for Dnase signal tracks used in the 2017 ENCODE-DREAM challenge. Our experience with Leopard introduced us to the importance of many key concepts of deep learning in genomics. This section focuses on the importance of signal normalization.
We found that one key aspect of model performance was data normalization.
However, the quality of the reference sample can influence to the distribution of scores.
Our performance on these models was near 0, with many odd observations.
Detailed instruction about min-max normalization implemented by maxATAC is documented on the [normalize wiki page](https://github.com/MiraldiLab/maxATAC/wiki/normalize) and [readme](https://github.com/MiraldiLab/maxATAC/blob/main/docs/readme/normalize.md#Normalize).
Example command:
```bash
maxatac normalize --signal GM12878_RP20M.bw --prefix GM12878 --output ./test --method min-max --max_percentile 99
```
___
## Calling peaks with `macs2`
We use `macs2` to identify regions of the genome that are enriched for signal compared to background, these regions are called "peaks" (islands, windows, hotspots). The peaks represent a binary label that is used to identify the enriched regions. Peak calling can be accomplished by many types of software, and the peaks themselves can be refined using additional downstream analysis. This is our approach for finding regions of the genome with high Tn5 cut site signal compared to background.
### Preparing the `macs2` inputs
Depending on your experimental design, `macs2` can be used with biological replicates as inputs in addition to a control group. We do not use a control input when calling peaks.
We will input the [SRR14103347_Tn5_slop20_blacklisted.bed](#generating-tn5-cut-site-signal-tracks-filtered-bam-to-bigwig) file. This bed file has been filtered, PCR deduplicated, converted to Tn5 cut sites, and had blacklisted regions removed.
### Tn5 specific settings for `macs2`
The `macs2` software has many parameters and settings that can be used to tune peak calling. We will take advantage of the `--shift`, `--ext_size`, and `-bed` input parameters.
Example command:
```bash
macs2 callpeaks -i -n --shift 0 --ext_size 40
```
___
## Overall QC
### Fragment distribution
```bash
bamPEFragmentSize -b {input} -p {threads} --binSize {params.bin_size} --blackListFileName {params.blacklist} --outRawFragmentLengths {output}
```
### Distribution of reads across chromosomes
### FRIP