Home

Welcome to the iCLIP wiki!

Getting Started

To get started, clone the iCLIP GitHub repository to your local filesystem and load Snakemake.

Clone the GitHub repository to your local filesystem.

# Clone repository from GitHub
git clone https://github.com/RBL-NCI/iCLIP.git

# Change your working directory to the iCLIP repo
cd iCLIP/

Load Snakemake to your environment.

# Recommend running snakemake>=5.19
module load snakemake/5.24.1
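
To confirm that the loaded Snakemake meets the recommended minimum, a check along these lines may help. This is a minimal sketch: the `version_ge` helper is hypothetical (not part of iCLIP) and relies on `sort -V` from GNU coreutils being available.

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum version recommended above.
min_version="5.19"

if command -v snakemake >/dev/null 2>&1; then
    current_version="$(snakemake --version)"
    if version_ge "$current_version" "$min_version"; then
        echo "snakemake $current_version is OK"
    else
        echo "snakemake $current_version is older than $min_version" >&2
    fi
else
    echo "snakemake not found; try: module load snakemake/5.24.1" >&2
fi
```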

Preparing Configs and Manifests

There are three config requirements for this pipeline, all of which must be located in the /path/to/iCLIP/config directory. These files are:

cluster_config.yml - this file contains the default config settings for the analysis. It does not require edits unless processing requirements dictate otherwise.
snakemake_config.yaml - this file contains the directory paths and user parameters for the analysis:
- source_dir: path to snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/'
- out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sample_manifest: path to sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- multiplex_manifest: path to multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
- fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- container_dir: path to docker containers and other programs; example: '/path/to/container/'
- split_params: integer value indicating the number of sequences per split when dividing the fastq file; recommend 3000 for small test files and 2000000 for study files
- novoalign_reference: selection of reference database ['hg38', 'mm10']; example: 'mm10'
- splice_aware: whether to run the splice-aware part of the pipeline ['y', 'n']; example: 'y'
- splice_bp_length: length of splice index to use [50, 75, 150]; example: 75
- minimum_count: integer value giving the minimum number of matches required to count as a peak; example: 2
index_config.yaml - this file will contain directory paths for index files that should follow the structure:
- organism:
  - std: '/path/to/index/'
  - spliceaware:
    - valuebp1: '/path/to/index1/'
    - valuebp2: '/path/to/index2/'
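
Before running the pipeline, it can be useful to confirm that snakemake_config.yaml defines every parameter listed above. The sketch below is one way to do that. The `check_config_keys` function is hypothetical (not part of iCLIP), and it assumes each parameter appears in the file as `key: value` at the start of a line.

```shell
# check_config_keys FILE: print each expected key that is missing from FILE;
# return non-zero if any key is missing. Hypothetical helper, not part of iCLIP.
check_config_keys() {
    config_file="$1"
    missing=0
    for key in source_dir out_dir sample_manifest multiplex_manifest \
               fastq_dir container_dir split_params novoalign_reference \
               splice_aware splice_bp_length minimum_count; do
        # Assumption: each key starts a line as "key:" (optionally indented).
        if ! grep -q "^[[:space:]]*${key}[[:space:]]*:" "$config_file"; then
            echo "missing key: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (path is a placeholder):
# check_config_keys /path/to/iCLIP/config/snakemake_config.yaml
```

The check only confirms that the keys are present; it does not validate the values or the paths they point to.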

There are two manifest requirements for this pipeline, with paths identified in the snakemake_config.yaml file (#2) above. These files are:

multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex ID
- file_name: the full file name of the multiplexed sample, which must be unique; example: 'SIM_iCLIP_S1.fastq'
- multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
```
An example multiplex_manifest.tsv file:

file_name                 multiplex
SIM_iCLIP_S1.fastq        SIM_iCLIP_S1
SIM_iCLIP_S2.fastq        SIM_iCLIP_S2
```
sample_manifest.tsv - this manifest maps each multiplex ID to its samples, barcodes, adaptors, and groups
- multiplex: the multiplexID associated with the fastq file; this need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
- sample: the final sample name; this column must be unique. example: 'Ro_Clip'
- barcode: the barcode identifying the multiplexed sample; this must be unique within each multiplexID but can repeat between multiplexIDs. example: 'NNNTGGCNN'
- adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
- group: grouping for the samples; values may or may not be unique. example: 'CNTRL'
```
An example sample_manifest.tsv file:

multiplex       sample           group       barcode     adaptor
SIM_iCLIP_S1    Ro_Clip          CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S1    Control_Clip     CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S2    Ro_Clip2         CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S2    Control_Clip2    CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
```
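
The uniqueness and matching rules described for the two manifests can be checked before a run. The following is a minimal sketch rather than part of iCLIP: `check_manifests` is a hypothetical helper, and it assumes both files are tab-separated with a header row that uses the column names listed above.

```shell
# check_manifests MULTIPLEX_TSV SAMPLE_TSV
# Prints one line per problem found; prints nothing when no problem is found.
# Hypothetical helper; assumes tab-separated files with a header row.
check_manifests() {
    awk -F '\t' '
        # Header row of each file: map column names to column numbers.
        FNR == 1 {
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Skip blank lines.
        NF == 0 { next }
        # Multiplex manifest: file_name and multiplex must each be unique.
        which == "m" {
            f = $(col["file_name"]); m = $(col["multiplex"])
            if (seen_file[f]++) print "duplicate file_name: " f
            if (seen_mpx[m]++)  print "duplicate multiplex: " m
            next
        }
        # Sample manifest: multiplex must exist in the multiplex manifest,
        # sample must be unique, barcode must be unique within a multiplex.
        which == "s" {
            m = $(col["multiplex"]); s = $(col["sample"]); b = $(col["barcode"])
            if (!(m in seen_mpx)) print "multiplex not in multiplex manifest: " m
            if (seen_sample[s]++) print "duplicate sample: " s
            if (seen_pair[m, b]++) print "duplicate barcode " b " within multiplex " m
        }
    ' which=m "$1" which=s "$2"
}

# Example usage (paths are placeholders):
# check_manifests /path/to/multiplex_manifest.tsv /path/to/sample_manifest.tsv
```

Columns are looked up by header name, so the check does not depend on column order.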

Running Pipeline

Dry-Run

sh run_snakemake.sh dry-run

Execute pipeline on the cluster

sh run_snakemake.sh cluster

Execute pipeline locally

sh run_snakemake.sh local

Unlock directory (after failed partial run)

sh run_snakemake.sh unlock
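
A common pattern is to submit to the cluster only after a clean dry-run. The sketch below chains the two commands shown above. The `run_checked` function and the `PIPELINE_SCRIPT` variable are hypothetical names, and the sketch assumes run_snakemake.sh exits with a non-zero status when the dry-run fails, which this page does not confirm.

```shell
# Path to the wrapper script in the repository; override if needed.
# PIPELINE_SCRIPT is a hypothetical variable used only by this sketch.
PIPELINE_SCRIPT="${PIPELINE_SCRIPT:-run_snakemake.sh}"

# run_checked: perform a dry-run first and submit to the cluster only
# when the dry-run exits successfully (assumed to mean exit status 0).
run_checked() {
    if sh "$PIPELINE_SCRIPT" dry-run; then
        sh "$PIPELINE_SCRIPT" cluster
    else
        echo "dry-run failed; fix the reported errors before submitting" >&2
        return 1
    fi
}

# Example usage, from inside the iCLIP/ directory:
# run_checked
```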

Expected Outputs

The following directories are created under the output directory (out_dir):

log: slurm output files, copies of config and manifest files
qc: multiqc report for all samples
multiplexid

00_qc: fastqc reports for each sample

01_renamed: demultiplexed files, renamed to match sampleid

02_adaptor: sampleid files with adaptors removed

03_unzip: unzipped sampleid files, with adaptors removed

04_split: unzipped sampleid files, split into smaller files to increase processing speed

05_sam_splice: [splice_aware only] intermediate split sam files, aligned to reference

05_sam_genomic: [splice_aware only] converted transcriptome coordinates to genomic coordinates sam files, zipped

05_sam: split sam files, aligned to reference

06_reads: header, unique and multi-mapped text files

07_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files

07_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files

08_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files

09_bam_merged: merged sorted, indexed unique and multi-mapped bam files

10_dedup_bam: unsorted, sorted, and indexed deduplicated merged bam file

11_dedup_split: unsorted, sorted, and indexed split deduplicated into unique and multi-mapped files

12_bed: unique and multi-mapped bed files

13_peak: merged peak txt files

13_peak_anno: peak SAF annotation files

14_peak_count: unique, all, fraction, primary, and fraction/primary peak call txt files

15_gff: GTF and GFF3 peak files
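
After a run, a quick check that the expected directories exist can catch an incomplete run early. The sketch below uses a hypothetical `missing_dirs` helper (not part of iCLIP). The directory names are passed as arguments because the exact nesting of the numbered directories is not spelled out here; the usage line checks only `log` and `qc`.

```shell
# missing_dirs BASE NAME...: print each NAME that is not a directory under BASE.
# Hypothetical helper, not part of iCLIP.
missing_dirs() {
    base="$1"
    shift
    for name in "$@"; do
        [ -d "$base/$name" ] || echo "$name"
    done
}

# Example usage (output path is a placeholder):
# missing_dirs /path/to/output log qc
```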

Troubleshooting

Check your email for a message stating that the pipeline failed
Review the logs to determine which rule failed (logs are named by Snakemake rule)

cd /path/to/output/dir/log

Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute the pipeline (Step 2 or 3 in Running Pipeline)
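
To narrow down which log to read first, the log directory can be searched for files that mention an error. This is a minimal sketch: `logs_with_errors` is a hypothetical helper, and it assumes a failing rule writes the word "error" (in any case) to its log, which may not hold for every tool in the pipeline.

```shell
# logs_with_errors LOG_DIR: list files under LOG_DIR that contain "error"
# (case-insensitive). Prints nothing when no file matches.
# Hypothetical helper, not part of iCLIP.
logs_with_errors() {
    find "$1" -type f -exec grep -l -i "error" {} + | sort
}

# Example usage (path is a placeholder):
# logs_with_errors /path/to/output/dir/log
```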
