Home
Samantha edited this page Dec 2, 2020 · 24 revisions
Welcome to the iCLIP wiki!
Prepare your local environment by cloning the iCLIP GitHub repository and loading Snakemake.
- Clone the GitHub repository to your local filesystem.
# Clone Repository from Github
git clone https://github.com/RBL-NCI/iCLIP.git
# Change your working directory to the iCLIP repo
cd iCLIP/
- Load Snakemake to your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1
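The version recommendation above can be checked before a run. The following is a minimal sketch, not part of the pipeline: `version_at_least` is a hypothetical helper name, and it compares only the major and minor version numbers.

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Only the major and minor components are compared (e.g. 5.24.1 vs 5.19).
version_at_least() {
  awk -v have="$1" -v need="$2" 'BEGIN {
    split(have, h, "."); split(need, n, ".")
    # Different major versions decide the comparison on their own.
    if (h[1] + 0 != n[1] + 0) exit !(h[1] + 0 > n[1] + 0)
    # Same major version: compare the minor versions.
    exit !(h[2] + 0 >= n[2] + 0)
  }'
}
```

One possible use, once Snakemake is loaded: `version_at_least "$(snakemake --version)" 5.19 || echo "Snakemake is older than 5.19"`.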
This pipeline requires three config files, which must be located in the /path/to/iCLIP/config directory. These files are:
- cluster_config.yml - this file contains the default config settings for analysis. It does not require edits unless processing requirements dictate it.
- snakemake_config.yaml - this file contains directory paths and user parameters for analysis:
- source_dir: path to snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/'
- out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sample_manifest: path to sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- multiplex_manifest: path to multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
- fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- container_dir: path to docker containers and other programs; example: '/path/to/container/'
- split_params: integer value indicating the number of sequences used to split the fastq file; recommend 3000 for small test files and 2000000 for study files
- novoalign_reference: selection of reference database ['hg38', 'mm10']; example: 'mm10'
- splice_aware: whether to run splice_aware part of the pipeline ['y', 'n']; example: 'y'
- splice_bp_length: length of splice index to use [50, 75, 150]; example: 75
- minimum_count: integer value, of the minimum number of matches to count as a peak; example: 2
- index_config.yaml - this file will contain directory paths for index files that should follow the structure:
- organism:
  - std: '/path/to/index/'
  - spliceaware:
    - valuebp1: '/path/to/index1/'
    - valuebp2: '/path/to/index2/'
- organism:
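To make the snakemake_config.yaml parameters described above concrete, here is a minimal sketch that writes a starter file. It assumes a flat `key: value` YAML layout, which this page does not confirm. `write_example_config` is a hypothetical helper name, and every value is one of the placeholder examples listed above and must be replaced with real paths.

```shell
# Hypothetical helper: write a starter snakemake_config.yaml to the given path.
# Assumes a flat "key: value" layout; all values are the placeholder examples
# from the parameter list and must be edited before a real run.
write_example_config() {
  out_file="$1"
  # Refuse to overwrite an existing config file.
  if [ -e "$out_file" ]; then
    echo "refusing to overwrite $out_file" >&2
    return 1
  fi
  cat > "$out_file" <<'EOF'
source_dir: '/path/to/iCLIP/'
out_dir: '/path/to/output/'
sample_manifest: '/path/to/sample_manifest.tsv'
multiplex_manifest: '/path/to/multiplex_manifest.tsv'
fastq_dir: '/path/to/raw/fastq/files'
container_dir: '/path/to/container/'
split_params: 3000
novoalign_reference: 'mm10'
splice_aware: 'y'
splice_bp_length: 75
minimum_count: 2
EOF
}
```

Writing to a scratch location first (for example `write_example_config /tmp/snakemake_config.yaml`) keeps the config shipped with the repository untouched while you compare the two.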
There are two manifest requirements for this pipeline, with paths identified in the snakemake_config.yaml file (#2) above. These files are:
- multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex ID
- file_name: the full file name of the multiplexed sample, which must be unique; example: 'SIM_iCLIP_S1.fastq'
- multiplex: the multiplex ID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
An example multiplex_manifest.tsv file (tab-separated):
file_name             multiplex
SIM_iCLIP_S1.fastq    SIM_iCLIP_S1
SIM_iCLIP_S2.fastq    SIM_iCLIP_S2
- sample_manifest.tsv - this manifest maps each multiplex ID to its samples
- multiplex: the multiplex ID associated with the fastq file; this need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
- sample: the final sample name; this column must be unique. example: 'Ro_Clip'
- barcode: the barcode that identifies the sample within its multiplexed file; this must be unique within each multiplex ID but can repeat between multiplex IDs. example: 'NNNTGGCNN'
- adaptor: the adaptor sequence, to be removed from sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
- group: groupings for samples, may or may not be unique values. example: 'CNTRL'
An example sample_manifest.tsv file (tab-separated):
multiplex       sample           group    barcode      adaptor
SIM_iCLIP_S1    Ro_Clip          CLIP     NNNTGGCNN    AGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S1    Control_Clip     CNTRL    NNNCGGANN    AGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S2    Ro_Clip2         CLIP     NNNTGGCNN    AGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S2    Control_Clip2    CNTRL    NNNCGGANN    AGATCGGAAGAGCGGTTCAG
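The uniqueness and matching rules for the two manifests can be checked before a run. The sketch below is not part of the pipeline: `validate_manifests` is a hypothetical helper, and it assumes tab-separated files with a header row and the column order shown in the examples (multiplex manifest: file_name, multiplex; sample manifest: multiplex, sample, group, barcode, adaptor).

```shell
# Hypothetical helper: cross-check the multiplex and sample manifests.
# Assumes tab-separated files with one header row and the column order of
# the example files; prints each problem found and returns non-zero if any.
validate_manifests() {
  multiplex_manifest="$1"
  sample_manifest="$2"
  awk -F '\t' '
    FNR == 1 { next }                     # skip the header row of each file
    NR == FNR {                           # rows of the multiplex manifest
      if (seen_file[$1]++) { print "duplicate file_name: " $1; bad = 1 }
      if ($2 in known)     { print "duplicate multiplex: " $2; bad = 1 }
      known[$2] = 1
      next
    }
    {                                     # rows of the sample manifest
      if (!($1 in known))         { print "unknown multiplex: " $1; bad = 1 }
      if (seen_sample[$2]++)      { print "duplicate sample: " $2; bad = 1 }
      if (seen_barcode[$1, $4]++) { print "duplicate barcode in " $1 ": " $4; bad = 1 }
    }
    END { exit bad + 0 }
  ' "$multiplex_manifest" "$sample_manifest"
}
```

A possible call is `validate_manifests /path/to/multiplex_manifest.tsv /path/to/sample_manifest.tsv`; no output and a zero exit status mean none of these checks failed.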
- Dry-Run
sh run_snakemake.sh dry-run
- Execute pipeline on the cluster
sh run_snakemake.sh cluster
- Execute pipeline locally
sh run_snakemake.sh local
- Unlock directory (after failed partial run)
sh run_snakemake.sh unlock
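The commands above can be chained so that a real run starts only after a clean dry-run. This is a minimal sketch: `run_after_dry_run` is a hypothetical wrapper, and it assumes that run_snakemake.sh exits with a non-zero status when the dry-run fails, which this page does not state.

```shell
# Hypothetical wrapper: start the real run only if the dry-run succeeds.
# Call it from the iCLIP repository root, where run_snakemake.sh lives.
# Assumes run_snakemake.sh returns non-zero when the dry-run fails.
run_after_dry_run() {
  mode="$1"
  case "$mode" in
    cluster|local) ;;
    *) echo "usage: run_after_dry_run cluster|local" >&2; return 2 ;;
  esac
  sh run_snakemake.sh dry-run && sh run_snakemake.sh "$mode"
}
```

For example, `run_after_dry_run cluster` performs the dry-run and then submits to the cluster.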
The following directories are created under the output directory (out_dir):
- log: slurm output files, copies of config and manifest files
- qc: multiqc report for all samples
- multiplexid
- 00_qc: fastqc reports for each sample
- 01_renamed: demultiplexed files, renamed to match sampleid
- 02_adaptor: sampleid files with adaptors removed
- 03_unzip: unzipped sampleid files, with adaptors removed
- 04_split: unzipped sampleid files, split into smaller files to increase processing speed
- 05_sam_splice: [splice_aware only] intermediate split sam files, aligned to reference
- 05_sam_genomic: [splice_aware only] converted transcriptome coordinates to genomic coordinates sam files, zipped
- 05_sam: split sam files, aligned to reference
- 06_reads: header, unique and multi-mapped text files
- 07_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
- 07_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
- 08_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
- 09_bam_merged: merged sorted, indexed unique and multi-mapped bam files
- 10_dedup_bam: unsorted, sorted, and indexed deduplicated merged bam file
- 11_dedup_split: unsorted, sorted, and indexed split deduplicated into unique and multi-mapped files
- 12_bed: unique and multi-mapped bed files
- 13_peak: merged peak txt files
- 13_peak_anno: peak SAF annotation files
- 14_peak_count: unique, all, fraction, primary, and fraction/primary peak call txt files
- 15_gff: GTF and GFF3 peak files
- Check your email for a message stating that the pipeline failed
- Review the logs to determine what rule failed (logs are named by Snakemake rule)
cd /path/to/output/dir/log
- Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute pipeline (Step 2 or 3 in Running Pipeline)
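When the log directory holds many files, a quick search can narrow down which rule failed. The sketch below assumes that a failed rule writes the word "error" to its log, which may not hold for every tool; `find_error_logs` is a hypothetical helper.

```shell
# Hypothetical helper: list log files that mention "error" (case-insensitive).
# Assumes failed rules write the word "error" to their log; change the
# pattern if your logs report failures differently.
find_error_logs() {
  log_dir="$1"
  grep -il "error" "$log_dir"/* 2>/dev/null
}
```

For example, `find_error_logs /path/to/output/dir/log` prints the matching log file names, which are named by Snakemake rule.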