Skip to content

Latest commit

 

History

History
157 lines (149 loc) · 11.6 KB

README.md

File metadata and controls

157 lines (149 loc) · 11.6 KB

ATAC-Seq Pipeline in Epigen-UCSD

This pipeline is largely based on Encode pipeline, with few small changes.:

  1. Enabled handling fastq.bz files
  2. Enabled to run by lib
  3. Configged to the epigen-group@TSCC

More detailed documentation can be found on this gitbook.

List of all parameters

== atac pipeline settings
	-type <string>                   : Type of the pipeline. atac-seq or chip-seq (default: atac-seq).
	-chip_seq <bool>                 : Chip-Seq (no tn5 shifting).
	-trimmed_fastq <bool>            : Skip fastq-trimming stage.
	-align <bool>                    : Align only (no MACS2 peak calling or IDR or ataqc analysis).
	-subsample_xcor <string>         : # reads to subsample for cross corr. analysis (default: 25M).
	-subsample <string>              : # reads to subsample exp. replicates. Subsampled tagalign will be used for steps downstream (default: 0; no subsampling).
	-true_rep <bool>                 : No pseudo-replicates.
	-no_ataqc <bool>                 : No ATAQC
	-no_xcor <bool>                  : No Cross-correlation analysis.
	-csem <bool>                     : Use CSEM for alignment.
	-smooth_win <string>             : Smoothing window size for MACS2 peak calling (default: 150).
	-idr_thresh <real>               : IDR threshold : -log_10(score) (default: 0.1).
	-old_trimmer <bool>              : Use legacy trim adapters (trim_galore and trimAdapter.py).
	-ENCODE3 <bool>                  : Force to use parameter set (-smooth_win 73 -idr_thresh 0.05 -multimapping 4) for ENCODE3.
	-ENCODE <bool>                   : Force to use parameter set (-smooth_win 73 -idr_thresh 0.05 -multimapping 4) for ENCODE.
	-no_browser_tracks <bool>        : Disable generation of genome browser tracks (workaround for bzip2 shared library issue).
	-overlap_pval_thresh <real>      : p-val threshold for overlapped peaks (default: 0.01).
	-macs2_pval_thresh <real>        : MACS2 p-val threshold for calling peaks (default: 0.1).
	-macs2_pval_thresh_bw <real>     : MACS2 p-val threshold for generating BIGWIG signal tracks (default: 0.1).
	-enable_idr <bool>               : Enable IDR on called peaks.
== configuration file settings
	-c <string>                      : Configuration file path.
	-env <string>                    : Environment file path.
== parallelization settings
	-no_par <bool>                   : Serialize all tasks (individual tasks can still use multiple threads up to '-nth').
	-nth <int>                       : Maximum # threads for a pipeline. (default: 8).
== cluster/system/resource settings
	-wt <string>                     : Walltime for all single-threaded tasks (example: 8:10:00, 3h, 3600, default: 5h50m, 5:50:00).
	-memory <string>                 : Maximum memory for all single-threaded tasks (equivalent to '-mem', example: 4.5G, 1024M, default: 7G).
	-use_system <string>             : Force to use a system (equivalent to 'bds -s [SYSTEM_NAME] ...', any system defined in bds.config can be used).
	-nice <int>                      : Set process priority for all tasks (default: 0; -20 (highest) ~ 19 (lowest) ).
	-retrial <int>                   : # of Retrial for failed tasks (default: 0).
	-q <string>                      : Submit tasks to a specified cluster queue.
	-unlimited_mem_wt <bool>         : Use unlimited max. memory and walltime.
== shell environment settings
	-mod <string>                    : Modules separated by ; (example: "bowtie/2.2.4; bwa/0.7.7; picard-tools/1.92").
	-shcmd <string>                  : Shell commands separated by ;. Shell var. must be written as ${VAR} not as $VAR (example: "export PATH=${PATH}:/usr/test; VAR=test").
	-addpath <string>                : Path separated by ; or : to be PREPENDED to \$PATH (example: "/bin/test:${HOME}/utils").
	-conda_env <string>              : Anaconda Python (or Miniconda) environment name for all softwares including Python2.
	-conda_env_py3 <string>          : Anaconda Python (or Miniconda) environment name for Python3.
	-conda_bin_dir <string>          : Anaconda Python (or Miniconda) bin directory.
	-cluster_task_min_len <int>      : Minimum length for a cluster job in seconds (dealing with NFS delayed write, default: 60).
	-cluster_task_delay <int>        : Constant delay for every job in seconds (dealing with NFS delayed write, default: 0).
== output/title settings
	-out_dir <string>                : Output directory (default: out).
	-title <string>                  : Prefix for HTML report and outputs without given prefix.
== species settings
	-species <string>                : Species. need to specify '-species_file' too if you have not installed genome database with 'install_genome_data.sh'.
	-species_file <string>           : Species file path.
	-species_browser <string>        : Species name in WashU genome browser.
	-ref_fa <string>                 : Reference genome sequence fasta.
	-chrsz <string>                  : Chromosome sizes file path (use fetchChromSizes from UCSC tools).
	-blacklist <string>              : Blacklist bed.
	-seq_dir <string>                : Reference genome sequence directory path (where chr*.fa exist).
== ENCODE accession settings
	-ENCODE_accession <string>       : ENCODE experiment accession ID (or dataset).
	-ENCODE_award_rfa <string>       : ENCODE award RFA (e.g. ENCODE3).
	-ENCODE_assay_category <string>  : ENCODE assay category.
	-ENCODE_assay_title <string>     : ENCODE assay title.
	-ENCODE_award <string>           : ENCODE award (e.g. /awards/U41HG007000/).
	-ENCODE_lab <string>             : Lab (e.g. /labs/anshul-kundaje/)
	-ENCODE_assembly <string>        : hg19, GRCh38, mm9, mm10.
	-ENCODE_alias_prefix <string>    : Alias prefix, Alias = alias_prefix: + filename + alias_suffix
	-ENCODE_alias_suffix <string>    : Alias suffix, Alias = alias_prefix: + filename + alias_suffix
== report settings
	-url_base <string>               : URL base for output directory.
	-viz_genome_coord <string>       : WashU genome browser genome coordinate (e.g. chr7:27117661-27153380).
== fastq input definition :
        Single-ended : For replicate '-fastq[REP_ID]', For control '-ctl_fastq[REP_ID]'
        Paired end : For replicate '-fastq[REP_ID]_[PAIR_ID]', For control '-ctl_fastq[REP_ID]_[PAIR_ID]'
== bam input (raw or filtered) definition :
        Raw bam : For replicate '-bam[REP_ID]', For control '-ctl_bam[REP_ID]'.
        Filtered bam : For replicate '-filt_bam[REP_ID]', For control '-ctl_filt_bam[REP_ID]'.
== tagalign input definition :
        For replicate '-tag[REP_ID]', For control '-ctl_tag[REP_ID]'.
== narrow peak input definition : 
        For true replicates, use '-peak1' and '-peak2',
        For pooled replicates, use '-peak_pooled',
        For two PR (self-pseudo-replicates), use '-peak[REP_ID]_pr1' and '-peak[REP_ID]_pr2'
        For two PPR (pooled pseudo-replicates), use '-peak_ppr1' and '-peak_ppr2'
== input endedness settings (SE or PE) :
	-se <bool>                       : Singled-ended data set. To specify it for each replicate, '-se[REP_ID]' for exp. reps, '-ctl_se[CTL_ID]' for control.
	-pe <bool>                       : Paired end data set. To specify it for each replicate, '-pe[REP_ID]' for exp. reps, '-ctl_pe[CTL_ID]' for controls.
== adapter sequence definition :
        Single-ended : For replicate '-adapter[REP_ID]'
        Paired end : For replicate '-adapter[REP_ID]_[PAIR_ID]'
== align multimapping settings
	-multimapping <int>              : # alignments reported for multimapping (default: 0).
== align bowtie2 settings (requirements: -bwt2_idx)
	-bwt2_idx <string>               : Bowtie2 index (full path prefix of *.1.bt2 file).
	-scoremin_bwt2 <string>          : Replacement --score-min for bowtie2.
	-wt_bwt2 <string>                : Walltime for bowtie2 (default: 47h, 47:00:00).
	-mem_bwt2 <string>               : Max. memory for bowtie2 (default: 12G).
	-extra_param_bwt2 <string>       : Extra parameter for bowtie2.
== adapter trimmer settings
	-adapter_err_rate <string>       : Maximum allowed adapter error rate (# errors divided by the length of the matching adapter region, default: 0.10).
	-min_trim_len <int>              : Minimum trim length for cutadapt -m, throwing away processed reads shorter than this (default: 5).
	-wt_trim <string>                : Walltime for adapter trimming (default: 23h, 23:00:00).
	-mem_trim <string>               : Max. memory for adapter trimming (default: 12G).
== postalign bam settings
	-mapq_thresh <int>               : Threshold for low MAPQ reads removal (default: 30).
	-rm_chr_from_tag <string>        : Perl style reg-ex to exclude reads from tag-aligns. (example: 'other|ribo|mito|_', '_', default: blank)
	-no_dup_removal <bool>           : No dupe removal when filtering raw bam.
	-wt_dedup <string>               : Walltime for post-alignment filtering (default: 23h, 24:00:00).
	-mem_dedup <string>              : Max. memory for post-alignment filtering (default: 12G).
	-dup_marker <string>             : Dup marker for filtering mapped reaads in BAMs: picard or sambamba (default: picard).
	-use_sambamba_markdup <bool>     : Use sambamba markdup instead of Picard MarkDuplicates (default: false).
== postalign bed/tagalign settings
	-mem_shuf <string>               : Max. memory for UNIX shuf (default: 12G).
	-fraglen0 <bool>                 : (LEGACY PARAM) Set predefined fragment length as zero for cross corr. analysis (add -speak=0 to run_spp.R).
	-speak_xcor <int>                : Set user-defined cross-corr. peak strandshift (-speak= in run_spp.R). Use -1 to disable (default: -1).
	-extra_param_xcor <string>       : Set extra parameters for run_spp.R (cross-corr. analysis only).
== postalign bed/tagalign settings
	-mem_shuf <string>               : Max. memory for UNIX shuf (default: 12G).
	-fraglen0 <bool>                 : (LEGACY PARAM) Set predefined fragment length as zero for cross corr. analysis (add -speak=0 to run_spp.R).
	-speak_xcor <int>                : Set user-defined cross-corr. peak strandshift (-speak= in run_spp.R). Use -1 to disable (default: -1).
	-extra_param_xcor <string>       : Set extra parameters for run_spp.R (cross-corr. analysis only).
== callpeak macs2 settings (requirements: -chrsz -gensz)
	-gensz <string>                  : Genome size; hs for human, mm for mouse.
	-wt_macs2 <string>               : Walltime for MACS2 (default: 23h, 23:00:00).
	-mem_macs2 <string>              : Max. memory for MACS2 (default: 15G).
	-extra_param_macs2 <string>      : Extra parameters for macs2 callpeak.
== callpeak naive overlap settings
	-nonamecheck <bool>              : bedtools intersect -nonamecheck (bedtools>=2.24.0, use this if you get bedtools intersect naming convenction warnings/errors).
== callpeak etc settings
	-npeak_filt <int>                : # top peaks filtered from a narrow peak files (default: 500000).
== IDR settings
	-idr_suffix <bool>               : Append IDR threshold to IDR output directory.
== ATAQC settings
	-tss_enrich <string>             : TSS enrichment bed for ataqc.
	-dnase <string>                  : DNase bed (open chromatin region file) for ataqc.
	-prom <string>                   : Promoter bed (promoter region file) for ataqc.
	-enh <string>                    : Enhancer bed (enhancer region file) for ataqc.
	-reg2map <string>                : Reg2map (file with cell type signals) for ataqc.
	-reg2map_bed <string>            : Reg2map_bed (file of regions used to generate reg2map signals) for ataqc.
	-roadmap_meta <string>           : Roadmap metadata for ataqc.
	-mem_ataqc <string>              : Max. memory for ATAQC (default: 20G).
	-wt_ataqc <string>               : Walltime for ATAQC (default: 47h, 47:00:00).