All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Extra step to calculate total entropy
- Updated clipper.cwl to "Clipper3" and to include the latest ENCODE annotations (GRCh38_v29e). Also removed pickle intermediates, although these never made it to the final outputs.
- Updated STAR in star*.cwl to version 2.7.6, which fixes a bug that produced non-ASCII characters.
- The core pipeline (wf_get_peaks_scatter_se.cwl and wf_get_peaks_scatter_pe.cwl) should now be fully portable on AWS.
- Slight modifications to README (updated references)
- Updated adapter examples (was missing one base in the last adapter)
- Added docker requirement definitions to most commandlinetools.
- Added the following companion workflows:
- wf_encode_se_full_nostats "full encode workflow (eCLIP + repeat mapping + region normalization)" minus umi_tools --stats (to save memory)
- wf_encode_se_full_scatter_nostats "full encode workflow" minus umi_tools --stats (multiple samples)
- Added the following commandlinetools:
- fastqc.cwl
- Added the following subworkflows to the main workflow:
- wf_fastqc.cwl: essentially fastqc.cwl + rename.cwl (so fastqc files won't overwrite each other)
- Added a required 'blacklist_file' param to the following workflows:
- wf_get_peaks_scatter_se_nostats.cwl
- wf_get_peaks_trim_partial_scatter_se.cwl
- wf_get_peaks_trim_partial_se.cwl
- Updated workflows to report uniquely-named fastqc reports so they don't overwrite each other.
- (unused in main pipeline) convert_ReadsByLoc_combined_significancecalls.pl now matches current region normalization script
- (unused in main pipeline) duplicate_removal_inline_paired_count_region_other_reads_SE.pl now matches current repeat element scripts
- (unused in main pipeline) split_bam_to_subfiles_SE.pl now matches current repeat element scripts
- Version bumped to 0.5.0
- Added the following steps to the main single-end pipeline:
- sort_bed (sorts input normalized bed file)
- blacklist remove (removes blacklisted regions from peak file)
- bed to narrowPeak (converts peak bed file to narrowPeak format)
- fix bed ("fixes" a peak bed file format such that it is compatible with bedToBigBed)
- bed to bigbed (calls bedToBigBed to convert peak bed file to bigBed format)
- Added a 'nostats' workflow in 'wf/' to optionally run the pipeline without umi_tools stats generation. This dramatically cuts down on runtime and memory requirements.
- Added pre/post processing scripts (annotate_peaks_bedformat_wproxdistal_lncRNA.pl & generate_adaptertrim_fasta.ipynb)
- annotate_peaks_bedformat_wproxdistal_lncRNA.pl (perl script that annotates bed files)
- generate_adaptertrim_fasta.ipynb (jupyter notebook that generates fasta files w/ partial adapter sequences to trim)
- YAML metadata changed slightly so that each dataset can have its own adapter sequences
- Some work was done to make the SE pipeline outputs deterministic. Outputs should now be the same on every run.
- Introduced a "wf_encode_full" workflow that combines the peak calling workflow, the repeat mapping workflow (hg19 only), and the region-level normalization workflow
- The previous manifests (eCLIP-0.2.2) for eCLIP_pairedend and eCLIP_singleend should still work.
- gzip step for all fastq files
- added arguments: ["--random-seed", "1"] to barcodecollapse_se and demux_se definitions to decrease randomness in umi_tools outputs
- added "wf_encode_se_full" and "wf_encode_se_full_scatter" cwl definitions to run 1) peak finding, 2) region level normalization, 3) repeat mapping for SE reads.
- region normalization subworkflow (regionnormalize/) cwl definitions to incorporate region level normalization
- repeat mapping subworkflow (repmap/) cwl definitions to incorporate repeat mapping
- makebigwigs script is now split into _PE and _SE due to strand flipping
- repeat-mapped reads are now named dataset.readname.umi.r1.repeat-mapped.bam (instead of dataset.readname.umi.r1TrTr.sorted.STARAligned.out.bam)
- repeat-unmapped reads are now named dataset.readname.umi.r1.repeat-unmapped.sorted.fq (instead of dataset.readname.umi.r1TrTr.sorted.STARUnmapped.out.sorted.fq.gz)
- genome-mapped reads are now named dataset.readname.umi.r1.genome-mapped.bam (instead of dataset.readname.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.bam)
- wf_trim_and_map_se.cwl now outputs gzipped X_output_trim_first and X_output_trim_again fastq files.
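
For downstream scripts that still expect the old output names, the three renames listed above (repeat-mapped, repeat-unmapped, genome-mapped) can be sketched as a suffix mapping. This is only an illustration built from the example names in those entries. The helper `new_output_name` and the `SUFFIX_MAP` table are hypothetical and are not part of the pipeline.

```python
# Old suffix -> new suffix, copied from the three renaming entries above.
# SUFFIX_MAP and new_output_name are hypothetical names used for illustration.
SUFFIX_MAP = [
    ("TrTr.sorted.STARAligned.out.bam", ".repeat-mapped.bam"),
    ("TrTr.sorted.STARUnmapped.out.sorted.fq.gz", ".repeat-unmapped.sorted.fq"),
    (
        "TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.bam",
        ".genome-mapped.bam",
    ),
]


def new_output_name(old_name: str) -> str:
    """Translate an old-style output name to the new-style name.

    Names that match none of the known old suffixes are returned unchanged.
    """
    for old_suffix, new_suffix in SUFFIX_MAP:
        if old_name.endswith(old_suffix):
            # Keep everything before the old suffix, then append the new one.
            return old_name[: -len(old_suffix)] + new_suffix
    return old_name


if __name__ == "__main__":
    print(new_output_name("dataset.readname.umi.r1TrTr.sorted.STARAligned.out.bam"))
```

The sketch only translates names. It does not touch file contents, so the change from a gzipped to an uncompressed repeat-unmapped fastq is not handled here.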