For each step of the pre-processing, there are two scripts named after the tool used (e.g. fastqc): an R script (e.g. fastqc.r), which generates a script of bash commands, and a bash wrapper script (e.g. fastqcWrapper.sh), which submits the script generated by the R script to the HPC via qsub. Only the R and wrapper scripts are included here.
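The generate-then-submit pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the repository performs the generation step in R, whereas the sketch uses bash throughout, and the function names, file names and the DRY_RUN switch are hypothetical and not taken from the repository.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. generate_cmds, submit_cmds, DRY_RUN and all file
# names are hypothetical; in the repository the generation step is an R script.

# Write one fastqc command per input file into a commands script.
# Usage: generate_cmds <commands script> <output directory> <fastq files...>
generate_cmds() {
  out_script="$1"
  out_dir="$2"
  shift 2
  {
    echo '#!/bin/bash'
    for fq in "$@"; do
      # fastqc writes the report for each input file into the -o directory
      printf 'fastqc -o "%s" "%s"\n' "$out_dir" "$fq"
    done
  } > "$out_script"
}

# Submit the generated script to the scheduler with qsub.
# With DRY_RUN=1 (the default here) the command is only printed.
submit_cmds() {
  script="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "qsub $script"
  else
    qsub "$script"
  fi
}
```

For example, `generate_cmds fastqc_cmds.sh qc_out sample1.fastq sample2.fastq` followed by `DRY_RUN=0 submit_cmds fastqc_cmds.sh` would write the commands file and hand it to the scheduler. Keeping generation and submission separate means the generated commands can be inspected before anything is queued.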
For pre-processing steps where the type of reads (i.e. single-end or paired-end) was relevant (e.g. aligning to the genome), separate scripts were created; these are denoted by SE and PE in the file name for single-end and paired-end reads respectively. The pre-processing scripts were used in the following order:
cat_files_SE/PE.r - to merge experiment replicates
fastqc.r / fastqcWrapper.sh - for quality control
trimmomaticSE/PR.r / trimmommaticWrapperSE/PE.sh - to trim adapter sequences
bowtie_SE/PE.r / bowtieWrapperSE/PE.sh - to align the data to the genome
sam2bamSE.r / sam2bamSEcmds.sh / sam2bamSEWrapper.sh - to convert the SAM files to BAM; this step was optional and was not performed for all data because of issues with peak calling from BAM files
callpeaks.r / callpeaksWrapper.sh - to call ChIP peaks
A combination of R and bash scripts was used for the ChIPanalyser analysis, as follows:
generateTable.r - to store all the necessary parameters for the analysis of each TF in one table
ChIPanalPerformAnalysis.R - to extract the parameters from the table, parse them and pass them to the performAnalysis.r function
performAnalysis.r - to perform the analysis; this can be called manually for individual TFs, or by the ChIpanalPerformAnalysis.sh script
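The table-driven flow above (one row of parameters per TF, each row passed on to the analysis) might look something like the dry-run sketch below. Everything specific in it is an assumption made for illustration: the tab-separated layout, the three column names, and the command-line form of performAnalysis.r are hypothetical and not taken from the repository. The function only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. The table layout (tab-separated, header row,
# columns TF / motif file / ChIP profile) and the argument order given to
# performAnalysis.r are hypothetical.

# Print one analysis command per TF listed in the parameter table.
# Usage: print_analysis_cmds <parameter table>
print_analysis_cmds() {
  table="$1"
  # Skip the header row, then read one TF per line, splitting on tabs.
  tail -n +2 "$table" | while IFS="$(printf '\t')" read -r tf motif profile; do
    # Ignore blank lines
    [ -n "$tf" ] || continue
    echo "Rscript performAnalysis.r $tf $motif $profile"
  done
}
```

Printing the commands rather than executing them makes it easy to check that every row of the table was parsed as intended before any analysis is launched.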
The pre-processing scripts were used in the following order:
1_preprocessing_rscripts.R - to generate all preprocessing scripts into .sh files
2_checkpreproc.R - to check that the preprocessing has worked
3_preproc_barplots.R - to generate the statistic plots for the preprocessing
A combination of R scripts was used:
4_ChIPanalyser_analysis.R - to perform the model training and validation with ChIPanalyser
5_optimal_data.R - to extract optimal parameters from
The pre-processing scripts were used in the following order:
1_preProcessing_general.R - to generate all preprocessing scripts into .sh files for IMR90 cells
2_barplots_general.R - to generate the statistic plots for the preprocessing for IMR90 cells
3_ATACseq_general.R - to preprocess ATAC-seq data in IMR90 cells
4_DNase_general.R - to preprocess DNaseI-seq data in IMR90 cells
5_MNase_general.R - to preprocess MNase-seq data in IMR90 cells
6_NOMe_general.R - to preprocess NOMe-seq data in IMR90 cells
7_calculateAccessibilityLevels.R - to calculate accessibility levels for different QDAs in IMR90 cells
A combination of R scripts was used:
8_getMotif_general.R - to extract motifs for the TFs
9_ChIPanalyser_general.R - to prepare the objects for the ChIPanalyser algorithm
10_executingChIPanalyser_general.R - to perform the model training and validation with ChIPanalyser
11_HEPG2_DNAaccessibility_preprocessing.R - to preprocess ATAC-seq, DNaseI-seq and MNase-seq data in HepG2 cells
12_validatingResultsWithCREB1.R - analysis for CREB1 in HepG2 cells
13_validatingResultsWithFOXA1.R - analysis for FoxA1 in HepG2 cells
14_validatingResultsWithGATA4.R - analysis for GATA4 in HepG2 cells
A combination of R scripts was used:
1_getFPKM_and_newLostMaintained.R - to rescale the parameters for the models trained in K562 cells based on RNA-seq data
2_newLostMaintained_regions.R - the regions that lose, gain and maintain DNA accessibility
3_generateAccessData.R - to generate the accessibility datasets
4_ChIPanalyser.R - to run ChIPanalyser over the regions that lose, gain and maintain DNA accessibility
5_her2_plots.R - to generate the plots
Once all the above scripts have been run, you can use plots_analysis_split.R to generate almost all plots from the manuscript.