nrzabet/human_TF_analysis

1. K562 analysis

1.1 Pre-processing

For each pre-processing step, there are two scripts named after the tool used (e.g. fastqc): an R script (e.g. fastqc.r) that generates a script of bash commands, and a bash wrapper script (e.g. fastqcWrapper.sh) that submits the generated script to the HPC via qsub. Only the R and wrapper scripts are included here.
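The generate-then-submit pattern described above can be illustrated with a minimal shell sketch. It is not the repository's code (the generation step there is done in R); the function name `generate_fastqc_cmds`, the directory layout, the `.fastq.gz` file naming and the placeholder files are all assumptions made for illustration, and the `qsub` call is shown only as a comment.

```shell
# Sketch only: names, paths and file layout are assumptions, not the repository's code.
set -eu

# Write one fastqc command per FASTQ file into a commands script.
# $1 = directory containing *.fastq.gz files
# $2 = output directory for the fastqc reports
# $3 = path of the commands script to create
generate_fastqc_cmds() {
    local fastq_dir=$1
    local out_dir=$2
    local cmd_file=$3
    local fq
    printf '#!/bin/bash\n' > "$cmd_file"
    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip when the glob matches nothing
        printf 'fastqc --outdir "%s" "%s"\n' "$out_dir" "$fq" >> "$cmd_file"
    done
}

# Demo on empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_A.fastq.gz" "$demo_dir/sample_B.fastq.gz"
generate_fastqc_cmds "$demo_dir" "$demo_dir/qc" "$demo_dir/fastqc_cmds.sh"
cat "$demo_dir/fastqc_cmds.sh"

# A wrapper would then hand the generated script to the scheduler, e.g.:
#   qsub "$demo_dir/fastqc_cmds.sh"
```

Keeping generation and submission separate means the generated commands can be inspected before anything is sent to the queue.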

For pre-processing steps where the type of reads (i.e. single end or paired end) was relevant (e.g. aligning to the genome), separate scripts were created and are denoted by SE and PE in the file name for single end and paired end respectively. The pre-processing scripts were used in the following order:

  1. cat_files_SE/PE.r - to merge experiment replicates
  2. fastqc.r / fastqcWrapper.sh - for quality control
  3. trimmomaticSE/PR.r / trimmommaticWrapperSE/PE.sh - to trim adapter sequences
  4. bowtie_SE/PE.r / bowtieWrapperSE/PE.sh - to align the data to the genome
  5. sam2bamSE.r / sam2bamSEcmds.sh / sam2bamSEWrapper.sh - to convert the sam files to bam; this step was optional and was not performed for all data due to some issues with peak calling from bam files
  6. callpeaks.r / callpeaksWrapper.sh - to call ChIP peaks
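As an illustration of the SE/PE split used in the steps above, the following shell sketch picks the matching wrapper for a given FASTQ file. The `_1`/`_2` mate-file naming convention, the function name `read_layout` and the placeholder file names are assumptions; only the wrapper names `bowtieWrapperSE.sh` and `bowtieWrapperPE.sh` come from the list above, and the sketch prints the choice rather than running anything.

```shell
# Sketch only: the _1/_2 mate-file naming convention is an assumption.
set -eu

# Print "PE" if the file is a *_1.fastq.gz with an existing *_2.fastq.gz mate,
# otherwise print "SE".
read_layout() {
    local fq=$1
    local mate
    case "$fq" in
        *_1.fastq.gz)
            mate="${fq%_1.fastq.gz}_2.fastq.gz"
            if [ -e "$mate" ]; then echo PE; else echo SE; fi
            ;;
        *)
            echo SE
            ;;
    esac
}

# Demo on empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/tfA_1.fastq.gz" "$demo_dir/tfA_2.fastq.gz" "$demo_dir/tfB.fastq.gz"
for fq in "$demo_dir/tfA_1.fastq.gz" "$demo_dir/tfB.fastq.gz"; do
    layout=$(read_layout "$fq")
    echo "$(basename "$fq"): would use bowtieWrapper${layout}.sh"
done
```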

1.2 ChIPanalyser analysis

A combination of R and bash scripts were used for the ChIPanalyser analysis as follows:

  1. generateTable.r - to store all the necessary parameters for the analysis for each TF into one table
  2. ChIPanalPerformAnalysis.R - to extract the parameters from the table and parse them, finally passing them to the performAnalysis.r function
  3. performAnalysis.r - to perform the analysis; this can be called manually for individual TFs, or by the ChIpanalPerformAnalysis.sh script
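The table-driven flow above (one row of parameters per TF, parsed and passed on to the analysis) might look roughly like the shell sketch below. The column layout, the argument order, the idea of invoking `performAnalysis.r` through `Rscript`, and the TF and file names in the demo table are all hypothetical placeholders; the sketch only prints the command lines it would build.

```shell
# Sketch only: table columns, argument order and all demo values are hypothetical.
set -eu

# Build one command line per row of a tab-separated parameter table.
# $1 = table with a header line; assumed columns: TF name, ChIP file, motif file
build_analysis_cmds() {
    local table=$1
    local tf chip motif
    tail -n +2 "$table" | while IFS="$(printf '\t')" read -r tf chip motif; do
        [ -n "$tf" ] || continue   # skip blank lines
        printf 'Rscript performAnalysis.r %s %s %s\n' "$tf" "$chip" "$motif"
    done
}

# Demo table with placeholder entries.
demo_table=$(mktemp)
printf 'tf\tchip\tmotif\n' > "$demo_table"
printf 'CTCF\tCTCF_peaks.bed\tCTCF.pfm\n' >> "$demo_table"
printf 'MAX\tMAX_peaks.bed\tMAX.pfm\n' >> "$demo_table"
build_analysis_cmds "$demo_table"
```

Storing the per-TF parameters in one table keeps manual runs for a single TF and batch runs over all TFs on the same inputs.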

2. mm10 analysis

2.1 Pre-processing

For pre-processing steps where the type of reads (i.e. single end or paired end) was relevant (e.g. aligning to the genome), separate scripts were created and are denoted by SE and PE in the file name for single end and paired end respectively. The pre-processing scripts were used in the following order:

  1. 1_preprocessing_rscripts.R - to generate all preprocessing scripts into .sh files
  2. 2_checkpreproc.R - to check that the preprocessing has worked
  3. 3_preproc_barplots.R - to generate the statistic plots for the preprocessing

2.2 ChIPanalyser analysis

A combination of R scripts was used:

  1. 4_ChIPanalyser_analysis.R - to perform the model training and validation with ChIPanalyser
  2. 5_optimal_data.R - to extract optimal parameters from

3. IMR90 and HepG2 analysis

3.1 Pre-processing

For pre-processing steps where the type of reads (i.e. single end or paired end) was relevant (e.g. aligning to the genome), separate scripts were created and are denoted by SE and PE in the file name for single end and paired end respectively. The pre-processing scripts were used in the following order:

  1. 1_preProcessing_general.R - to generate all preprocessing scripts into .sh files for IMR90 cells
  2. 2_barplots_general.R - to generate the statistic plots for the preprocessing for IMR90 cells
  3. 3_ATACseq_general.R - to preprocess ATAC-seq data in IMR90 cells
  4. 4_DNase_general.R - to preprocess DNaseI-seq data in IMR90 cells
  5. 5_MNase_general.R - to preprocess MNase-seq data in IMR90 cells
  6. 6_NOMe_general.R - to preprocess NOMe-seq data in IMR90 cells
  7. 7_calculateAccessibilityLevels.R - to calculate accessibility levels for different QDAs in IMR90 cells

3.2 ChIPanalyser analysis

A combination of R scripts was used:

  1. 8_getMotif_general.R - to extract motifs for the TFs
  2. 9_ChIPanalyser_general.R - to prepare the objects for the ChIPanalyser algorithm
  3. 10_executingChIPanalyser_general.R - to perform the model training and validation with ChIPanalyser
  4. 11_HEPG2_DNAaccessibility_preprocessing.R - to preprocess ATAC-seq, DNaseI-seq and MNase-seq data in HepG2 cells
  5. 12_validatingResultsWithCREB1.R - analysis for CREB1 in HepG2 cells
  6. 13_validatingResultsWithFOXA1.R - analysis for FoxA1 in HepG2 cells
  7. 14_validatingResultsWithGATA4.R - analysis for GATA4 in HepG2 cells

4. MCF10A analysis

4.1 Analysis

A combination of R scripts was used:

  1. 1_getFPKM_and_newLostMaintained.R - to rescale the parameters for the models trained in K562 cells based on RNA-seq data
  2. 2_newLostMaintained_regions.R - to identify the regions that lose, gain and maintain DNA accessibility
  3. 3_generateAccessData.R - to generate accessibility datasets
  4. 4_ChIPanalyser.R - to run ChIPanalyser over the regions that lose, gain and maintain DNA accessibility
  5. 5_her2_plots.R - to generate the plots

5. Plots for the paper

Once all the above scripts are run, you can use plots_analysis_split.R to generate almost all plots from the manuscript.
