Skip to content

Latest commit

 

History

History
307 lines (217 loc) · 15.5 KB

File metadata and controls

307 lines (217 loc) · 15.5 KB

3.ChIP-seq

本章主要介绍 ChIP-seq 相关分析方法,包括 peak calling 和 motif analysis。

染色质免疫共沉淀测序(Chromatin immunoprecipitation sequencing,简称 ChIP-seq)被用于分析蛋白质与 DNA 的交互作用。常见的 ChIP-seq 实验主要研究转录因子和组蛋白修饰如何通过与DNA相互作用来调控基因表达,从而影响表型。ChIP-seq 对于充分了解许多生物过程和疾病状态至关重要。

1) Pipeline

2) Data Structure

2a) getting software & data

方法1: 使用docker

docker images的下载链接如附表所示,docker中我们已经准备好所需文件,例如 .bam 文件位于 Docker 中的 /home/test/chip-seq/input

方法2: 自行下载和安装

  1. Install software:HOMER
  2. Dowload data: 下载链接

我们使用 GSE61210 中的两个样本:GSM1499619input.bam)和GSM1499607ip.bam)。 用户也可以参考 这里 从 CHIP-seq 数据生成自己的 .bam 文件。

2b) input

Format Description Notes
.bam 将CHIP-seq的 Reads 比对到参考基因组 -

2c) output

2c.1) 核心输出文件如下所示:

File format File description
Peak calling peak file each row contatins information of a peak
Motif analysis homerResults.html de novo motif table in HTML format

peak table in peak file

# Column Headers:
#PeakID chr     start   end     strand  Normalized Tag Count    focus ratio   findPeaks Score  Total Tags      Control Tags (normalized to IP Experiment)    Fold Change vs Control   p-value vs Control      Fold Change vs Local    p-value vs Local       Clonal Fold Change
chrIII-1        chrIII  78346   78578   +       69987.1 0.862   5971.000000   5966.0   201.6   29.60   0.00e+00        26.54   0.00e+00        0.50
chrIII-2        chrIII  133     365     +       61226.9 0.775   5364.000000   5227.0   116.3   44.93   0.00e+00        59.92   0.00e+00        0.61
chrI-1  chrI    141663  141895  +       41225.6 0.854   3515.000000     3514.0169.1    20.78   0.00e+00        17.09   0.00e+00        0.50
chrII-1 chrII   165145  165377  +       35334.5 0.845   3015.000000     3018.0171.1    17.64   0.00e+00        14.47   0.00e+00        0.50
chrII-2 chrII   555827  556059  +       34817.1 0.790   2973.000000     2970.0159.6    18.61   0.00e+00        10.12   0.00e+00        0.50

图1. motif table in homerResults.html

2c.2) detailed description of Peak calling output

peak file contains the following columns

column information description
1 PeakID a unique name for each peak
2 chr chromosome where peak is located
3 starting position of peak
4 ending position of peak
5 Strand (+/-)
6 Normalized Tag Counts number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user)
7 Focus Ratio fraction of tags found appropriately upstream and downstream of the peak center
8 Peak score position adjusted reads from initial peak region - reads per position may be limited
9 total Tags number of tags found at the peak
10 Control Tags normalized to IP Experiment
11 Fold Change vs Control putative peaks have 4-fold more normalized tags in the target experiment than the control
12 p-value vs Control HOMER uses the poisson distribution to determine the chance that the differences in tag counts are statistically significant (sequencing-depth dependent), requiring a cumulative poisson p-value of 0.0001
13 Fold Change vs Local HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region
14 p-value vs Local the comparison must also pass a poisson p-value threshold of 0.0001
15 Clonal Fold Change The fold threshold can be set with the -C <#> option (default: -C 2), if this ratio gets too high, the peak may arise from expanded repeats, and should be discarded

2c.3) detailed description of Motif analysis output

Detailed output files of Motif analysis will produce many files, we only explain the main output -- homerResults.html in above. Here we will briefly introduce other files.

file name description
homerMotifs.all.motifs Simply the concatenated file composed of all the homerMotifs.motifs<#> files.
motifFindingParameters.txt command used to execute findMotifsGenome.pl.
knownResults.txt text file containing statistics about known motif enrichment (open in EXCEL).
seq.autonorm.tsv autonormalization statistics for lower-order oligo normalization.
homerResults.html HTML formatted output of de novo motif finding:
knownResults.html HTML formatted output of known motif finding.
knownResults/ directory contains files for the knownResults.html webpage, including known<#>.motif files for use in finding specific instance of each motif.

3) Running Steps

首先进入到容器(在自己电脑的 Terminal 中运行,详情请参见 这里):

docker exec -it bioinfo_tsinghua bash

以下步骤均在 /home/test/chip-seq/ 下进行:

cd /home/test/chip-seq/

准备输出目录

mkdir output

3a) Peak Calling

The common call peak software are HOMER and MACS, here we mainly introduce the usage of HOMER. You can get manual for MACS here.

HOMER contains a program called findPeaks that performs all of the peak calling analysis. Before we use findPeaks to call peak, we need to convert our .bam file into tag file by using makeTagDirectory:

makeTagDirectory input/ip    input/ip.part.bam
makeTagDirectory input/input input/input.part.bam

After this step, input/ip & input/input will contain several .tags.tsv files, as well as a file named tagInfo.txt. This file contains information about your sequencing run, including the total number of tags considered. This file is used by later peak-calling programs to quickly reference information about the experiment. Then we call peak by using these tag file:

findPeaks input/ip/ -style factor -o output/part.peak -i input/input/

Important parameters

parameter meaning
-style Specialized options for specific analysis strategies, such as factor (transcription factor ChIP-Seq) and histone (histone modification ChIP-Seq).
-o File name for to output peaks, default: stdout.
-i Input tag directory, experiment to use as IgG/Input/Control.

输出文件为 /home/test/chip-seq/output/part.peak, 示例如下

#PeakID chr     start   end     strand  Normalized Tag Count    focus ratio     findPeaks Score Total TagControl Tags (normalized to IP Experiment)       Fold Change vs Control  p-value vs Control      Fold Change vs Local      p-value vs Local        Clonal Fold Change
chrIII-1        chrIII  78346   78578   +       69987.1 0.862   5971.000000     5966.0  201.6   29.60   0.00e+00  26.54   0.00e+00        0.50
chrIII-2        chrIII  133     365     +       61226.9 0.775   5364.000000     5227.0  116.3   44.93   0.00e+00  59.92   0.00e+00        0.61
chrI-1  chrI    141663  141895  +       41225.6 0.854   3515.000000     3514.0  169.1   20.78   0.00e+00 17.09    0.00e+00        0.50
chrII-1 chrII   165145  165377  +       35334.5 0.845   3015.000000     3018.0  171.1   17.64   0.00e+00 14.47    0.00e+00        0.50
chrII-2 chrII   555827  556059  +       34817.1 0.790   2973.000000     2970.0  159.6   18.61   0.00e+00 10.12    0.00e+00        0.50
chrIII-3        chrIII  163527  163759  +       31266.1 0.826   2662.000000     2670.0  186.0   14.35   0.00e+00  14.16   0.00e+00        0.51

3b) Motif Analysis

HOMER contains a program called findMotifsGenome.pl that can find enriched motifs in ChIP-Seq peaks

findMotifsGenome.pl output/part.peak sacCer2 output/part.motif.output -len 8

Important parameters

  • Region Size (-size <#>, -size <#>,<#>, -size given, default: 200) The size of the region used for motif finding is important. If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea。
  • Motif length (-len <#> or -len <#>,<#>,..., default 8,10,12) In general, it's best to try out enrichment with shorter lengths (i.e. less than 15) before trying longer lengths.
  • Number of motifs to find (-S <#>, default 25) Specifies the number of motifs of each length to find. 25 is already quite a bit.

最重要的输出文件为 /home/test/chip-seq/output/part.motif.output/homerResults.html, 示例输出参见 这里

4) Tips/Utilities

4a) Preparation .bam from ChIP-seq data

If you want to get these two .bam files by yourself, you can follows these steps.

  1. download data

    1. The fastq data for yeast ChIP-seq was downloaded from GSE61210 .
    2. Input data was downloaded from GSM1499619;
    3. IP data was downloaded from GSM1499607.
  2. build yeast bowtie index

    Yeast sacCer2 genome data was downloaded from UCSC http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/bigZips/chromFa.tar.gz.

    Index was build with commad:

    tar -xvf chromfa.tar.gz
    cat *.fa >yeast.allchrom.fa
    mkdir bowtie_index_yeast
    bowtie-build yeast.allchrom.fa bowtie_index_yeast/sacCer2
  3. mapping

    bowtie -p 4  -m 1  -v 3  --best --strata bowtie_index_yeast/sacCer2 \
        -q input/ip.fastq -S input/ip.sam
  4. sampling

    As the .sam file is too big for tutorial example, so we selected parts of them as example file.

    samtools sort input/ip.sam >input/ip.sorted.bam
    samtools index input/ip.sorted.bam
    samtools view input/ip.sorted.bam chrI chrII chrIII -b >input/ip.part.bam

4b) peak file header

The top portion of the peak file will contain parameters and various analysis information

图2. peak file 元信息

Two fields need special notification:

  • genome size represents the total effective number of mappable bases in the genome;
  • Approximate IP effeciency describes the fraction of tags found in peaks versus genomic background. This provides an estimate of how well the ChIP worked. Certain antibodies like H3K4me3, ERa, or PU.1 will yield very high IP efficiencies (>20%), while most rand in the 1-20% range. Once this number dips below 1% it's a good sign the ChIP didn't work very well and should probably be optimized.

4c) peak calling using MACS

MACS can also be used to call peaks (the program is already installed in Docker).

macs2 callpeak -t input/ip.part.bam -c input/input.part.bam --outdir output/macs_peak \
    --name=yeast_macs_p05 --format=BAM --gsize=1.2e7 --tsize=50 --pvalue=1e-5

The main output is /home/test/chip-seq/output/macs_peak/yeast_macs_p05_peaks.xls, which is a tabular file containing information about called peaks. You can open it in excel and sort/filter using excel functions. Information include:

  • chromosome name
  • start position of peak
  • end position of peak
  • length of peak region
  • absolute peak summit position
  • pileup height at peak summit, -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then this value should be 10)
  • fold enrichment for this peak summit against random Poisson distribution with local lambda, -log10(qvalue) at peak summit

4d) Visualize peaks

Advanced users: for a better ChIP-seq visualization, ChIPseeker was used in R software.

# Prepare the environment of software
rm(list = ls())
options(stringsAsFactors = FALSE)

# Library packages
library("RMariaDB")
library("rtracklayer")
library("ChIPseeker")
library("org.Sc.sgd.db")
library("clusterProfiler")
library("GenomicFeatures")
library("ggplot2")

# Display available genomes list
ucscGenomes()[ , "db"]
# Display the list of table known to work 
supportedUCSCtables()
# Retrieve full datasets (responds to genome ID of fasta) for Yeast from UCSC
TxDb.Ssaccer.USUC.sacCer3.ensGene <- makeTxDbFromUCSC(genome = "sacCer2", tablename = "ensGene")

# Find the position of peak in WGS
txdb <- TxDb.Ssaccer.USUC.sacCer2.ensGene
# Read the exported peak file from MACS
yeast <- readPeakFile("=yeast_macs_p05_peaks.narrowPeak")
covplot(yeast, weightCol = 5)

# ChIP peaks binding TSS region
promoter <- getPromoters(TxDb = txdb, upstream = 3000, downstream = 3000)
tagMatrix <- getTagMatrix(peak = yeast, windows = promoter)
tagHeatmap(tagMatrix = tagMatrix, xlim = c(-3000, 3000), color = "red")

# Average profile of ChIP peaks binding to TSS region
plotAvgProf(tagMatrix, xlim = c(-3000, 3000), conf = 0.95, resample = 1000,
            xlab = "Genomone Region (5'->3')", ylab = "Read Count Frequency")

# Peaks annotation
peakAnno <- annotatePeak(yeast, tssRegion = c(-3000, 3000), TxDb = txdb, annoDb = "org.Sc.sgd.db")

# Visualize the genome annotation
plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
vennpie(peakAnno)
upsetplot(peakAnno)

# Visualize distribution of TF-binding loci relative to TSS
plotDistToTSS(peakAnno)

5) Homework and more

  1. 解释 findPeaksfindMotifsGenome.pl 主要参数的含义。

  2. 提交文件: 提交Snf1蛋白在DNA上结合的 peak的位置, fold change, p-value等信息 和 motif 及p-value信息 (peak 和 motif最好放到一个word/pdf文件中提交)。

    • Snf1蛋白是染色体重塑复合体中的重要组成部分之一,为了探究Snf1蛋白会特异性识别哪些DNA序列,Workman小组在酵母中做了该蛋白的ChIP-seq实验。
    • 其中Snf1蛋白IP实验数据为 ip.chrom_part.bam,Input背景数据为input.chrom_part.bam。请同学们根据本节课所学内容寻找Snf1蛋白在DNA上结合的 peak 和 motif。
    • 作业的数据:bam 文件在 Docker 的 /home/test/chip-seq/homework/ 文件夹中。
    • 筛选标准 1. peak:Fold Change (vs Control) >=8,且 p-value (vs Control) < $$10^{-8}$$ ; 2. motif: p-value < $$10^{-10}$$ ;

6) References