本章我们将:
- 理解和掌握基本的基因差异表达分析方法(Differential Expression Analysis);学会 Differential Expression Analysis 的基本软件。
- 使用 TopHat 和 Cufflinks 完成 Differential Expression Analysis。
-
software (already available in Docker,docker images的下载链接如附表所示)
- R & cummeRbund package (BioConductor)
tophat
bamtools
cufflinks
-
Data (already available in Docker)
The Yeast RNA-seq data were downloaded from GSE42983,有两种条件,每种两个生物重复
- wild-type :
wt1.fq
&wt2.fq
- H2O2 treatment (oxidative stress):
wt1X.fq
&wt2X.fq
We random sample 1M or 10K reads per sample, 10K reads are in
/home/test/diff-exp/Raw_reads_10k
of the Docker. - wild-type :
format | description | Notes |
---|---|---|
.fastq | RNA sequences of each sample | raw data of RNA-seq |
.fa | sequences of the whole genome | |
.gff | genome annotation file | Default genome annotation file is from GENCODE, is similar to gtf file |
.ebwt | bowtie index file | used to align RNA sequences to genome |
format | description | Notes |
---|---|---|
.bam | genome mapped reads (binary format of sam) | generated by align step |
transcript.gtf | assembled transcripts of each sample | generated by assemble step |
merged.gtf | all annotation of assembled transcripts | generated by merge step |
txt/tsv | differentially expressed genes | generated by DE gene identify step |
R plot to explore differentially expressed genes | generated by R package |
cuffdiff output
format | description |
---|---|
gene_exp.diff | Differentially expressed genes |
isoform_exp.diff | Differentially expressed transcripts |
cds_exp.diff | Differentially expressed coding sequences |
cds.diff | Genes with differential CDS output |
promoters.diff | Genes with differential promoter use |
splicing.diff | Differentially spliced TSS groups |
tss_group_exp.diff | Differentially expressed TSS groups |
和之前的章节一样,首先进入到docker容器:
docker exec -it bioinfo_tsinghua bash
以下步骤均在 /home/test/diff-exp/
下进行:
cd /home/test/diff-exp/
(1) map reads using Tophat
tophat
maps short sequences from spliced transcripts to whole genome.
mkdir mapping
tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt1_thout \
bowtie_index/YeastGenome Raw_reads_10k/wt1.fq
tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt2_thout \
bowtie_index/YeastGenome Raw_reads_10k/wt2.fq
tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt1X_thout \
bowtie_index/YeastGenome Raw_reads_10k/wt1X.fq
tophat -p 4 -G yeast_annotation.gff --no-coverage-search -o mapping/wt2X_thout \
bowtie_index/YeastGenome Raw_reads_10k/wt2X.fq
用法:tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]]
-p/--num-threads
default: 1-G/--GTF
GTF/GFF with known transcripts-o/--output-dir
default:./tophat_out
(2) Extract mapped reads on chr I only (for quick running)
-
index
.bam
fileindex
command generates index for.bam
file
bamtools index -in mapping/wt1_thout/accepted_hits.bam
bamtools index -in mapping/wt2_thout/accepted_hits.bam
bamtools index -in mapping/wt1X_thout/accepted_hits.bam
bamtools index -in mapping/wt2X_thout/accepted_hits.bam
-
extract
filter
command filters.bam
files by user-specified criteria
bamtools filter -region chrI -in mapping/wt1_thout/accepted_hits.bam \
-out mapping/wt1_thout/chrI.bam
bamtools filter -region chrI -in mapping/wt2_thout/accepted_hits.bam \
-out mapping/wt2_thout/chrI.bam
bamtools filter -region chrI -in mapping/wt1X_thout/accepted_hits.bam \
-out mapping/wt1X_thout/chrI.bam
bamtools filter -region chrI -in mapping/wt2X_thout/accepted_hits.bam \
-out mapping/wt2X_thout/chrI.bam
(1) Assemble transcripts for each sample:
usage: cufflinks -p number_of_cores -o output_dir input_file
cufflinks -p 4 -o assembly/wt1_clout mapping/wt1_thout/chrI.bam
cufflinks -p 4 -o assembly/wt2_clout mapping/wt2_thout/chrI.bam
cufflinks -p 4 -o assembly/wt1X_clout mapping/wt1X_thout/chrI.bam
cufflinks -p 4 -o assembly/wt2X_clout mapping/wt2X_thout/chrI.bam
(2) Create a file called assemblies.txt
which lists the assembly files for each sample in the assembly
folder. Its content is as follows:
assembly/wt1X_clout/transcripts.gtf
assembly/wt1_clout/transcripts.gtf
assembly/wt2X_clout/transcripts.gtf
assembly/wt2_clout/transcripts.gtf
You can create that file manually by vim
or use this command:
ls assembly/*/transcripts.gtf > assembly/assemblies.txt
(3) Merge all assemblies to one file containing merged transcripts:
cuffmerge
takes two or more Cufflinks .gtf
files and merges them into a single unified transcript catalog
cuffmerge -g yeast_chrI_annotation.gff -s bowtie_index/YeastGenome.fa \
-p 4 -o assembly/merged assembly/assemblies.txt
we’ll use the output from 1M reads, not 10K reads
Run cuffdiff
based on the merged transcripts and mapped reads for each sample
cuffdiff -o diff_expr -b bowtie_index/YeastGenome.fa \
-p 4 -u assembly/merged/merged.gtf \
mapping/wt1_thout/chrI.bam,mapping/wt2_thout/chrI.bam \
mapping/wt1X_thout/chrI.bam,mapping/wt2X_thout/chrI.bam
参数意义:
-o
: 输出文件夹-b
: Bowtie index-p
: 使用的线程数-u
: 转录本注释文件,一般使用 merge 得到的文件- 最后两行为每个样本的
.bam
文件,这里上一行为对照组,下一行为实验组,组内的不同样本用英文逗号分隔
This won’t generate sufficient result from 10K reads。You may view the results of 1M reads in 1M_reads_diff_expr
to get a better feeling of the above command.
由于 10k reads 结果太少,我们用 1M reads 的结果(已经预先跑好)来画图
Rscript plot_DE_chart.R
图片可在 1M_reads_diff_expr
中查看。
Tips: input/output file is hard-coded in
plot_DE_chart.R
, if you want to plot your own results, you need to editplot_DE_chart.R
(replace1M_reads_diff_expr/
with your own output directory ofcutdiff
).
-
提交差异表达的gene的 name,注明相关的fold change, p-value,FDR(q value) 等信息, 并提交绘制的 Volcano Plot。(基因列表和图最好放到一个word/pdf文件中提交)。
- 使用教程中的数据,操作时,跳过3a)(2) Extract,(直接用accepted_hits.bam 来进行 Assemble transcripts),要求找到差异表达的基因(满足$$\lvert log_2(fold\ change) \rvert$$ > 1, FDR < 0.05)。
- 差异表达的基因有的没有基因 name, 可以用 gene id 代替。
- 画 Volcano Plot 时,需要将所有的基因进行绘图,包括未差异表达的基因。画图用 Appenddix. Plot with R 中 7 Volcano plots 的代码画图。
-
寻找 differentially expressed genes/transcripts时要对数据做怎样的处理?这些处理有哪些统计学上的考虑?
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks Trapnell C et al. Nat Protoc. 2012
- Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown Pertea M et al. Nat Protoc. 2016