1.1.Data Introduction

{% hint style="info" %}

Data introduction

{% endhint %}

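The data we use mainly comprise samples from two cancer types and from healthy individuals: 99 Colorectal Cancer, 36 Prostate Cancer, and 50 Healthy Control samples. (The label file described in section 3 also lists 6 Pancreatic Cancer samples.) The shared data directory is /BioII/chenxupeng/student/ on the cnode server.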

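  • The data directory contains the prebuilt expression matrix together with the corresponding labels and annotations.
  • The other folders under data hold the files readers need to carry out mapping and build an expression matrix themselves for the five healthy-control samples Sample_N1, Sample_N7, Sample_N13, Sample_N19, and Sample_N25.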

1) Mapping-related files

Path: the hg38_index, raw_data, and RNA_index folders under /BioII/chenxupeng/student/data/.

| data | path |
| --- | --- |
| raw data | `/BioII/chenxupeng/student/data/raw_data/*.fastq` |
| hg38 | `/BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa` |
| gtf | `/BioII/chenxupeng/student/data/gtf` |
| RNA index | `/BioII/chenxupeng/student/data/RNA_index/` |
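Before starting the mapping, it can help to confirm which FASTQ files are actually present in the raw data folder. The following is a minimal sketch: `list_fastq` is a hypothetical helper, and the file names in the demo (`Sample_N1.fastq`, `Sample_N7.fastq`) are assumptions made for illustration, not names confirmed by this guide.

```python
import tempfile
from pathlib import Path


def list_fastq(raw_dir):
    """Return the sorted names of the *.fastq files directly under raw_dir."""
    return sorted(p.name for p in Path(raw_dir).glob("*.fastq"))


# Demo on a throwaway directory with made-up file names.
# On the cnode server, pass "/BioII/chenxupeng/student/data/raw_data" instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Sample_N1.fastq", "Sample_N7.fastq", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_files = list_fastq(tmp)

print(demo_files)
```

Only files ending in `.fastq` are returned, so unrelated files in the same folder are ignored.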

For details, see 11.1 Helps: Mapping Guide.

2) expression matrix

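Path: /BioII/chenxupeng/student/data/expression_matrix/

In the expression matrix, each row is a feature and each column is a sample. The data for three samples, Sample_N13, Sample_N19, and Sample_N25, have been removed; readers need to map these samples and build the expression matrix for them themselves (see 11.2 Requirement: Expression Matrix).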

```python
import pandas as pd
import numpy as np

# Load the comma-separated expression matrix; the first column holds the feature IDs.
scirepount = pd.read_table('data/expression_matrix/GSE71008.txt', sep=',', index_col=0)
# Preview the first five samples.
scirepount.iloc[:, :5].head()
```

```text
                                        Sample_1S10  Sample_1S11  Sample_1S12  Sample_1S13  Sample_1S14
transcript
ENST00000473358.1|MIR1302-2HG-202|1544            0            0            0            0            0
ENST00000469289.1|MIR1302-2HG-201|843             0            0            0            0            0
ENST00000466430.5|AL627309.1-201|31638            0            0            0            0            0
ENST00000471248.1|AL627309.1-203|18221            0            0            0            0            0
ENST00000610542.1|AL627309.1-205|12999            0            0            0            0            0
```

```python
# (number of features, number of samples)
scirepount.shape
```

```text
(89619, 188)
```
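To see which of the five healthy-control samples are present in the matrix and which still have to be produced by mapping, a check along the following lines could be used. This is only a sketch: `split_present_missing` is a hypothetical helper, and the toy matrix stands in for `GSE71008.txt` so that the example runs without the server files.

```python
import pandas as pd


def split_present_missing(matrix, samples):
    """Split sample IDs into those that are columns of matrix and those that are not."""
    present = [s for s in samples if s in matrix.columns]
    missing = [s for s in samples if s not in matrix.columns]
    return present, missing


# Toy stand-in for the real expression matrix (made-up counts).
toy_matrix = pd.DataFrame(
    {"Sample_N1": [0, 3], "Sample_N7": [1, 0], "Sample_1S10": [0, 0]},
    index=["feature_a", "feature_b"],
)

five_samples = ["Sample_N1", "Sample_N7", "Sample_N13", "Sample_N19", "Sample_N25"]
present, missing = split_present_missing(toy_matrix, five_samples)
print("present:", present)
print("missing:", missing)
```

With the real matrix loaded, pass it in place of `toy_matrix`; the samples reported as missing are the ones that still need to be mapped.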

3) sample labels

Path: /BioII/chenxupeng/student/data/labels

```python
# Load the sample labels; the first column (sample_id) becomes the index.
scirep_samplenames = pd.read_table('data/labels/scirep_classes.txt', delimiter=',', index_col=0)
scirep_samplenames.head()
```

```text
                         label
sample_id
Sample_1S3   Colorectal Cancer
Sample_1S6   Colorectal Cancer
Sample_1S9   Colorectal Cancer
Sample_1S12  Colorectal Cancer
Sample_1S15  Colorectal Cancer
```

```python
# The five healthy-control samples used in the mapping exercise.
delete_sample = ['Sample_N1', 'Sample_N7', 'Sample_N13', 'Sample_N19', 'Sample_N25']
# A subset of the five samples above.
check_sample = ['Sample_N1', 'Sample_N7']
# Count the number of samples in each class.
np.unique(scirep_samplenames['label'], return_counts=True)
```

```text
(array(['Colorectal Cancer', 'Healthy Control', 'Pancreatic Cancer',
        'Prostate Cancer'], dtype=object), array([99, 50,  6, 36]))
```
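Because the label file covers more samples than the expression matrix has columns, the labels usually need to be restricted to, and ordered like, the matrix columns before any analysis. The sketch below illustrates one way to do this; `labels_for_matrix` is a hypothetical helper and the toy data are made up so that the block runs on its own.

```python
import pandas as pd


def labels_for_matrix(matrix, labels):
    """Return the label rows for samples that are columns of matrix, in column order."""
    shared = [s for s in matrix.columns if s in labels.index]
    return labels.loc[shared]


# Toy stand-ins for the expression matrix and scirep_classes.txt (made-up values).
toy_matrix = pd.DataFrame(
    {"Sample_1S3": [0, 2], "Sample_1S6": [1, 0], "Sample_N1": [4, 4]},
    index=["feature_a", "feature_b"],
)
toy_labels = pd.DataFrame(
    {"label": ["Colorectal Cancer", "Colorectal Cancer",
               "Healthy Control", "Healthy Control"]},
    index=pd.Index(["Sample_1S3", "Sample_1S6", "Sample_N1", "Sample_N13"],
                   name="sample_id"),
)

aligned = labels_for_matrix(toy_matrix, toy_labels)
class_counts = aligned["label"].value_counts()
print(class_counts)
```

`value_counts` gives the same per-class counts as `np.unique(..., return_counts=True)`, but as a labeled Series.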

4) other annotations

Path: /BioII/chenxupeng/student/data/other_annotations

4a) gene annotation

A feature's transcript ID can be used to look up its transcript_name, gene_type, and other information.

```python
# Load the tab-separated annotation table and preview its first five columns.
geneannotation = pd.read_table('data/other_annotations/transcript_anno.txt')
geneannotation.iloc[:, :5].head()
```

```text
  chrom  start    end               name  score
0  chr1  14629  14657      piR-hsa-18438      0
1  chr1  17368  17436  ENSG00000278267.1      0
2  chr1  18535  18563       piR-hsa-7508      0
3  chr1  26805  26836      piR-hsa-23387      0
4  chr1  29553  31097  ENSG00000243485.5      0
```
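The row names of the expression matrix pack several fields into one string separated by `|`, so they have to be split before a transcript ID can be looked up in an annotation table. The sketch below shows one way to do that. `split_feature_ids` is a hypothetical helper, and reading the third field as the transcript length is an assumption based on the format of the IDs, not something this guide states.

```python
import pandas as pd


def split_feature_ids(feature_ids):
    """Split 'id|name|number' feature IDs into a three-column table."""
    feature_ids = list(feature_ids)
    rows = [fid.split("|") for fid in feature_ids]
    parts = pd.DataFrame(
        rows,
        columns=["transcript_id", "transcript_name", "length"],
        index=feature_ids,
    )
    # Assumption: the third field is an integer (interpreted here as a length).
    parts["length"] = parts["length"].astype(int)
    return parts


# Two feature IDs taken from the expression matrix preview.
example_ids = [
    "ENST00000473358.1|MIR1302-2HG-202|1544",
    "ENST00000469289.1|MIR1302-2HG-201|843",
]
parsed = split_feature_ids(example_ids)
print(parsed)
```

The helper expects exactly three fields per ID and raises an error otherwise, which makes unexpected ID formats visible early.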

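4b) batch information

Batch information records the different experimental conditions applied to different samples, such as processing time and differences in the specifications of the materials used. These conditions can cause large differences between samples of the same class, which is known as the batch effect.

For the exoRBase data, the samples of each cancer type come from a different laboratory, so batch coincides with sample class. For the scirep and hcc data, the batch information is as follows: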

```python
# Load the batch table; the first column (sample ID) becomes the index.
scirepbatch = pd.read_csv('data/other_annotations/scirep_batch.txt', index_col=0)
scirepbatch.head()
```

```text
            RNA Isolation batch  library prepration day  gel cut size selection
Sample_1S1                    2                      22                       7
Sample_1S2                    2                      22                       8
Sample_1S3                    2                      22                       1
Sample_2S1                    2                      22                       2
Sample_2S2                    2                      22                       3
```
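One simple way to see whether a batch variable is entangled with sample class is to cross-tabulate the two. The sketch below does this on toy data; `batch_by_class` is a hypothetical helper, and the batch values and labels assigned to the samples here are made up for illustration.

```python
import pandas as pd


def batch_by_class(batch, labels, batch_col):
    """Count samples per (batch value, class) for samples present in both tables."""
    joined = batch.join(labels, how="inner")
    return pd.crosstab(joined[batch_col], joined["label"])


sample_ids = ["Sample_1S1", "Sample_1S2", "Sample_1S3", "Sample_2S1"]
# Toy batch assignments and labels (not the real values).
toy_batch = pd.DataFrame({"RNA Isolation batch": [1, 1, 2, 2]}, index=sample_ids)
toy_labels = pd.DataFrame(
    {"label": ["Colorectal Cancer", "Colorectal Cancer",
               "Healthy Control", "Colorectal Cancer"]},
    index=sample_ids,
)

table = batch_by_class(toy_batch, toy_labels, "RNA Isolation batch")
print(table)
```

A table in which each batch contains only one class would indicate that batch and class cannot be separated, as described above for the exoRBase data.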

5) RNA type statistics

```python
# Load the per-sample read counts by RNA type; the first column becomes the index.
scireprnastats = pd.read_csv('data/other_annotations/scirep_rna_stats.txt', index_col=0)
scireprnastats.iloc[:, :5].head()
```

```text
           Sample_1S10  Sample_1S11  Sample_1S12  Sample_1S13  Sample_1S14
Y_RNA            88835       127497       145142        90106       105377
cleanN         9034303     10963430     11077344     10262615     11065325
hg38other      1462269      2044478      2624270      1476586      1806268
libSizeN      11362190     13437632     13905951     12271219     13619701
lncRNA           26733        38346        35639        25523        31489
```
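Raw counts are hard to compare across samples with different sequencing depth, so it is often more informative to express each row as a fraction of a per-sample total. The sketch below divides every row by the `libSizeN` row. `rna_fraction` is a hypothetical helper, and treating `libSizeN` as the total library size is an assumption based on its name.

```python
import pandas as pd


def rna_fraction(stats, total_row="libSizeN"):
    """Divide every row by the total_row row, sample by sample."""
    return stats.div(stats.loc[total_row], axis="columns")


# Values copied from the first two samples of the table above.
stats = pd.DataFrame(
    {
        "Sample_1S10": [88835, 11362190, 26733],
        "Sample_1S11": [127497, 13437632, 38346],
    },
    index=["Y_RNA", "libSizeN", "lncRNA"],
)

fractions = rna_fraction(stats)
print(fractions)
```

With the full table loaded, `rna_fraction(scireprnastats)` would give the same per-sample fractions for every RNA type.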