{% hint style="info" %}
{% endhint %}
我们使用的数据主要包括两种癌症和正常人样本,其中Colorectal Cancer, Prostate Cancer和Healthy Control的样本数量分别为99,36和50。数据存放公共目录为cnode服务器的/BioII/chenxupeng/student/
目录。
data
目录下为已经建好的expression matrix,相应的label和annotationdata
目录下另外的文件夹中存放的文件是读者用于自己完成对五个正常人样本Sample_N1, Sample_N7, Sample_N13, Sample_N19, Sample_N25
进行mapping和创建expression matrix等操作的。
路径:包括/BioII/chenxupeng/student/data/
目录下的hg38_index
, raw_data
, RNA_index
文件夹。
data | path |
---|---|
raw data |
/BioII/chenxupeng/student/data/raw_data/*.fastq |
hg38 |
/BioII/chenxupeng/student/data/hg38_index/GRCh38.p10.genome.fa |
gtf |
/BioII/chenxupeng/student/data/gtf |
RNA index |
/BioII/chenxupeng/student/data/RNA_index/ |
具体内容参考 11.1 Helps: mapping指南
路径:/BioII/chenxupeng/student/data/expression_matrix/
expression matrix每一行为一个feature,每一列为一个样本,其中我们去掉了Sample_N13, Sample_N19, Sample_N25
三个样本的相应数据,需要读者自己完成mapping和构建expression matrix(详见 11.2 Requirement: Expression Matrix)。
import pandas as pd
import numpy as np
scirepount = pd.read_table('data/expression_matrix/GSE71008.txt',sep=',',index_col=0)
scirepount.iloc[:,:5].head()
Sample_1S10 | Sample_1S11 | Sample_1S12 | Sample_1S13 | Sample_1S14 | |
---|---|---|---|---|---|
transcript | |||||
ENST00000473358.1|MIR1302-2HG-202|1544 | 0 | 0 | 0 | 0 | 0 |
ENST00000469289.1|MIR1302-2HG-201|843 | 0 | 0 | 0 | 0 | 0 |
ENST00000466430.5|AL627309.1-201|31638 | 0 | 0 | 0 | 0 | 0 |
ENST00000471248.1|AL627309.1-203|18221 | 0 | 0 | 0 | 0 | 0 |
ENST00000610542.1|AL627309.1-205|12999 | 0 | 0 | 0 | 0 | 0 |
scirepount.shape
(89619, 188)
路径:/BioII/chenxupeng/student/data/labels
scirep_samplenames = pd.read_table('data/labels/scirep_classes.txt',delimiter=',' , index_col=0)
scirep_samplenames.head()
label | |
---|---|
sample_id | |
Sample_1S3 | Colorectal Cancer |
Sample_1S6 | Colorectal Cancer |
Sample_1S9 | Colorectal Cancer |
Sample_1S12 | Colorectal Cancer |
Sample_1S15 | Colorectal Cancer |
delete_sample = ['Sample_N1','Sample_N7','Sample_N13','Sample_N19','Sample_N25']
check_sample = ['Sample_N1','Sample_N7']
np.unique(scirep_samplenames['label'],return_counts=True)
(array(['Colorectal Cancer', 'Healthy Control', 'Pancreatic Cancer',
'Prostate Cancer'], dtype=object), array([99, 50, 6, 36]))
路径:/BioII/chenxupeng/student/data/other_annotations
可以通过feature的transcript id找到feature的transcript_name, gene_type等信息
geneannotation = pd.read_table('data/other_annotations/transcript_anno.txt')
geneannotation.iloc[:,:5].head()
chrom | start | end | name | score | |
---|---|---|---|---|---|
0 | chr1 | 14629 | 14657 | piR-hsa-18438 | 0 |
1 | chr1 | 17368 | 17436 | ENSG00000278267.1 | 0 |
2 | chr1 | 18535 | 18563 | piR-hsa-7508 | 0 |
3 | chr1 | 26805 | 26836 | piR-hsa-23387 | 0 |
4 | chr1 | 29553 | 31097 | ENSG00000243485.5 | 0 |
batch信息记录了对不同样本采取的不同实验条件,包括处理时间,处理材料的规格差异等,可能会造成同类样本的较大差异,称为batch effect。
对于exoRBase数据,每一种癌症样本均来自不同的实验室,因此其batch与样本类别重合。对于scirep数据和hcc数据,batch信息如下:
scirepbatch = pd.read_csv('data/other_annotations/scirep_batch.txt',index_col=0)
scirepbatch.head()
RNA Isolation batch | library prepration day | gel cut size selection | |
---|---|---|---|
Sample_1S1 | 2 | 22 | 7 |
Sample_1S2 | 2 | 22 | 8 |
Sample_1S3 | 2 | 22 | 1 |
Sample_2S1 | 2 | 22 | 2 |
Sample_2S2 | 2 | 22 | 3 |
scireprnastats = pd.read_csv('data/other_annotations/scirep_rna_stats.txt',index_col=0)
scireprnastats.iloc[:,:5].head()
Sample_1S10 | Sample_1S11 | Sample_1S12 | Sample_1S13 | Sample_1S14 | |
---|---|---|---|---|---|
Y_RNA | 88835 | 127497 | 145142 | 90106 | 105377 |
cleanN | 9034303 | 10963430 | 11077344 | 10262615 | 11065325 |
hg38other | 1462269 | 2044478 | 2624270 | 1476586 | 1806268 |
libSizeN | 11362190 | 13437632 | 13905951 | 12271219 | 13619701 |
lncRNA | 26733 | 38346 | 35639 | 25523 | 31489 |