This protocal was used for sequencing data quality control and preprocessing. We showed how to use common tools in quality control and preprocessing of sequencing reads.
The specific purposes of the dir system were showed as follows:
1. Input: We store the partial raw input for user to test the sample data quickly without additional data downloading. The data in input directory are incomplete, user can download the complete data in Data access.
2. Output: The final output results of each tools in workflow.
3. Workflow: Step by step pipeline.
4. README.md: In readme file, user can learn basic information for data access and tool usages.
5. graphs: Figures in analysis.
6. LICENSE: The copyright file.
Here, we provide the sample data for quality control and preprocessing of sequencing reads, and the softwares used in pipeline were provided with their download links. The complete smaple data can be accessed as follows: SRR2061397_1.fastq.gz, SRR2061397_2.fastq.gz, SRR2061398_1.fastq.gz, SRR2061398_2.fastq.gz.
Those data are about Arabidopsis thaliana response to cytokinin in roots and shoots, which aims to identify genes which are differentially expressed in root and/or shoot material in response to exogenous cytokinin.pbaaims to identify genes which are differentially expressed in root and/or shoot material in response to exogenous cytokinin.
The example data used here is the paired-end fastq file generated by using Illumina platform.
- R1 FASTQ file:
Input/SRR2061397_1.fastq
- R2 FASTQ file:
Input/SRR2061397_2.fastq
Each entry in a FASTQ files consists of 4 lines:
- A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary by based on the BCL to FASTQ conversion software used.
- The sequence (the base calls; A, C, T, G and N).
- A separator, which is simply a plus (+) sign.
- The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
The first entry of the input data:
@SRR2061397.1 UNC10-SN254:519:H8PAHADXX:2:1101:1390:2053/1
ATTTTCTCATTCTCTCCTACAACAAGTTTCCGCCGCTAAGCCACTCACTACGCAGCCATGGCAAATCCTGACCATGTTTCAGATGTGCTACATAATTTGT
+
@@@FFDFFGD<FHHIGBGGGHGEGC9<ACF<@FHIFDGIEDG3?BBH=DHHHGE;BHFHCBDD>DECCC>>CC;(;@@CAACA>C@C>>;@ACA@CC:@:
The prequisite softwares can be obtained by visiting their released website. For example,
User can install and use those softwares with linux-like system.
By integrating those softwares, we could finish the quality control and preprocessing for high-throughput data in multiple way. In addition, we provide the simple usage of those software for the users.
- Note that you have to normalize the path in the shell script.
sh Workflow/1_run_fastqc.sh
- Note that you have to normalize the path in the shell script.
sh Workflow/2_run_iTools.sh
sh Workflow/3_run_cutadapt.sh
Step4: run fastp to perform quality control, adapter trimming, quality filtering and per-read quality pruning
sh Workflow/4_run_fastp.sh fastp
sh Workflow/5_run_fastx_clipper.sh
Fastqc Result for SRR2061397_1 showed the distribution of quality score for each base: Fastqc Result for SRR2061397_2 showed the distribution of quality score for each base: Fastp Result for SRR2061397_1 showed the distribution of base content ratios across read: Fastp Result for SRR2061397_2 showed the distribution of base content ratios across read: iTools Result for SRR2061397_1 showed the basic information of reads, sucha read number, read length, GC content, and base content:
##SRR2061397_1.fastq##
#ReadNum: 25000 BaseNum: 2500000 ReadLeng: 100
#GC%: 44.71% AT%: 55.29%
#A BaseNum: 692018 27.68%
#C BaseNum: 574981 23.00%
#T BaseNum: 690226 27.61%
#G BaseNum: 542734 21.71%
#N BaseNum: 41 0.00%
#BaseQ:0--10 : 23575(0.94%) >Q10: 99.06%
#BaseQ:10--20 : 13795(0.55%) >Q20: 98.51%
#BaseQ:20--30 : 75048(3.00%) >Q30: 95.50%
#BaseQ:30--40 : 1471835(58.87%) >Q40: 36.63%
#BaseQ:40--50 : 915747(36.63%) >Q50: 0.00%
#ReadQ:0--10 : 22(0.09%) >Q10: 99.91%
#ReadQ:10--20 : 205(0.82%) >Q20: 99.09%
#ReadQ:20--30 : 795(3.18%) >Q30: 95.91%
#ReadQ:30--40 : 23977(95.91%) >Q40: 0.00%
#ReadQ:40--50 : 1(0.00%) >Q50: 0.00%
##SRR2061397_2.fastq##
#ReadNum: 25000 BaseNum: 2500000 ReadLeng: 100
#GC%: 44.71% AT%: 55.18%
#A BaseNum: 689144 27.57%
#C BaseNum: 569646 22.79%
#T BaseNum: 690456 27.62%
#G BaseNum: 548221 21.93%
#N BaseNum: 2533 0.10%
#BaseQ:0--10 : 75065(3.00%) >Q10: 97.00%
#BaseQ:10--20 : 23194(0.93%) >Q20: 96.07%
#BaseQ:20--30 : 105595(4.22%) >Q30: 91.85%
#BaseQ:30--40 : 1456988(58.28%) >Q40: 33.57%
#BaseQ:40--50 : 839158(33.57%) >Q50: 0.00%
#ReadQ:0--10 : 365(1.46%) >Q10: 98.54%
#ReadQ:10--20 : 465(1.86%) >Q20: 96.68%
#ReadQ:20--30 : 1203(4.81%) >Q30: 91.87%
#ReadQ:30--40 : 22967(91.87%) >Q40: 0.00%
Cutadapt Result showed the summary of output:
===SRR2061397 Summary ===
Total read pairs processed: 25,000
Read 1 with adapter: 31 (0.1%)
Read 2 with adapter: 30 (0.1%)
Pairs that were too short: 1,038 (4.2%)
Pairs written (passing filters): 23,962 (95.8%)
Total basepairs processed: 5,000,000 bp
Read 1: 2,500,000 bp
Read 2: 2,500,000 bp
Quality-trimmed: 222,267 bp (4.4%)
Read 1: 67,386 bp
Read 2: 154,881 bp
Total written (filtered): 4,700,683 bp (94.0%)
Read 1: 2,365,705 bp
Read 2: 2,334,978 bp
It is a free and open source software, licensed under GPLv3