-
Notifications
You must be signed in to change notification settings - Fork 6
Oxford Nanopore Sequencing Data Processing
We use the following procedure to process data from Oxford Nanopore sequencing.
Reads from ONT .fast5 are converted to .FASTQ format using poretools. We choose to generate the "pass" and "fail" reads into separate individual .fastq file.
export RUN=2016-08-24-R9-WTD-R009
poretools fastq -q pass > "${RUN}.pass.fastq"
poretools fastq -q fail > "${RUN}.fail.fastq"
To facilitate visualization of read information in IGV browser, we re-assigned Read UUID with a human friendly read id and group the reads accordingly into four different files, namely .2D.fastq, .2Dsupport.fastq, .1D.fastq, and .strange.fastq.
The four separate files allow one to focus the analysis on 2D reads, the template and complement reads generated the 2D consensus reads (2Dsupport), all the 1D reads and finally any remaining reads that do not fall into the earlier groups. The last category (i.e. strange) file is generally empty these days, although there used to be a handful of reads in the earlier days.
export EXPERIMENT=WTD09
export RUN=2016-08-24-R9-WTD-R009
./picky.pl hashFq --pfile ${RUN}.pass.fastq --ffile ${RUN}.fail.fastq --oprefix ${EXPERIMENT}
# alternatively, to analyze pass read downstream only
./picky.pl hashFq --pfile ${RUN}.pass.fastq --oprefix ${EXPERIMENT}.Pass
# or to analyze only fail read downstream
./picky.pl hashFq --ffile ${RUN}.fail.fastq --oprefix ${EXPERIMENT}.Pass
LAST is used to generate possible read alignemnts against a reference genome. It is thus necessary to build the reference genome database to enable the alignment.
last-755/src/lastdb -v -P 1 hg19.lastdb hg19.fa 2>&1 \
| tee lastdb.v755.hg19.log
NOTE: In our experience, it takes 50 min to build hg19 on a single core with 16GB RAM.
NOTE: Lastal database is build once. You only need to re-build it if the sequence has changed.
LAST is used to generate the possible read alignments against a reference genome. Picky selectRep then sieves through the read alignments (.maf stream) to pick the best representative alignment.
export RUNTYPE=WTD09.2D.subset
last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 \
hg19.lastdb\${RUNTYPE}.fastq 2>${RUNTYPE}.lastal.log \
| ./picky.pl selectRep --thread 4 --preload 10 \
1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log
The direct streaming of last output to picky remove the need to store the gigantic intermediate .maf file. However, if you are troubleshooting or experimenting with other lastal alignment parameters and have sufficient storage, you may keep the .maf file and use the following commands instead.
last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 hg19.lastdb ${RUNTYPE}.fastq \
1>${RUNTYPE}.maf 2>${RUNTYPE}.lastal.log
cat ${RUNTYPE}.maf \
| ./picky.pl selectRep --thread 4 --preload 6 \
1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log
NOTE: For .align format, see output documentation
Finally, the selected alignments for read are processed to identify harbored structural variants.
cat ${RUNTYPE}.align \
| ./picky.pl callSV \
--oprefix ${RUNTYPE}.sv \
--fastq ${RUNTYPE}.fastq \
--exclude=chrY --exclude=chrM \
--sam 2>${RUNTYPE}.callSV.log
Each structural variants are classified into the respectively files; namely deletion (.DEL.xls), insertion (.INS.ls), indel (.INDEL.xls), inversion (.INV.xls), translocation (.TTLC.xls), tandem duplication junction (.TDSR.xls), and tandem duplication complete read (.TDC.xls). Read not harboring any structural variant is recorded in .NONE.xls. Each aligned read fragment is recorded in .xls file.
NOTE: See output documentation for details.
File | Description |
---|---|
<oprefix>.profile.DEL.xls | tab-delimited file for deletions |
<oprefix>.profile.INS.xls | tab-delimited file for insertions |
<oprefix>.profile.INDEL.xls | tab-delimited file for possible co-insertion-and-deletion |
<oprefix>.profile.INV.xls | tab-delimited file for inversions |
<oprefix>.profile.TTLC.xls | tab-delimited file for translocations |
<oprefix>.profile.TDSR.xls | tab-delimited file for tandem duplications where read span the junction |
<oprefix>.profile.TDC.xls | tab-delimited file for tandem duplications where read completely cover the duplications |
<oprefix>.profile.xls | tab-delimited file for all read alignment segments |
Earlier ONT base-caller under-call bases in homopolymer region. This manifests as deletions and can be excluded by specifying the option "--removehomopolymerdeletion". It is necessary to provide the genome sequences (.FASTA file; option "--genome") used to construct the lastdb.
ONT base-caller has addressed this issue and you should NOT need this option (off by default) if your base-calling is done with the improved version of the base-caller.
If you are only interested in getting the list of structural variants and NOT the sam file, you may omit the "--sam" option.
Sometime, we may not be interested in SV identified on specific chromosomes. You may use the "--exclude" option. For instance, to exclude structural variant identified on mitochrondria, use the option "--exclude=chrM".
It is possible to convert the tab-delimited SVs .xls files to VCF format.
./picky.pl xls2vcf \
--xls ${RUNTYPE}.sv.profile.DEL.xls \
> ${RUNTYPE}.DEL.vcf
By default, only SVs with at least 2 read evidence support are reported. For low coverage data, one may wish to report even SVs that have single read support only.
./picky.pl xls2vcf \
--xls ${RUNTYPE}.sv.profile.DEL.xls \
--re 1 > ${RUNTYPE}.include.Singleton.DEL.vcf
It is also possible to have all the different types of SVs in a single vcf file.
export OPREFIX=${RUNTYPE}.sv
./picky.pl xls2vcf \
--xls ${OPREFIX}.profile.DEL.xls \
--xls ${OPREFIX}.profile.INS.xls \
--xls ${OPREFIX}.profile.INDEL.xls \
--xls ${OPREFIX}.profile.INV.xls \
--xls ${OPREFIX}.profile.TTLC.xls \
--xls ${OPREFIX}.profile.TDSR.xls \
--xls ${OPREFIX}.profile.TDC.xls \
> ${RUNTYPE}.allsv.vcf