Genome assembly and SV analysis pipelines for Drosophila pseudoobscura (ST) and Drosophila persimilis (M40) - Extended methods.
This section describes the genome assembly pipeline used in the paper for D. pseudoobscura (ST) and D. persimilis (M40). The pipeline builds a hybrid assembly from Illumina short reads and PacBio CLR reads.
Generation of a PacBio-only, gap-filled and polished assembly: the initial PacBio-only assembly was generated directly by Pacific Biosciences using HGAP-Arrow with default parameters (see the HGAP GitHub page for details). Assembly statistics for each species are in Supplementary_Tables.xlsx (Table S13).
- Gap-filling of the PacBio-only assemblies using PbJelly.
Commands (PBJelly stages, run in order):
mapping jellyProtocol.xml
support jellyProtocol.xml
extraction jellyProtocol.xml
assembly jellyProtocol.xml -x --nproc=8
NOTE: Please refer to jellyProtocol.xml for full parameter details.
- Polishing using Pilon
bwa mem -t 5 pbjelly.fasta file_1.fastq file_2.fastq > pbjelly.sam 2> stderror.txt
samtools view -bS pbjelly.sam > pbjelly.bam 2> stderror_samtoolsview.txt
samtools sort -@ 5 -o pbjelly_sorted.bam pbjelly.bam 2> stderror_samtoolssort.txt
samtools view -b -F 12 pbjelly_sorted.bam > pbjelly_mapped_sorted.bam 2> stderror_samtoolsmap.txt
samtools index pbjelly_mapped_sorted.bam 2> stderror_samtoolsindex.txt
java -Xmx30G -jar pilon-1.22.jar --genome pbjelly.fasta --frags pbjelly_mapped_sorted.bam --threads 5 --changes 2> stderror_pilon.txt
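The `-F 12` filter in the samtools step above drops any alignment whose SAM FLAG has bit 0x4 (read unmapped) or bit 0x8 (mate unmapped) set, so only pairs with both mates mapped are passed to Pilon. The following is a minimal sketch of that bitmask test in plain shell arithmetic; the function name `keeps_pair` and the example FLAG values are illustrative assumptions, not part of the original pipeline.

```shell
# keeps_pair FLAG
# Succeeds when neither 0x4 (read unmapped) nor 0x8 (mate unmapped) is set,
# i.e. when `samtools view -F 12` would keep the alignment.
# Hypothetical helper for illustration only.
keeps_pair() {
    [ $(( $1 & 12 )) -eq 0 ]
}

# Illustrative FLAG values:
#   99, 147 : properly paired mates, both mapped
#   73      : read mapped, mate unmapped (0x8)
#   133     : read unmapped (0x4)
#   77      : read and mate both unmapped (0x4 + 0x8)
for flag in 99 147 73 133 77; do
    if keeps_pair "$flag"; then
        echo "$flag kept"
    else
        echo "$flag dropped"
    fi
done
```

Only FLAG values 99 and 147 pass the test in this example; the other three are removed by the filter.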
Generation of a hybrid assembly from PacBio CLR reads and Illumina paired-end reads with DBG2OLC.
SparseAssembler LD 0 k 51 g 15 NodeCovTh 1 EdgeCovTh 0 GS 171281433 i1 file_1.fastq o1 file_2.fastq > sparseAssembler_test1.log
SparseAssembler LD 1 k 51 g 15 NodeCovTh 2 EdgeCovTh 1 GS 171281433 i1 st_1.fastq o1 st_2.fastq > sparseAssembler_test1.log
DBG2OLC k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.002 LD1 0 MinLen 200 Contigs Contigs.txt RemoveChimera 1 f pacbio.fasta
Final scaffolding step using quickmerge
nucmer -l 100 -p out -t 10 pacbiopolished.fasta hybrid.fasta 2> stderror_nucmer.txt
delta-filter -i 95 -r -q > 2> stderror_deltafilter.txt
quickmerge -d -q hybrid.fasta -r pacbiopolished.fasta -hco 5.0 -c 1.5 -l n -ml m 2> stderror_quickmerge.txt
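The `-l n` and `-ml m` values in the quickmerge command above are length cutoffs that have to be chosen per assembly. As far as we are aware, the quickmerge documentation suggests starting `-l` near the N50 of the self (here PacBio-only) assembly; please verify this against the version you use. The sketch below shows one way to compute N50 from a FASTA file with awk and sort. The function names `contig_lengths` and `n50` and the toy sequences are assumptions for illustration, not part of the original pipeline.

```shell
# contig_lengths FASTA: print one sequence length per line (handles wrapped sequences).
contig_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# n50 FASTA: length of the contig at which the running sum of lengths,
# taken from longest to shortest, first reaches half of the assembly size.
n50() {
    contig_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { print lens[i]; exit }
            }
        }'
}

# Toy example with contigs of 8, 6, 4 and 2 bp (20 bp in total).
demo=$(mktemp)
printf '>c1\nACGTACGT\n>c2\nACG\nTTT\n>c3\nACGT\n>c4\nAC\n' > "$demo"
n50 "$demo"   # prints 6
rm -f "$demo"
```

On a real assembly the call would be, for example, `n50 pacbiopolished.fasta`, and the result can serve as a first candidate for `-l`.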
This section describes the approach used for the structural variation (SV) analysis, which includes INDEL and CNV calling from CLR reads and whole-genome alignments.
SV calling using svim. Pipeline is contained in the file.
SV calling using svmu. The following commands were used for all pairwise species comparisons.
nucmer -t 10 --maxmatch reference_genome.fasta query_genome.fasta
lastz reference_genome.fasta[multiple] query_genome.fasta[multiple] --chain --format=general:name1,strand1,start1,end1,name2,strand2,start2,end2 > file_lastz.txt
svmu reference_genome.fasta query_genome.fasta 5 l file_lastz.txt file-svmu
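The lastz command above writes a tab-separated table with the eight requested fields (name1, strand1, start1, end1, name2, strand2, start2, end2) and a header line starting with `#`. A quick sanity check before running svmu is to count alignment blocks per reference sequence and how many of them fall on the minus strand of the query. The sketch below assumes that column order; the function name `summarise_lastz` and the toy rows are hypothetical.

```shell
# summarise_lastz FILE
# For each reference sequence (column 1) print:
#   name <TAB> number of alignment blocks <TAB> blocks with query strand "-"
# Assumes the column order requested with --format=general in the lastz call.
summarise_lastz() {
    awk -F '\t' '
        /^#/ { next }                          # skip the header line
        { blocks[$1]++; if ($6 == "-") minus[$1]++ }
        END {
            for (name in blocks)
                printf "%s\t%d\t%d\n", name, blocks[name], minus[name] + 0
        }' "$1" | sort
}

# Toy input: two blocks on chr2 (one inverted), one block on chrX.
demo=$(mktemp)
printf '#name1\tstrand1\tstart1\tend1\tname2\tstrand2\tstart2\tend2\n' > "$demo"
printf 'chr2\t+\t100\t200\tctgA\t+\t50\t150\n' >> "$demo"
printf 'chr2\t+\t300\t400\tctgA\t-\t10\t110\n' >> "$demo"
printf 'chrX\t+\t1\t50\tctgB\t+\t1\t50\n' >> "$demo"
summarise_lastz "$demo"
rm -f "$demo"
```

For the pipeline output the call would be `summarise_lastz file_lastz.txt`.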
SV polarization of D. pseudoobscura and D. persimilis variants with D. miranda. The overall approach and scripts are available in the SV folder.
SV calling using syri for each species pair.
minimap2 -ax asm5 -t20 --eqx refgenome qrygenome > out.sam
python3 syri -c out.sam -r refgenome -q qrygenome -k -F S
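Because the two commands above are repeated for each species pair, it can help to generate them in a loop. The sketch below is a dry run that only prints the minimap2 and syri commands for every unordered pair of genome files, using the same flags as above. The function name `print_pair_commands`, the genome file names, the per-pair SAM names, and the choice of which genome acts as reference are all assumptions for illustration.

```shell
# print_pair_commands GENOME.fasta [GENOME.fasta ...]
# Dry run: prints (does not execute) the alignment and syri commands
# for every unordered pair of the given genomes. The earlier argument
# of each pair is treated as the reference (an assumption).
print_pair_commands() {
    while [ "$#" -gt 1 ]; do
        ref=$1
        shift
        for qry in "$@"; do
            tag="$(basename "$ref" .fasta)_vs_$(basename "$qry" .fasta)"
            echo "minimap2 -ax asm5 -t20 --eqx $ref $qry > $tag.sam"
            echo "python3 syri -c $tag.sam -r $ref -q $qry -k -F S"
        done
    done
}

# Hypothetical file names for the two assemblies described here.
print_pair_commands st.fasta m40.fasta
```

Since the printed syri commands use default output names, running each pair in its own working directory avoids one run overwriting the results of another.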
Please refer to the for the complete pipeline.