Complementary Frieseomelitta varia genome size estimates

In order to execute genome size estimation based ok k-mer counting we need to execute Jellyfish (run_jellyfish.sh), but first we concatenated all the R1 files into one file (FVARIA_R1.fq) and all the R2 files into another file (FVARIA_R2.fq):

cat ./raw/*_R1_*.fastq > FVARIA_R1.fq
cat ./raw/*_R2_*.fastq > FVARIA_R2.fq

./run_jellyfish.sh

Jellyfish was executed using a range of k-mer sizes, from 19 to 63 with an increment of 4. The Jellyfish histogram of k-mer counts was also created with run_jellyfish.sh.

1st method

Genome size was estimated from all fastq reads by k-mer counting using Jellyfish v. 2.2.0. The genome size estimation was based on GenomeScope analysis.

for i in `ls *.histo`; do k=`basename ${i} .histo | sed 's/mer_out//'`; genomescope.R ${i} ${k} 101 ./GenomeScope${k}; done

cat ./GenomeScope*/summary.txt | grep 'Genome Haploid Length' | perl -F"\s{2,}" -lane 'INIT { $min=0; $max=0; $n=0; } $n++; $F[1]=~s/\D+//g; $F[2]=~s/\D+//g; $min+=$F[1]; $max+=$F[2]; END { print $F[0],"\t",sprintf("%.2f",($min/$n)/1000000)," Mbp","\t",sprintf("%.2f", ($max/$n)/1000000)," Mbp"; }'

Output:

Genome Haploid Length   345.77 Mbp      345.90 Mbp

2st method

Genome size was also estimated from all fastq reads by k-mer counting using Jellyfish v. 2.2.0. The genome size estimation was based on Genome Size Estimation Tutorial of Computational Biology Core of Institute for Systems Biology at University of Connecticut. We create an R script (genomeSize.R) to automate the steps of this tutorial, including the identification and trimming the k-mers with frequencies less than the frequency of the first valley identified in the histogram of k-mer counts obtained with Jellyfish. The peak and the first valley was identified using findPeaks function. The genomeSize.R was executed for a range o k-mer sizes and the average was calculated:

perl -lane 'INIT {my $sum=0; my $n=0;} $n++; $sum+=$_; END { print "Genome size: ".sprintf("%.2f",($sum/$n))." Mbp"; }' <(for i in `ls *.histo`; do  ./genomeSize.R --in=./${i} | grep 'Genome size' | sed 's/Genome size: //' | sed 's/ Mb//'; done)

Output:

Genome size: 389.51 Mbp

3nd method

Genome size was also estimated from all fastq reads by k-mer counting using KMC v. 3.0.0. To do this we set up kmc with the following parameters: to consider kmers of size 25 (-k25); to discard kmers with frequencies less than 5 (-ci5); to use 200 bins (-n200); and to process all using 20 threads (-t20). The file named "files.lst" contains the path to all fastq files. The number of unique counted k-mers was divided by 1,000,000 to obtain the estimated genome size in Mbp.

kmc -k25 -ci5 -n200 -t20 @files.lst result /tmp

Output:

Estimated Genome size (kmc): 381.41 Mbp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genome_estimates.md

genome_estimates.md

Complementary Frieseomelitta varia genome size estimates

Files

genome_estimates.md

Latest commit

History

genome_estimates.md

File metadata and controls

Complementary Frieseomelitta varia genome size estimates