Skip to content

Two-species genome comparison pipeline (combines minimap2/SyRI, EDTA, bedtools) to identify structural variants (inv/syn), annotate TE, and plot results in R.

Notifications You must be signed in to change notification settings

kaede0e/two-species-genome-comparison-pipeline

Repository files navigation

Comparative genomics: Two closely related species genome alignment, structural variant detection & CDS/TE overlap analysis pipeline

Two-species genome comparison pipeline (combines minimap2/SyRI, EDTA, bedtools) to identify structural variants (inv/syn), annotate and analyze overlaps with coding sequence and TE, and plot results in R.
Used in Hirabayashi & Owens (2023)

Genome alignment and variant detection

  • Genome alignment is useful because it tells you how the two species are related at genomic level.
  • Relevant for phylogenetics, evolutionary analysis, population genetics etc.
  • Genomic variants include large-scale structural variations (inversions, translocations, etc.) that are not captured by genome-wide genetic polymorphism scans like SNPs and InDels identifications. Whole genome alignment has advantages to capture synonymous changes which are more common than nonsynonymous, interesting genomic variants that affect the genome evolution in a non-traditional central dogma (Transposable elements or selfish elements that don’t code for particular genes but affects the organism’s phenotype, etc.)
  • Each step in this pipeline can be performed individually as long as you have the proper input data. But as a whole pipeline, it helps you to produce a readily useful statistics that indicates interesting information about your species of interest at genomic level.
  • Widely applicable to any organisms you’re interested in. No upper or lower limit in terms of genome complexity (theoretically), though the more complicated the genome the more computational resource it requires to run.

Limitation of this pipeline

  • relies on the quality of genome you have. Only chromosome-resolved genomes can be used as the input.
  • The two species of interest must contain sufficient genomic region that is syntenic
  • certain steps require appropriate amount of computational resource as shown:
Input genome (size) Step time required (hrs) CPU memory (GB)
1 Prunus (202Mbp) minimap2+SyRI 5 10
EDTA 36* 5.5
2 Arachis (1.01Gbp) minimap2+SyRI 8.7 72
EDTA 142* 18.9

*Wall-clock time using 16 CPU, run by chromosome-by-chromosome.

General pipeline structure:

Picture1

Dependencies/Installation

The pipeline requires following software packages to run:

Other dependencies (in Compute Canada clusters):

Run the pipeline

  1. Collect genomes (fasta) of two species (referred to as reference genome = refgenome, and query genome = qrygenome) and gene annotation file (gff3 or bed) for reference genome
  2. 01_preprocess_input: Pre-process genomes to remove scaffolds from genome assembly and convert gene annotation file into bed format containing only coding sequence (CDS)
  3. 02_genome_alignment_SV_detection: Align refgenome with qrygenome by minimap2 and performs structural variant detection using SyRI. Visualize in R Studio and infer divergence score based on sequence identity
  4. 03_divergence_analysis: Calculate species divergence score and vsualize syntenic region and inversion stats.
  5. 04_de_novo_TE_annotation: Identify and annotate the input genome with transposable elements (TEs) de novo.
  6. 05_CDS_TE_overlap_analysis: Count how many coding sequences (CDS) or TEs are found and how many basepairs of these are overlapping with inversions or syntenic regions.

About

Two-species genome comparison pipeline (combines minimap2/SyRI, EDTA, bedtools) to identify structural variants (inv/syn), annotate TE, and plot results in R.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published