█████████ █████████ ███████████ █████████
███░░░░░███ ███░░░░░███░░███░░░░░███ ███░░░░░███
███ ░░░ ███ ░░░ ░███ ░███░███ ░░░
░███ ░███ ░██████████ ░░█████████
░███ █████░███ █████ ░███░░░░░███ ░░░░░░░░███
░░███ ░░███ ░░███ ░░███ ░███ ░███ ███ ░███
░░█████████ ░░█████████ ███████████ ░░█████████
░░░░░░░░░ ░░░░░░░░░ ░░░░░░░░░░░ ░░░░░░░░░
GGBS is the first benchmark suite implemented for sequence-to-graph alignment in genomic analysis. It includes multiple state-of-the-art tools and six different genome graphs built with VG Toolkit [1]. The works cited in footnotes 2 to 8 describe the included tools.
Select which tools and versions you want to use in tools_config.yml.
Requirement: Docker >= 24.0
First, get the repo:
git clone https://github.com/Mirkocoggi/GGBS.git
Then move into the repository folder:
cd GGBS
The first step is to create a Dockerfiles folder containing one subfolder per tool, each holding that tool's Dockerfile. Run:
python make_dockerfiles.py
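As a quick sanity check after generation, the layout described above (one subfolder per tool, each with its Dockerfile) can be verified with a few lines of Python. This is only a sketch: the helper name `find_missing_dockerfiles` is hypothetical, and it assumes each generated file is named exactly `Dockerfile`.

```python
from pathlib import Path


def find_missing_dockerfiles(dockerfiles_dir="Dockerfiles"):
    """Return the names of tool subfolders that do not contain a Dockerfile.

    Assumption: each tool subfolder holds a file named exactly "Dockerfile".
    """
    root = Path(dockerfiles_dir)
    if not root.is_dir():
        raise FileNotFoundError(
            f"{root} does not exist; run make_dockerfiles.py first"
        )
    missing = []
    # Visit tool subfolders in a stable (alphabetical) order.
    for tool_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (tool_dir / "Dockerfile").is_file():
            missing.append(tool_dir.name)
    return missing
```

Calling `find_missing_dockerfiles()` from the repository root should return an empty list if every tool subfolder was populated.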
The input_data folder contains the genome graphs and sequence reads to be aligned.
Each alignment experiment has its own folder, comprising two subfolders:
- GRAPH: contains the input graph in GFA format;
- READS: contains the sequence reads in FASTA/Q format.
The experiment folders are grouped into two higher-level folders:
- TEST: contains the alignments to be executed with the previously selected tools;
- IGNORE: contains alignment experiments that should not be included in the evaluation.
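To make the folder layout concrete, here is a minimal Python sketch that lists the experiments in one group together with the files found in their GRAPH and READS subfolders. It assumes the layout `input_data/<TEST|IGNORE>/<experiment>/{GRAPH,READS}` implied by the description above. The function names and the file extensions (`.gfa`, `.fa`, `.fasta`, `.fq`, `.fastq`) are assumptions for illustration, not part of GGBS.

```python
from pathlib import Path

# Assumed extensions: the README only states "GFA" and "FASTA/Q".
# Compressed files (for example .fq.gz) are not matched by this sketch.
GRAPH_EXTS = {".gfa"}
READ_EXTS = {".fa", ".fasta", ".fq", ".fastq"}


def _files_with_ext(folder, extensions):
    """List file names in folder whose suffix is in extensions.

    Returns an empty list if the folder does not exist.
    """
    if not folder.is_dir():
        return []
    return sorted(
        f.name
        for f in folder.iterdir()
        if f.is_file() and f.suffix.lower() in extensions
    )


def list_experiments(input_dir="input_data", group="TEST"):
    """Return (experiment_name, graph_files, read_files) for each experiment."""
    group_dir = Path(input_dir) / group
    experiments = []
    if not group_dir.is_dir():
        return experiments
    for exp in sorted(p for p in group_dir.iterdir() if p.is_dir()):
        graphs = _files_with_ext(exp / "GRAPH", GRAPH_EXTS)
        reads = _files_with_ext(exp / "READS", READ_EXTS)
        experiments.append((exp.name, graphs, reads))
    return experiments
```

For example, `list_experiments("input_data", "IGNORE")` would show which experiments are currently excluded from the evaluation. An experiment whose graph or read list comes back empty is probably incomplete.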
Create the docker-compose.yml file, which builds all the Docker images and executes all the experiments. Run:
python make_dockercompose.py
Run the experiments with the command:
docker compose up
Results are saved in the results folder, where a subdirectory named with a timestamp is created for each experiment.
To collect all the execution times, you can generate a summary_timing.csv file by running:
python utils/timing.py results/<timestamp>
The summary file is generated in results/<timestamp>/summary_timing.csv.
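When several runs have accumulated, it can help to locate the newest subdirectory of results before calling the timing script. The sketch below picks the most recently modified subdirectory. It relies on filesystem modification times rather than on parsing the folder name, because the timestamp format is not specified here, and the helper name `latest_results_dir` is hypothetical.

```python
from pathlib import Path


def latest_results_dir(results_dir="results"):
    """Return the most recently modified subdirectory of results_dir.

    Returns None if results_dir is missing or has no subdirectories.
    Heuristic: uses modification time, not the timestamp in the folder name.
    """
    root = Path(results_dir)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    if not subdirs:
        return None
    return max(subdirs, key=lambda p: p.stat().st_mtime)
```

The returned path can then be passed to the script, as in `python utils/timing.py <path>`. Note that writing files into an older run folder updates its modification time, so this heuristic may pick that folder instead of the newest run.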
Footnotes
1. E. Garrison, J. Sirén, A. M. Novak, G. Hickey, J. M. Eizenga, E. T. Dawson, W. Jones, S. Garg, C. Markello, M. F. Lin et al., “Variation graph toolkit improves read mapping by representing genetic variation in the reference,” Nature Biotechnology, vol. 36, no. 9, pp. 875–879, 2018.
2. P. Ivanov, B. Bichsel, and M. Vechev, “Fast and optimal sequence-to-graph alignment guided by seeds,” in International Conference on Research in Computational Molecular Biology. Springer, 2022, pp. 306–325.
3. M. Rautiainen, V. Mäkinen, and T. Marschall, “Bit-parallel sequence-to-graph alignment,” Bioinformatics, vol. 35, no. 19, 2019.
4. H. Zhang, S. Wu, S. Aluru, and H. Li, “Fast sequence to graph alignment using the graph wavefront algorithm,” arXiv preprint arXiv:2206.13574, 2022.
5. C. Jain, H. Zhang, Y. Gao, and S. Aluru, “On the complexity of sequence-to-graph alignment,” Journal of Computational Biology, vol. 27, no. 4, pp. 640–654, 2020.
6. V. N. S. Kavya, K. Tayal, R. Srinivasan, and N. Sivadasan, “Sequence alignment on directed graphs,” Journal of Computational Biology, vol. 26, no. 1, pp. 53–67, 2019.
7. J. Sirén, J. Monlong, X. Chang, A. M. Novak, J. M. Eizenga, C. Markello, J. A. Sibbesen, G. Hickey, P.-C. Chang, A. Carroll et al., “Genotyping common, large structural variations in 5,202 genomes using pangenomes, the giraffe mapper, and the vg toolkit,” bioRxiv, pp. 2020–12, 2020.
8. T. Büchler, J. Olbrich, and E. Ohlebusch, “Efficient short read mapping to a pangenome that is represented by a graph of ed strings,” Bioinformatics, vol. 39, no. 5, p. btad320, 2023.