PyTorch repository for the paper "Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graph".
Install the required packages using the following command:
conda env create -f environment.yml
conda activate metagenomic_representation_learning
The Edit Distance Approximation task is initialized using one of the embeddings and then fine-tuned with a single linear layer.
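For reference, the quantity this task approximates can be computed exactly with dynamic programming. The sketch below assumes the standard Levenshtein definition of edit distance (unit-cost insertions, deletions, and substitutions). The function name `edit_distance` is illustrative and not part of this repository.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGTAC", "AGTTAC"))
```

The exact computation is quadratic in sequence length, which is the usual motivation for learning an embedding whose distances approximate it.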
Without minibatching:
# k = 3
python -m src.train.main_editDistance --data datasets/edit_distance/edit_qiita_large.pkl --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 3 --representation_small_k 2 --representation_ss_hidden_channels 32_DB,32_KF0 --representation_ss_last_layer_edge_type DB --representation_size 32 --model_class mlp
# k = 4
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn --representation_k 4 --representation_small_k 2,3 --representation_ss_task CL --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 2000 --representation_size 64 --model_class mlp
# k = 5
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn --representation_k 5 --representation_small_k 2,3,4 --representation_ss_task CL --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 1000 --representation_size 64 --model_class mlp
# k = 6
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn --representation_k 6 --representation_small_k 2,5 --representation_ss_task CL --representation_ss_hidden_channels 128_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 500 --representation_size 64 --representation_ss_edges_keep_top_k 0.01 --model_class mlp
With minibatching:
# k = 7
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn_miniBatch --representation_k 7 --representation_small_k 2,5 --representation_ss_task CL --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 100 --representation_size 64 --representation_ss_edges_keep_top_k 0.01 --representation_ss_edges_threshold 0.8 --representation_ss_batch_size 1024 --model_class mlp
# k = 8
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn_miniBatch --representation_k 8 --representation_small_k 2,5 --representation_ss_task CL --representation_ss_hidden_channels 32_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 100 --representation_size 32 --representation_ss_edges_keep_top_k 0.01 --representation_ss_edges_threshold 0.8 --representation_ss_batch_size 256 --model_class mlp
For the full list of hyperparameters, see src/train/parsers.py.
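When running several of the configurations above, it can help to assemble the command line from a dictionary of flags rather than editing long strings by hand. The following is only a sketch: `build_command` and `K4_FLAGS` are hypothetical names introduced here, and the flag values are copied from the k = 4 command above.

```python
import shlex


def build_command(data_path: str, flags: dict) -> str:
    """Assemble a main_editDistance invocation from a flag dictionary."""
    argv = ["python", "-m", "src.train.main_editDistance", "--data", data_path]
    for name, value in flags.items():
        argv.append(f"--{name}")
        # A value of None marks a boolean switch that takes no argument.
        if value is not None:
            argv.append(str(value))
    return shlex.join(argv)


# Values copied from the k = 4 command in this README.
K4_FLAGS = {
    "representation_method": "kmer_ssgnn",
    "representation_k": 4,
    "representation_small_k": "2,3",
    "representation_ss_task": "CL",
    "representation_ss_hidden_channels": "64_KF0",
    "representation_ss_last_layer_edge_type": "DB",
    "representation_ss_epochs": 2000,
    "representation_size": 64,
    "model_class": "mlp",
}

if __name__ == "__main__":
    print(build_command("path_to_dataset", K4_FLAGS))
```

The printed string can be pasted into a shell or passed to a job scheduler; the sketch does not launch anything itself.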
Use the --data flag to specify the dataset path, and the --representation_k flag for the desired k value.
# OneHot
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_onehot --representation_k 3 --model_class mlp
# Word2Vec
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_word2vec --representation_k 3 --model_class mlp
# Node2Vec
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_node2vec --representation_k 3 --model_class mlp
For the full list of hyperparameters, see src/train/parsers.py.
For the method fine-tuned on Edit Distance Approximation, without minibatching, use:
# k = 3
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 3 --representation_small_k 2 --representation_ss_hidden_channels 32_DB,32_KF0 --representation_ss_last_layer_edge_type DB --representation_size 32 --model_class cnn1d
# k = 4
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 4 --representation_small_k 2,3 --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 2000 --representation_size 64 --model_class cnn1d
# k = 5
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 5 --representation_small_k 2,3,4 --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 1000 --representation_size 64 --model_class cnn1d
# k = 6
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 6 --representation_small_k 2,5 --representation_ss_hidden_channels 128_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 500 --representation_size 64 --representation_ss_edges_keep_top_k 0.01 --model_class cnn1d
For the method fine-tuned on Edit Distance Approximation, with minibatching, use:
# k = 7
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_ssgnn_miniBatch --representation_ss_task CL --representation_k 7 --representation_small_k 2,5 --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 100 --representation_size 64 --representation_ss_edges_keep_top_k 0.01 --representation_ss_edges_threshold 0.8 --representation_ss_batch_size 1024 --model_class cnn1d
# k = 8
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_ssgnn_miniBatch --representation_ss_task CL --representation_k 8 --representation_small_k 2,5 --representation_ss_hidden_channels 32_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 100 --representation_size 32 --representation_ss_edges_keep_top_k 0.01 --representation_ss_edges_threshold 0.8 --representation_ss_batch_size 256 --model_class cnn1d
For the zero-shot method, without minibatching, use:
# k = 3
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 3 --representation_small_k 2 --representation_ss_hidden_channels 32_DB,32_KF0 --representation_ss_last_layer_edge_type DB --representation_size 32
# k = 4
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 4 --representation_small_k 2,3 --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 2000 --representation_size 64
# k = 5
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 5 --representation_small_k 2,3,4 --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 1000 --representation_size 64
# k = 6
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_ssgnn --representation_ss_task CL --representation_k 6 --representation_small_k 2,5 --representation_ss_hidden_channels 128_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 500 --representation_size 64 --representation_ss_edges_keep_top_k 0.01
For the zero-shot method, with minibatching, use:
# k = 7
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_ssgnn_miniBatch --representation_ss_task CL --representation_k 7 --representation_small_k 2,5 --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 100 --representation_size 64 --representation_ss_edges_keep_top_k 0.01 --representation_ss_edges_threshold 0.8 --representation_ss_batch_size 1024
# k = 8
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_ssgnn_miniBatch --representation_ss_task CL --representation_k 8 --representation_small_k 2,5 --representation_ss_hidden_channels 32_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 100 --representation_size 32 --representation_ss_edges_keep_top_k 0.01 --representation_ss_edges_threshold 0.8 --representation_ss_batch_size 256
For the full list of hyperparameters, see src/train/parsers.py.
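The directory listing in this README describes zero_shot_model.py as "Concat, Mean, Max of k-mer embeddings". A minimal sketch of what such an aggregation could look like is given below. It assumes overlapping k-mers with stride 1 and a plain dictionary from k-mer to vector; both are illustrative assumptions, and the names `kmers_of` and `aggregate` are hypothetical, not the repository's API.

```python
def kmers_of(seq: str, k: int) -> list:
    """All overlapping k-mers of seq (stride 1 is an assumption of this sketch)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def aggregate(seq: str, k: int, embedding: dict, mode: str = "mean") -> list:
    """Turn per-k-mer vectors into one sequence vector by concat, mean, or max."""
    vectors = [embedding[kmer] for kmer in kmers_of(seq, k)]
    if mode == "concat":
        # Output length grows with the number of k-mers in the sequence.
        return [x for vec in vectors for x in vec]
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Toy 2-dimensional embedding for the three 2-mers of "ACGT".
    toy = {"AC": [1.0, 0.0], "CG": [0.0, 2.0], "GT": [3.0, 1.0]}
    for mode in ("concat", "mean", "max"):
        print(mode, aggregate("ACGT", 2, toy, mode))
```

Mean and max give a fixed-size vector regardless of sequence length, whereas concat only yields comparable vectors when all sequences have the same length.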
Use the --data flag to specify the dataset path, the --retrieval_data flag to specify the retrieval dataset path, and the --representation_k flag for the desired k value.
For the method fine-tuned on Edit Distance Approximation, use:
# OneHot
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_onehot --representation_k 3 --model_class cnn1d
# Word2Vec
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_word2vec --representation_k 3 --model_class cnn1d
# Node2Vec
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --representation_method kmer_node2vec --representation_k 3 --model_class cnn1d
For the zero-shot method, use:
# OneHot
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_onehot --representation_k 3 --model_class cnn1d
# Word2Vec
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_word2vec --representation_k 3 --model_class cnn1d
# Node2Vec
python -m src.train.main_editDistance --data path_to_dataset --retrieval_data path_to_retrieval_dataset --zero_shot_retrieval --representation_method kmer_node2vec --representation_k 3 --model_class cnn1d
For the full list of hyperparameters, see src/train/parsers.py.
To specify the device type, use the --accelerator flag. For example, to use a GPU, enter --accelerator gpu.
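If the same script has to run on machines with and without a GPU, the flag value can be chosen programmatically. The helper below is a sketch: `pick_accelerator` is a hypothetical name, and it assumes that `cpu` is also an accepted value for --accelerator, which this README does not state.

```python
def pick_accelerator() -> str:
    """Return "gpu" when PyTorch reports a CUDA device, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to query; fall back to CPU.
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    # Assumption: "cpu" is a valid value for --accelerator.
    print(f"--accelerator {pick_accelerator()}")
```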
To use FAISS for approximate nearest neighbor search [3] instead of cosine similarity to find nodes with close sub-k-mer frequency vectors, use the flag --representation_ss_faiss_ann. For example, in the case of Edit Distance Approximation:
# k = 10
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn_miniBatch --representation_ss_faiss_ann --representation_ss_faiss_distance L2 --representation_ss_edges_keep_top_k 0.00008 --representation_k 10 --representation_small_k 2,5 --representation_ss_task CL --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 3 --representation_size 32 --model_class mlp
# k = 30
python -m src.train.main_editDistance --data path_to_dataset --representation_method kmer_ssgnn_miniBatch --representation_ss_faiss_ann --representation_ss_faiss_distance IP --representation_ss_edges_keep_top_k 0.00001 --representation_k 30 --representation_small_k 2,5 --representation_ss_task CL --representation_ss_hidden_channels 64_KF0 --representation_ss_last_layer_edge_type DB --representation_ss_epochs 3 --representation_size 32 --model_class mlp
This is only supported for --representation_ss_task CL (Contrastive Learning) and --representation_method kmer_ssgnn_miniBatch.
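To make the idea of "nodes with close sub-k-mer frequency vectors" concrete, the sketch below counts the sub-k-mers of each k-mer and ranks the other k-mers by cosine similarity with an exhaustive search. It is an illustration under assumptions, not the repository's implementation (which lives in gnn_utils.py): the DNA alphabet ACGT, the raw-count vector, and the names `sub_kmer_frequency`, `cosine`, and `nearest_by_cosine` are all choices made here.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet


def sub_kmer_frequency(kmer: str, small_k: int) -> list:
    """Count each length-small_k substring of kmer, in a fixed lexicographic order."""
    counts = Counter(kmer[i:i + small_k] for i in range(len(kmer) - small_k + 1))
    return [counts["".join(p)] for p in product(ALPHABET, repeat=small_k)]


def cosine(u: list, v: list) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest_by_cosine(kmers: list, small_k: int, top_n: int) -> dict:
    """For each k-mer, list its top_n most similar other k-mers (exhaustive search)."""
    vecs = {k: sub_kmer_frequency(k, small_k) for k in kmers}
    result = {}
    for k in kmers:
        scored = sorted(
            ((cosine(vecs[k], vecs[other]), other) for other in kmers if other != k),
            reverse=True,
        )
        result[k] = [other for _, other in scored[:top_n]]
    return result


if __name__ == "__main__":
    print(nearest_by_cosine(["ACGT", "ACGA", "TTTT"], small_k=2, top_n=1))
```

The exhaustive search compares every pair of k-mers, so its cost grows quadratically with the number of nodes. Since there are up to 4^k distinct k-mers, this quickly becomes impractical for large k, which is where an approximate nearest neighbor index such as FAISS becomes attractive.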
To use the Graph Autoencoder, replace 'CL' with 'AE' in the --representation_ss_task flag: --representation_ss_task AE.
Our tasks and datasets for Edit Distance Approximation and Closest String Retrieval were taken from Corso et al. [1]. The datasets can be obtained directly from the official repository of that paper. The dataset directories can be passed directly to our --data and --retrieval_data flags.
*Supplementary Content: Outside the research scope of this paper:* Our Gene Prediction task and datasets were taken from Silva et al. [2].
.
├── README.md
├── environment.yml  # conda env
└── src
    ├── downstream_tasks  # Folder defining downstream tasks
    │   ├── coding_metagenomics  # Supplementary Folder: *Outside the research scope of the paper*
    │   │   ├── cnn1d.py
    │   │   ├── coding_datasets.py
    │   │   └── train.py
    │   ├── datasets_factory  # Reading datasets
    │   │   ├── coding_metagenomics.py  # Supplementary File: *Outside the research scope of the paper*
    │   │   └── edit_distance.py  # Reading datasets from Corso et al. [1]
    │   └── edit_distance_models  # EDIT DISTANCE APPROXIMATION and CLOSEST STRING RETRIEVAL tasks
    │       ├── cnn1d.py
    │       ├── distance_datasets.py
    │       ├── distances.py  # Hyperbolic Function
    │       ├── mlp.py  # Single Linear Layer by default
    │       ├── retrieval_test.py  # Closest String Retrieval tests
    │       ├── train.py  # Edit Distance Approximation
    │       └── zero_shot_model.py  # Concat, Mean, Max of k-mer embeddings
    ├── representations
    │   ├── gnn_common  # Models and Utils for Our Contrastive Learning Method
    │   │   ├── gnn_models.py  # GNN, other models *Outside the research scope of the paper*
    │   │   └── gnn_utils.py  # Edge Computations, including FAISS method
    │   ├── gnn_tasks  # Without Mini-Batching/Neighborhood sampling
    │   │   ├── autoencoder_task.py  # Supplementary Method: Appendix C in our paper
    │   │   ├── sampling_task.py  # GNN Contrastive Learning
    │   │   └── utils.py
    │   ├── gnn_tasks_miniBatch  # With Mini-Batching/Neighborhood sampling
    │   │   ├── autoencoder_task.py  # Supplementary Method: Appendix C in our paper
    │   │   ├── dataloader.py
    │   │   ├── sampling_task.py  # GNN Contrastive Learning
    │   │   └── utils.py
    │   ├── kmer_node2vec.py
    │   ├── kmer_onehot.py
    │   ├── kmer_ssgnn.py  # Our Workflow Without Mini-Batching/Neighborhood sampling
    │   ├── kmer_ssgnn_miniBatch.py  # Our Workflow With Mini-Batching/Neighborhood sampling
    │   ├── kmer_word2vec.py
    │   └── representations_factory.py
    ├── train
    │   ├── main_editDistance.py  # Main Workflow
    │   ├── main_geneFinder.py  # Supplementary Task: *Outside the research scope of the paper*
    │   ├── param_search_optuna.py  # Can be used with yaml file for grid search
    │   └── parsers.py
    └── utils.py
[1] Corso, G., Ying, Z., Pándy, M., Veličković, P., Leskovec, J., & Liò, P. (2021). Neural distance embeddings for biological sequences. Advances in Neural Information Processing Systems, 34, 18539-18551.
[2] Silva, R., Padovani, K., Góes, F., & Alves, R. (2021). geneRFinder: gene finding in distinct metagenomic data complexities. BMC Bioinformatics, 22(1), 1-17.
[3] Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535-547.