
Predicting Gene-Disease Associations with Knowledge Graph Embeddings over Multiple Ontologies

Introduction

Ontology-based approaches for predicting gene-disease associations include the more classical semantic similarity methods and, more recently, knowledge graph embeddings. While semantic similarity is typically restricted to hierarchical relations within the ontology, knowledge graph embeddings consider its full breadth. However, embeddings are produced over a single graph, and complex tasks such as gene-disease association prediction may require additional ontologies. We investigate the impact of richer semantic representations based on more than one ontology, able to represent both genes and diseases and to consider multiple kinds of relations within the ontologies. Our experiments demonstrate the value of random walk-based knowledge graph embeddings and highlight the need for a closer integration of different ontologies.

Dataset and Annotations

Dataset_Pairs_Label.csv contains a total of 2716 genes, 1807 diseases, and 8189 gene-disease associations from DisGeNET, plus 8189 negative samples. GO annotations were downloaded from the Gene Ontology Annotation (GOA) database for the human species. HP annotations were downloaded from the HP database, linking genes and diseases to HP terms. The original DisGeNET file, curated_Disgenet.csv, is also available.

Indexes Creation

  • We performed a stratified 70% training / 30% testing split (inside the 3.Performance ML folder), with the same split used throughout all experiments, including the baseline.

  • We also performed a stratified ten-fold cross-validation (inside the crossvalidation_10-fold folder), with the same folds used across all experiments; for each fold, the Weighted Average of F-measures (WAF) of the classifications was assessed, and the median across folds was reported with Median-Calculation.py. Both splitting strategies are sketched below.
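
As a rough illustration of how these splits could be reproduced with scikit-learn (the file name comes from the Dataset section; the label column name and random seed are assumptions, not necessarily those used in the repository's scripts):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical loading of the pair/label file; the "label" column name is an assumption.
pairs = pd.read_csv("Dataset_Pairs_Label.csv")
X = pairs.drop(columns=["label"])
y = pairs["label"]

# Stratified 70% / 30% split, with a fixed seed so the same split can be reused everywhere.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Stratified ten-fold cross-validation with the same folds in every experiment.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    pass  # train and evaluate one model per fold, recording its WAF
```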

Baseline

The baseline uses the SSMC tool (more details available at https://github.com/liseda-lab/SSMC) and six different semantic similarity measures:

  • BMA ICSeco
  • BMA ICResnik
  • SimGIC ICSeco
  • SimGIC ICResnik
  • Max ICSeco
  • Max ICResnik

SSMC accepts as input a JSON file with a series of mandatory (and optional) user-defined configurations. An example is provided in Configuration.json in the SSMC Tool folder. The association prediction performed in Threshold_Baseline.py (SS_Baseline folder) is expressed as a classification problem, where a similarity score for a gene-disease pair exceeding a certain threshold indicates a positive association.
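
The threshold formulation can be sketched as follows; this is a minimal illustration, not the actual Threshold_Baseline.py, and the candidate-threshold grid is an assumption:

```python
import numpy as np
from sklearn.metrics import f1_score

def classify_by_threshold(scores, threshold):
    """Label a gene-disease pair as positive when its similarity score exceeds the threshold."""
    return (np.asarray(scores) > threshold).astype(int)

def best_threshold(scores, labels, candidates=np.linspace(0.0, 1.0, 101)):
    """Pick the candidate threshold that maximizes the weighted F-measure on training pairs."""
    wafs = [f1_score(labels, classify_by_threshold(scores, t), average="weighted")
            for t in candidates]
    return candidates[int(np.argmax(wafs))]
```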

Running Embeddings

KGEs were built with 200 features using five different methods that cover different approaches to KGE:

  • Translational Distance (TransE)
  • Semantic Matching (DistMult)
  • Random Walk-based (RDF2Vec, OPA2Vec, OWL2Vec*)

-- TransE and DistMult: implementations from https://github.com/thunlp/OpenKE with default parameters.

-- RDF2Vec: implementation from https://github.com/IBCNServices/pyRDF2Vec; the sequences were generated using the Weisfeiler-Lehman algorithm with a walk depth of 8 and a limit of 500 walks. The corpora of sequences were used to build a Skip-Gram model with the default parameters (see the sketch after this list).

-- OPA2Vec: implementation from https://github.com/bio-ontology-research-group/opa2vec with default parameters.

-- OWL2Vec*: implementation from https://github.com/KRR-Oxford/OWL2Vec-Star with the same parameters as RDF2Vec.
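
For the walk-based methods, the step from walk corpora to embeddings can be illustrated with gensim's Skip-Gram implementation; this is a sketch that assumes the walks are already available as token sequences (the example walks and URIs are made up, and parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# A corpus of sequences: one list of tokens per extracted walk (illustrative values only).
walks = [
    ["http://example.org/gene/G1", "hasAnnotation", "GO_0006281", "subClassOf", "GO_0006974"],
    ["http://example.org/disease/D1", "hasPhenotype", "HP_0003002"],
]

# Skip-Gram (sg=1) model with 200-dimensional vectors, matching the embedding size above.
model = Word2Vec(sentences=walks, vector_size=200, sg=1, window=5, min_count=1, workers=4)

# An entity's embedding is then looked up by its identifier.
gene_vector = model.wv["http://example.org/gene/G1"]
```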

Vector Operations

Each gene-disease pair corresponds to two vectors, g and d, associated with a gene and a disease, respectively. We defined an operator over the corresponding vectors in order to generate a representation r(g,d). Several choices for the operator were considered:

  • Concatenation
  • Average
  • Hadamard
  • Weighted-L1
  • Weighted-L2

We also measured the cosine similarity between the two vectors, following the same threshold-based approach used in the Baseline with a similarity threshold.
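
A minimal numpy sketch of these pair-wise operators and of the cosine score (function names are illustrative, not the ones used in the repository):

```python
import numpy as np

def pair_representation(g, d, operator="hadamard"):
    """Combine a gene vector g and a disease vector d into a single representation r(g, d)."""
    g, d = np.asarray(g, dtype=float), np.asarray(d, dtype=float)
    if operator == "concatenation":
        return np.concatenate([g, d])
    if operator == "average":
        return (g + d) / 2.0
    if operator == "hadamard":
        return g * d
    if operator == "weighted-l1":
        return np.abs(g - d)
    if operator == "weighted-l2":
        return (g - d) ** 2
    raise ValueError(f"Unknown operator: {operator}")

def cosine_similarity(g, d):
    """Cosine similarity between gene and disease vectors, used with a threshold as in the baseline."""
    g, d = np.asarray(g, dtype=float), np.asarray(d, dtype=float)
    return float(np.dot(g, d) / (np.linalg.norm(g) * np.linalg.norm(d)))
```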

Running Performance

The resulting vectors were then used as input to four different ML algorithms:

  • Random Forest
  • eXtreme Gradient Boosting
  • Naïve Bayes
  • Multi-Layer Perceptron

To use the 70-30 split:

  • Performance_ML_70-30split.py

To use the 10-fold cross-validation:

  • Performance_ML_10fold.py

We also provide Median-Calculation.py (3.Performance ML folder) to calculate the median result over the 10 partitions.

Grid search was employed in both Performance_ML_70-30split.py and Performance_ML_10fold.py to obtain optimal parameters for RF, XGB, and MLP.
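
As an illustration of this setup, a grid search for the Random Forest could look like the sketch below; the parameter grid and the stand-in data are assumptions, and the actual grids live in the two scripts:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the actual pipeline X holds the pair representations r(g, d) and y the labels.
X, y = make_classification(n_samples=500, n_features=200, random_state=42)

# Hypothetical parameter grid (the real grids are defined inside the scripts).
param_grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 10, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",  # weighted average of F-measures (WAF)
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```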

Ontology Download

HOW TO

In the DEMO folder, we provide Step_by_Step.pdf, a step-by-step guide to a successful implementation.

Authors

  • Susana Nunes
  • Rita T. Sousa
  • Catia Pesquita

For any comments or help needed with this implementation, please send an email to: [email protected]

Acknowledgments

This work was supported by FCT through the LASIGE Research Unit (ref. UIDB/00408/2020 and ref. UIDP/00408/2020). It was also partially supported by the KATY project, which has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017453.
