GitHub - Wytz/migraine_biomarkers: The code and the data which were used for the identification of migraine biomarkers using a semantic knowledge graph

Wytz / migraine_biomarkers Public

Notifications You must be signed in to change notification settings
Fork 1
Star 0

The code and the data which were used for the identification of migraine biomarkers using a semantic knowledge graph

dx.doi.org/10.1016/j.jbi.2017.05.018

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitattributes		.gitattributes
.gitignore		.gitignore
Adjacent connections -- Transmog 31-10-2016.csv		Adjacent connections -- Transmog 31-10-2016.csv
Definite_CSF_compounds_migraine_node_IDs_16-01-2015.csv		Definite_CSF_compounds_migraine_node_IDs_16-01-2015.csv
Definite_serum_migraine_node_ID_16-01-2015_unique.csv		Definite_serum_migraine_node_ID_16-01-2015_unique.csv
Error analysis.R		Error analysis.R
Indirect_Transmog_all 31-10-2016.csv		Indirect_Transmog_all 31-10-2016.csv
Migraine Nieuw.Rproj		Migraine Nieuw.Rproj
Notebook.Rmd		Notebook.Rmd
Pharma_not_wanted_2.csv		Pharma_not_wanted_2.csv
README.Rmd		README.Rmd
Top compounds statistics transmog final filtered 04-11-2016.csv		Top compounds statistics transmog final filtered 04-11-2016.csv

Repository files navigation

# Introduction

This project describes the code that was used for the identification of migraine biomarkers with a semantic knowledge graph.
A link to the journal article is available in the project description.

A semantic knowledge graph describes how biomedical entities, such as diseases and drugs, are related to each other using subject-predicate-object triples.
Often, these triples also contain references to sources in which they are described; their provenance.
In the knowledge graph used for this experiment, this was required.

The semantic web can be considered an implementation of a knowledge graph.
Where they differ is that the semantic web sets a number of additional requirements, such as unique URI's for each concept.
Furthermore, semantic web implementations commonly have "exact match" predicates between two concepts that describe the same real-world entity.
In the knowledge graph used by us, these concepts are simply merged to a single concept.

# Experimental setup

## Use of the reference set

The reference set used in this experiment was only used to evaluate the results, and not as the training set.
Effectively, we pretended we only had a minimal knowledge about the desired outcome.
We therefore chose to use the information in the knowledge graph as weights, instead of examining their semantic meaning.

## Types of weights

We examined four types of weights:
1. The number of concepts connecting migraine to the potential biomarker
2. The number of predicates connecting migraine to the potential biomarker
3. The number of sources connecting migraine to the potential biomarker
4. A variation to the number of concepts, where the edges in the second step are also used. If intermediate concepts are connected to a large number of other concepts, this decreases the weight of the edge.

Where each weight was between 0 and 1.

Because identifying potential biomarkers that have a direct relationship with migraine is obvious, we only used information from the indirect relationships.
This information is more likely to be hidden in the bodies of journal articles, and not immediately and obviously mentioned in their titles or abstracts.
The weights of the two steps were therefore multiplied with each other to calculate a "probability" of ending up at that specific biomarker if you randomly followed the concepts/predicates/sources.
Surprisingly, we found little difference in overall performance between the different weights.
Some reference compounds were ranked higher than others depending on the ranking method, but overall, no "best" ranking method could be identified.

## The migraine subgraph

Because the UMLS covers a very wide range of concepts, ranging from organizations to atoms, we made a selection in the included types of concepts.
Not every concept type was considered to be relevant by the migraine researchers, who made this selection.
The ultimate list of included concept types is available in the article.

## Removing medications and highly generic concepts

Another challenge were the drugs.
Many molecules naturally available in the body (endogenic) can also be used as a drug.
Glucose is a good example of this.
We therefore had to create a list of compounds which do not occur naturally in the body (exogenic), which we used to filter the list of potential biomarkers.

Another filter we applied was to remove highly generic concepts, such as the concept for "Disease".
These are generic to such a degree that they only serve to introduce noise.
We like to think that we made common sense decisions there.

# Reproducability

A [notebook](http://wytze.info/migraine_biomarkers/) with the executed code is available in a notebook, which fully reproduces the results presented in the article.
The data I extracted from the knowledge graph is available in the project as CSV files.
If you want access to the knowledge graph itself, please send me an email.

The notebook also mentions the packages that were used, as well as the R version.