-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
62 lines (44 loc) · 3.95 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
# Introduction
This project describes the code that was used for the identification of migraine biomarkers with a semantic knowledge graph.
A link to the journal article is available in the project description.
A semantic knowledge graph describes how biomedical entities, such as diseases and drugs, are related to each other using subject-predicate-object triples.
Often, these triples also contain references to sources in which they are described; their provenance.
In the knowledge graph used for this experiment, this was required.
The semantic web can be considered an implementation of a knowledge graph.
Where they differ is that the semantic web sets a number of additional requirements, such as unique URI's for each concept.
Furthermore, semantic web implementations commonly have "exact match" predicates between two concepts that describe the same real-world entity.
In the knowledge graph used by us, these concepts are simply merged to a single concept.
# Experimental setup
## Use of the reference set
The reference set used in this experiment was only used to evaluate the results, and not as the training set.
Effectively, we pretended we only had a minimal knowledge about the desired outcome.
We therefore chose to use the information in the knowledge graph as weights, instead of examining their semantic meaning.
## Types of weights
We examined four types of weights:
1. The number of concepts connecting migraine to the potential biomarker
2. The number of predicates connecting migraine to the potential biomarker
3. The number of sources connecting migraine to the potential biomarker
4. A variation to the number of concepts, where the edges in the second step are also used. If intermediate concepts are connected to a large number of other concepts, this decreases the weight of the edge.
Where each weight was between 0 and 1.
Because identifying potential biomarkers that have a direct relationship with migraine is obvious, we only used information from the indirect relationships.
This information is more likely to be hidden in the bodies of journal articles, and not immediately and obviously mentioned in their titles or abstracts.
The weights of the two steps were therefore multiplied with each other to calculate a "probability" of ending up at that specific biomarker if you randomly followed the concepts/predicates/sources.
Surprisingly, we found little difference in overall performance between the different weights.
Some reference compounds were ranked higher than others depending on the ranking method, but overall, no "best" ranking method could be identified.
## The migraine subgraph
Because the UMLS covers a very wide range of concepts, ranging from organizations to atoms, we made a selection in the included types of concepts.
Not every concept type was considered to be relevant by the migraine researchers, who made this selection.
The ultimate list of included concept types is available in the article.
## Removing medications and highly generic concepts
Another challenge were the drugs.
Many molecules naturally available in the body (endogenic) can also be used as a drug.
Glucose is a good example of this.
We therefore had to create a list of compounds which do not occur naturally in the body (exogenic), which we used to filter the list of potential biomarkers.
Another filter we applied was to remove highly generic concepts, such as the concept for "Disease".
These are generic to such a degree that they only serve to introduce noise.
We like to think that we made common sense decisions there.
# Reproducability
A [notebook](http://wytze.info/migraine_biomarkers/) with the executed code is available in a notebook, which fully reproduces the results presented in the article.
The data I extracted from the knowledge graph is available in the project as CSV files.
If you want access to the knowledge graph itself, please send me an email.
The notebook also mentions the packages that were used, as well as the R version.