\section*{Results}
One of the practical goals driving the development of FALDO was to be able
to represent all the annotated sequences in UniProt and the INSDC as RDF
triples, as a step towards providing this data via SPARQL endpoints through
which it can be queried.
UniProt is already available as RDF, and a revised version of the full
dataset using FALDO has already been tested and is expected to be included
in the UniProt XXX release \textit{TODO - fill in version and data}.

The protein examples considered, such as the UniProt feature annotations,
require only relatively simple locations within protein sequences to be
described, although there are corner cases such as unknown boundaries.
\textit{TODO - examples - how about cysteine bonds in PDB data?}.

Feature locations on nucleotide sequences can be relatively simple.
%GenBank
For example, the \textit{cheY} gene in
\textit{Escherichia coli} str. K-12 substr. MG1655 (accession NC\_000913.2)
is described in the INSDC feature table as \texttt{complement(1965072..1965461)},
which is 390 base pairs using inclusive one-based counting.
This is interpreted as \texttt{complement($end$..$start$)}, where
the feature begins on the base complementary to $start = 1965461$
and finishes at $end = 1965072$. FALDO respects this biological
interpretation of a feature location on the reverse strand:
%Notice begin, end = 1965461, 1965072 (biological interpretation)
\begin{shaded}
\begin{verbatim}
# Incomplete Turtle format triples example using FALDO to describe
# gene cheY being at complement(NC_000913.2:1965072..1965461)
# (the so:, rdfs:, xsd: and refseq: prefix declarations are omitted)
@prefix faldo: <http://biohackathon.org/resource/faldo#> .

# Here are example triples describing a gene, note the location line:
<_:geneCheY> a so:Gene ;
    rdfs:label "cheY" ;
    faldo:location <_:example> .

# The following triples define the location used by the feature above,
# made up of a region and its two positions:
<_:example> a faldo:Region ;
    faldo:begin <_:example_b> ;
    faldo:end <_:example_e> .

<_:example_b> a faldo:Position ;
    a faldo:ExactlyKnownPosition ;
    a faldo:ReverseStrandPosition ;
    faldo:position "1965461"^^xsd:int ;
    faldo:reference refseq:NC_000913.2 .

<_:example_e> a faldo:Position ;
    a faldo:ExactlyKnownPosition ;
    a faldo:ReverseStrandPosition ;
    faldo:position "1965072"^^xsd:int ;
    faldo:reference refseq:NC_000913.2 .
\end{verbatim}
\end{shaded}
In contrast, other formats like GFF require $start \leq end$
regardless of the strand, which is equivalent to interpreting
the INSDC location string as \texttt{complement($start$..$end$)}.
This convention has a number of practical advantages when
dealing with numerical operations on feature sets, such as
checking for overlaps or indexing data. For example, the
feature length is given by $length = end - start + 1$ under
this numerically convenient scheme where the interpretation
of $start$ versus $end$ is strand independent.
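
As a worked check of the two conventions, the \textit{cheY} coordinates can
be converted by hand. The symbols $begin_{F}$ and $end_{F}$ are introduced
here only for illustration and denote the FALDO begin and end positions of
the region; the conversion assumes both positions lie on the same reference
and strand:
\[
\begin{array}{rcl}
start & = & \min(begin_{F}, end_{F}) = \min(1965461, 1965072) = 1965072 \\
end & = & \max(begin_{F}, end_{F}) = \max(1965461, 1965072) = 1965461 \\
length & = & end - start + 1 = 1965461 - 1965072 + 1 = 390
\end{array}
\]
The result agrees with the 390 base pairs quoted for the INSDC location.
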

Note that the object model used treats the strand and reference
as properties of the start and end positions of a region, so in
principle the two positions could differ in strand or reference.
Such regions cannot be expressed in the INSDC feature tables or the
GFF formats, so a stricter version of FALDO includes an OWL constraint
on the region class requiring that the start and end positions have
the same reference and strand information.
\textit{TODO - implement that}
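
One possible way to state this constraint more formally is sketched below.
The functions $ref(p)$ and $strand(p)$, giving the reference sequence and
strand of a position $p$, and $begin(r)$ and $end(r)$, giving the two
positions of a region $r$, are notation assumed here for illustration and
are not terms defined by FALDO:
\[
ref(begin(r)) = ref(end(r))
\quad \mbox{and} \quad
strand(begin(r)) = strand(end(r))
\quad \mbox{for every region } r .
\]
For the \textit{cheY} region above, both positions refer to NC\_000913.2 and
both are reverse strand positions, so the condition holds.
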
Nucleotide data from the INSDC can require considerably more complicated
sequence regions, both in terms of ambiguous end points, and the need for
various compound locations.
\textit{TODO - examples}.

All of the UniProt protein sequences are now available in RDF format,
with a SPARQL endpoint currently in beta. Converting more recent
curated INSDC data has been straightforward; however, a
number of deprecated location strings have been problematic.
\textit{TODO - examples? manual curation to fix last few oddities?}
\textit{TODO - other ideas:}
\begin{itemize}
\item Example of Cys bonding between one or two proteins?
\item RNA/DNA digest sites, and the cut sites -- especially for overhangs (start and end on opposite strands)
\item tRNA hairpin (two complementary regions)?
\item $\beta$-sheet in protein?
\item RNA-Seq mapping (useful for SAM/BAM)?
\end{itemize}