Adding profile specific relations to BioChemEntity and DataRecord
This document aims to help explain how we are adding Bioschemas specific relations to the BioChemEntity
and DataRecord
types, as this is particularly important both for publishers of generic life sciences data (such as the InterMine platform) and general life sciences search applications such as Buzzbang. Please feel free to edit this document as necessary and add yourself as a contributor.
Many of the examples here refer to the Protein
specification, which is a Bioschemas profile of the BioChemEntity
type.
- Justin Clark-Casey
- Leyla Garcia: additional details on BioChemEntity, DataRecord and customized types, i.e., profiles.
- Everybody who contributed to the discussions or examples linked by this doc
Bioschemas is an initiative to embed schema.org-like metadata in webpages related to life sciences. This will promote use-cases such as improved findability of life sciences information.
Many of the Bioschemas specifications use existing schema.org types such as DataCatalog
and make them suitable for Bioschemas purposes by making certain relations mandatory, controlling their cardinality, etc. However, BioChemEntity
and DataRecord
are new types; while the former is designed specifically for the life sciences, the latter is generic enough to be broadly adopted (there is also LabProtocol
but I'm not so familiar with that so we won't further discuss it here). BioChemEntity
aims to describe biological, chemical and biochemical entities, while DataRecord
makes it easier to associate BioChemEntities to data records probably grouped together in a Dataset
or DataCatalog
.
The common relations on BioChemEntity
and DataRecord
are generic (contains, isContainedIn, isBasisFor, etc.). The questions then are: (1) how do we specify that a particular BioChemEntity
is a protein, or that a particular DataRecord
is for a sample? (2) how do we then give a protein BioChemEntity
protein-specific relations such as amino acid sequence, or a sample DataRecord
sample-specific relations such as diagnoses available?
For the first question, in Bioschemas we offer profiles for BioChemEntity
, i.e., customizations of BioChemEntity, and together with them we propose an officially supported type from a well-known ontology. Any other type of interest to data providers can be specified via the schema.org Thing.additionalType
relation, where the URL is an ontology/schema/controlled vocabulary term denoting the entity. For a protein BioChemEntity
this is something like
{
"@context": "http://schema.org",
"@type": ["BioChemEntity", "http://purl.obolibrary.org/obo/PR_000000001"],
"additionalType": "http://semanticscience.org/resource/SIO_010043",
"identifier": "P00519",
"name": "ABL1",
...
}
In this case, http://purl.obolibrary.org/obo/PR_000000001
is the official type for entities under the Protein profile while http://semanticscience.org/resource/SIO_010043
is an additional Semanticscience Integrated Ontology (SIO) term for protein. This latter additionalType
can be used by a provider to link an ontology they are already using. In the current specification draft it is recommended but will shortly become optional instead. See here for a more extensive protein example.
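As a rough illustration of how a consuming application might answer the first question, the sketch below checks whether an entity carries the official protein type. It is only a sketch: the helper names `as_list`, `entity_types` and `is_protein` are hypothetical, and it compares plain strings rather than performing full JSON-LD expansion.

```python
import json

# Official type URI for the Protein profile, taken from the example above.
PROTEIN_TYPE = "http://purl.obolibrary.org/obo/PR_000000001"


def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def entity_types(entity):
    """Collect every type URI asserted via @type or additionalType."""
    return set(as_list(entity.get("@type"))) | set(as_list(entity.get("additionalType")))


def is_protein(entity):
    """Hypothetical helper: true if the entity carries the official protein type."""
    return PROTEIN_TYPE in entity_types(entity)


doc = json.loads("""
{
  "@context": "http://schema.org",
  "@type": ["BioChemEntity", "http://purl.obolibrary.org/obo/PR_000000001"],
  "additionalType": "http://semanticscience.org/resource/SIO_010043",
  "identifier": "P00519",
  "name": "ABL1"
}
""")

print(is_protein(doc))
```

A provider-specific term such as the SIO one is also returned by `entity_types`, so an application that prefers a different ontology could test for that term instead.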
DataRecord
does not have any profiles at this time. However, you can find out the type of the underlying entity denoted by that DataRecord through the Thing.mainEntity
relation. Such a DataRecord
would look something like:
{
"@type": "DataRecord",
"@id": "http://www.identifiers.org/uniprot/P00519",
"identifier": "P00519",
"mainEntity": {
"@type": ["BioChemEntity","http://purl.obolibrary.org/obo/PR_000000001"],
"additionalType": "http://semanticscience.org/resource/SIO_010043",
"identifier": "P00519",
"name": "Tyrosine-protein kinase ABL1"
}
}
There are mechanisms for linking DataRecords
to BioChemEntities
rather than embedding a BioChemEntity
through JSON-LD's @id mechanism. For a more extensive but slightly out-of-date DataRecord example see this page.
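The linking alternative might look something like the sketch below, where the DataRecord points at its BioChemEntity by @id and a consumer resolves the reference. The "@graph" layout, the `http://example.org/...` entity URL and the `resolve_main_entity` helper are assumptions made for illustration, not part of any Bioschemas specification.

```python
# A DataRecord that references its BioChemEntity by @id instead of embedding it.
# The "@graph" layout and the example.org URL are illustrative assumptions.
graph_doc = {
    "@context": "http://schema.org",
    "@graph": [
        {
            "@type": "DataRecord",
            "@id": "http://www.identifiers.org/uniprot/P00519",
            "identifier": "P00519",
            "mainEntity": {"@id": "http://example.org/protein/P00519"},
        },
        {
            "@type": ["BioChemEntity", "http://purl.obolibrary.org/obo/PR_000000001"],
            "@id": "http://example.org/protein/P00519",
            "name": "Tyrosine-protein kinase ABL1",
        },
    ],
}


def resolve_main_entity(record, nodes):
    """Return the record's mainEntity, following an @id reference if needed."""
    main = record.get("mainEntity")
    if not isinstance(main, dict):
        return None
    # A node holding only "@id" is a reference; anything more is embedded data.
    if set(main) == {"@id"}:
        by_id = {n["@id"]: n for n in nodes if "@id" in n}
        return by_id.get(main["@id"])
    return main


nodes = graph_doc["@graph"]
record = nodes[0]
entity = resolve_main_entity(record, nodes)
print(entity["name"])
```

The same helper returns an embedded entity unchanged, so a consumer does not need to know in advance which style a publisher chose.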
So once we know the customized type (profile) for a BioChemEntity
or DataRecord
, how do we give it relations specific to that profile (e.g. protein- or sample-specific relations)? Whilst there has been extensive discussion on this question, I (justinccdev) don't believe it has yet been fully resolved. The two main alternatives are using additionalProperty and specifying relations directly.
The BioChemEntity
and DataRecord
Bioschemas types already have an additionalProperty
relation which is designed for adding arbitrary relations. For instance, a samples DataRecord
that wanted to add a diagnosis available relation may have the form
{
"@context": "http://schema.org",
"@type": ["DataRecord"],
"additionalProperty": [
{
"@type": "PropertyValue",
"name": "diagnosis_available",
"value": "urn:miriam:icd:C00-C97",
"valueReference": [
{
"@type": "CategoryCode",
"name": "Malignant neoplasms",
"url": "http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
"codeValue": "C00-C97.9"
}
]
}
]
}
The possible pros of this approach are that:
- It's the easiest way for biobanks (hosting samples) to publish information. They don't have to agree on any pre-defined terms.
- All information is in the document itself. Links can be present but we don't rely on them to furnish useful data (unlike the Linked Data approach).
Possible cons:
- It doesn't allow easy use of existing validation tools for relations allowed by a particular Bioschemas BioChemEntity profile (e.g. protein).
- Information ends up being repeated (e.g. category code name).
- Adding information in other languages (e.g. translations of category code name) inflates file sizes.
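To give a feel for what a consumer has to do with additionalProperty entries, here is a minimal sketch that collects them into a dictionary keyed by property name. The function name `additional_properties` and the shape of the returned dictionary are assumptions for illustration; the record mirrors the samples example above.

```python
def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def additional_properties(record):
    """Map each PropertyValue name to a list of (value, referenced codeValues)."""
    result = {}
    for prop in as_list(record.get("additionalProperty")):
        if prop.get("@type") != "PropertyValue":
            continue  # Skip anything that is not a schema.org PropertyValue.
        codes = [ref.get("codeValue") for ref in as_list(prop.get("valueReference"))]
        result.setdefault(prop.get("name"), []).append((prop.get("value"), codes))
    return result


sample_record = {
    "@context": "http://schema.org",
    "@type": ["DataRecord"],
    "additionalProperty": [
        {
            "@type": "PropertyValue",
            "name": "diagnosis_available",
            "value": "urn:miriam:icd:C00-C97",
            "valueReference": [
                {
                    "@type": "CategoryCode",
                    "name": "Malignant neoplasms",
                    "url": "http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
                    "codeValue": "C00-C97.9",
                }
            ],
        }
    ],
}

print(additional_properties(sample_record))
```

Because property names such as `diagnosis_available` are free text, any check that a name is spelled as expected has to be coded by the consumer, which is the validation drawback noted above.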
One can also add relations from an arbitrary schema directly, either inline or through a mechanism such as an additional JSON-LD context. For instance, an extreme example of the above would define a SampleDataRecord type to act as a 'profile' of DataRecord and specify a diagnosisAvailable relation. The embedded JSON-LD may become
{
"@context": ["http://schema.org", "http://bioschemas.org/samples"],
"@type": ["SampleDataRecord"],
"diagnosisAvailable": [
"http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
"http://purl.bioontology.org/ontology/ICD10/D00-D09.9"
]
}
with a file hosted at http://bioschemas.org/samples
containing the following
{
"@context": {
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"rdfs": "http://www.w3.org/2000/01/rdf-schema#"
},
"@id": "http://bioschemas.org/samples",
"@graph": [
{
"@id": "http://bioschemas.org/samples/SampleDataRecord",
"@type": "rdfs:Class",
"rdfs:subClassOf": { "@id": "http://schema.org/DataRecord" }
},
{
"@id": "http://bioschemas.org/samples/diagnosisAvailable",
"@type": "rdf:Property",
"rdfs:label": "Diagnosis available",
"http://schema.org/domainIncludes": [
{
"@id": "http://bioschemas.org/samples/SampleDataRecord"
}
],
"http://schema.org/rangeIncludes": [
{
"@id": "http://schema.org/URL"
}
]
}
]
}
This is an extreme example and relies on an application possibly visiting URLs such as http://purl.bioontology.org/ontology/ICD10/C00-C97.9 and retrieving structured data stating that the name of that term is Malignant neoplasms, that its code is C00-C97.9, etc.
Possible pros of this approach:
- Using existing validation tools should be easier, for example to check that SampleDataRecord is a recognized Bioschemas profile and that diagnosisAvailable is a recognized relation in it, rather than having to code something custom if additionalProperty entries need validation.
- Information such as name and codeValue can be retrieved from a single canonical location rather than repeated in the text, possibly with mistakes.
- Easier to put different language translations in a central file such as http://bioschemas.org/samples
Possible cons:
- More complex for people doing Bioschemas markup, who are far more numerous and typically have lower technical capacity than those writing consuming applications.
- Not so easy to add arbitrary properties not already defined upfront in Bioschemas. It is possible through direct inlining of vocabulary or additional contexts, but this adds complexity and possible dependencies on further files in addition to http://bioschemas.org/samples
- http://bioschemas.org/samples needs to be permanently and reliably available (though it could possibly be served out of a GitHub location instead).
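The validation benefit of direct specification can be sketched as follows: given a vocabulary that declares which relations belong to which type, a generic check can flag relations that are not declared. This is a simplified illustration, not an existing validation tool. The `BASE` namespace, the trimmed-down `vocabulary` dictionary and both helper functions are assumptions, and a fuller validator would also need to accept relations inherited from schema.org (name, identifier, etc.).

```python
# Assumed namespace under which the hypothetical samples vocabulary defines its terms.
BASE = "http://bioschemas.org/samples/"

# Trimmed-down, in-memory stand-in for the vocabulary file discussed above.
vocabulary = {
    "@graph": [
        {"@id": BASE + "SampleDataRecord", "@type": "rdfs:Class"},
        {
            "@id": BASE + "diagnosisAvailable",
            "rdfs:label": "Diagnosis available",
            "http://schema.org/domainIncludes": [{"@id": BASE + "SampleDataRecord"}],
        },
    ]
}


def allowed_relations(vocab, type_name):
    """Return short names of properties whose domain includes the given type."""
    allowed = set()
    for node in vocab["@graph"]:
        domains = node.get("http://schema.org/domainIncludes", [])
        if any(d.get("@id") == BASE + type_name for d in domains):
            allowed.add(node["@id"][len(BASE):])
    return allowed


def unknown_relations(doc, vocab):
    """List relations in doc that the vocabulary does not declare for its types."""
    types = doc["@type"] if isinstance(doc["@type"], list) else [doc["@type"]]
    allowed = set()
    for type_name in types:
        allowed |= allowed_relations(vocab, type_name)
    # Keys starting with "@" are JSON-LD keywords, not relations.
    return sorted(k for k in doc if not k.startswith("@") and k not in allowed)


good = {
    "@type": ["SampleDataRecord"],
    "diagnosisAvailable": ["http://purl.bioontology.org/ontology/ICD10/C00-C97.9"],
}
typo = {"@type": "SampleDataRecord", "diagnosisAvailabel": []}

print(unknown_relations(good, vocabulary))
print(unknown_relations(typo, vocabulary))
```

The misspelled relation in `typo` is reported without any sample-specific code, which is the kind of check that is harder to get for free with additionalProperty entries.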
(This discussion is currently from my (justinccdev) pov, so subjective)
It's an important point that direct specification of relations looks more complicated and a bit less flexible. An important goal of schema.org and hence Bioschemas, as discussed in Schema.org: Evolution of Structured Data on the Web, is that markup is easy for database publishers to create and developers often do it by adapting examples. The onus is on data consumers to do more legwork in cleaning things up and putting data together, as opposed to the Linked Data approach which puts more burden on the publishers. Hence, it may be better in certain situations (?) such as samples to use the additionalProperty
mechanism which is relatively simple for data publishers to implement.
However, there's no reason that both approaches can't be used simultaneously in Bioschemas. It may be better to use a direct specification approach where relations are well-known (such as amino acid sequence in proteins), to allow easier validation and make published data quality higher.
Arguably, it may not even be that important to get the absolute best structure initially, as long as data publishers broadly agree on how they structure the embedded data, so that data consumers have something better to work with than scraping the webpage (even just having the data in a <script type="application/ld+json">
may be a significant benefit). When there is some data and people are using it, that provides the justification to spend more time on improving the representation if necessary.
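To illustrate why embedded JSON-LD is already easier to consume than scraped page text, the sketch below pulls every application/ld+json block out of an HTML page using only the Python standard library. The `JsonLdExtractor` class name and the sample page are made up for illustration; a real crawler would also need to handle malformed JSON and fetch the page itself.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Made-up page embedding a small DataRecord.
page = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "DataRecord", "identifier": "P00519"}
</script>
<script type="text/javascript">var unrelated = 1;</script>
</head><body><h1>P00519</h1></body></html>"""

parser = JsonLdExtractor()
parser.feed(page)
parser.close()
print(parser.documents[0]["identifier"])
```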
- The original discussion on additionalProperty and direct relations by Alasdair and others
- A recent discussion on how to specify profile specific relations for samples
- Schema.org: Evolution of Structured Data on the Web - good background reading for the philosophical approach of schema.org
- General Bioschemas specifications page
- The BioSchemas BioChemEntity specification
- The BioSchemas DataRecord specification
- The BioSchemas Protein profile