Adding profile specific relations to BioChemEntity and DataRecord

Justin Clark-Casey edited this page Jun 25, 2018 · 15 revisions

Status

This document explains how we are adding Bioschemas-specific relations to the BioChemEntity and DataRecord types. This matters both to publishers of generic life sciences data (such as the InterMine platform) and to general life sciences search applications such as Buzzbang. Please feel free to edit this document as necessary and add yourself as a contributor.

Many of the examples here refer to the Protein specification, which is a Bioschemas profile of the BioChemEntity type.

Authors

  • Justin Clark-Casey
  • Leyla Garcia: additional details on BioChemEntity, DataRecord and customized types, i.e., profiles.
  • Everybody who contributed to the discussions or examples linked by this doc

Introduction

Bioschemas is an initiative to embed schema.org-like metadata in webpages related to life sciences. This will promote use-cases such as improved findability of life sciences information.

Many of the Bioschemas specifications use existing schema.org types such as DataCatalog and make them suitable for Bioschemas purposes by making certain relations mandatory, controlling their cardinality, etc. However, BioChemEntity and DataRecord are new types; while the former is designed specifically for the life sciences, the latter is generic enough to be broadly adopted (there is also LabProtocol, but I'm not so familiar with that, so it is not discussed further here). BioChemEntity aims to describe biological, chemical and biochemical entities, while DataRecord makes it easier to associate BioChemEntities with data records, which are typically grouped together in a Dataset or DataCatalog.

The common relations on BioChemEntity and DataRecord are generic (contains, isContainedIn, isBasisFor, etc.). The questions then are: (1) how do we specify that a particular BioChemEntity is a protein, or that a particular DataRecord is for a sample? (2) How do we then give a protein BioChemEntity protein-specific relations such as amino acid sequence, or a sample DataRecord a relation such as diagnoses available?

Specifying the type of a BioChemEntity or DataRecord

For the first question, Bioschemas offers profiles for BioChemEntity, i.e., customizations of BioChemEntity, and for each profile we propose an officially supported type from a well-known ontology. Any other type of interest to data providers can be specified via the schema.org Thing.additionalType relation, where the URL is an ontology/schema/controlled vocabulary term denoting the entity. For a protein BioChemEntity this is something like

{
    "@context": "http://schema.org",
    "@type": ["BioChemEntity", "http://purl.obolibrary.org/obo/PR_000000001"],
    "additionalType": "http://semanticscience.org/resource/SIO_010043",
    "identifier": "P00519",
    "name": "ABL1",
    ...
}

In this case, http://purl.obolibrary.org/obo/PR_000000001 is the official type for entities under the Protein profile while http://semanticscience.org/resource/SIO_010043 is an additional Semanticscience Integrated Ontology (SIO) term for protein. This latter additionalType can be used by a provider to link an ontology they are already using. In the current specification draft it is recommended but will shortly become optional instead. See here for a more extensive protein example.
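As a rough illustration of how a consuming application might use these two relations, the sketch below collects every type declared on a node from both @type and additionalType and checks for the Protein profile type. The function names entity_types and is_protein are made up for this example and are not part of any Bioschemas tooling.

```python
PROTEIN_TYPE = "http://purl.obolibrary.org/obo/PR_000000001"


def entity_types(node):
    """Collect every type declared on a JSON-LD node (hypothetical helper)."""
    types = []
    for key in ("@type", "additionalType"):
        value = node.get(key, [])
        # Both keys may hold a single string or a list of strings.
        if isinstance(value, str):
            value = [value]
        types.extend(value)
    return types


def is_protein(node):
    """True if the node declares the official Protein profile type."""
    return PROTEIN_TYPE in entity_types(node)


abl1 = {
    "@context": "http://schema.org",
    "@type": ["BioChemEntity", PROTEIN_TYPE],
    "additionalType": "http://semanticscience.org/resource/SIO_010043",
    "identifier": "P00519",
    "name": "ABL1",
}

print(entity_types(abl1))
print(is_protein(abl1))
```

This treats the markup as plain JSON rather than expanding it with a JSON-LD processor, which is probably enough for a simple crawler but would miss types written as prefixed or relative terms.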

DataRecord does not have any profiles at this time. However, you can find out the type of the underlying entity denoted by a DataRecord through the Thing.mainEntity relation. Such a DataRecord would look something like:

{
  "@type": "DataRecord",
  "@id": "http://www.identifiers.org/uniprot/P00519",
  "identifier": "P00519",
  "mainEntity": {
    "@type": ["BioChemEntity","http://purl.obolibrary.org/obo/PR_000000001"],
    "additionalType": "http://semanticscience.org/resource/SIO_010043",
    "identifier": "P00519",
    "name": "Tyrosine-protein kinase ABL1"
  }
}

Rather than embedding a BioChemEntity, a DataRecord can also link to one through JSON-LD's @id mechanism. For a more extensive but slightly out-of-date DataRecord example see this page.
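A consumer then has to cope with both the embedded and the linked form. The following is a minimal sketch, assuming the application has already gathered the nodes it knows about into a dictionary keyed by @id; that dictionary (known_nodes) and the function name main_entity are assumptions made for this example.

```python
def main_entity(record, known_nodes=None):
    """Return the entity a DataRecord describes, embedded or linked by @id."""
    entity = record.get("mainEntity")
    if not isinstance(entity, dict):
        return None
    # A node that holds only "@id" is a link, not an embedded description.
    if set(entity) == {"@id"} and known_nodes:
        return known_nodes.get(entity["@id"], entity)
    return entity


protein = {
    "@id": "http://example.org/protein/P00519",  # hypothetical identifier
    "@type": ["BioChemEntity", "http://purl.obolibrary.org/obo/PR_000000001"],
    "name": "Tyrosine-protein kinase ABL1",
}

embedded_record = {"@type": "DataRecord", "mainEntity": protein}
linked_record = {"@type": "DataRecord", "mainEntity": {"@id": protein["@id"]}}
known_nodes = {protein["@id"]: protein}

print(main_entity(embedded_record)["name"])
print(main_entity(linked_record, known_nodes)["name"])
```

If the linked node is not known, the sketch returns the bare reference so the caller can decide whether to fetch it.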

Specifying relations for a customized type (profile) of BioChemEntity or DataRecord

So once we know the customized type (profile) for a BioChemEntity or DataRecord, how do we give it relations specific to that profile (e.g. protein or sample specific relations)? Whilst there has been extensive discussion on this question, I (justinccdev) don't believe it has yet been fully resolved. The two main alternatives are:

1. Use the BioChemEntity.additionalProperty and DataRecord.additionalProperty mechanisms

The BioChemEntity and DataRecord Bioschemas types already have an additionalProperty relation which is designed for adding arbitrary relations. For instance, a samples DataRecord that wanted to add a diagnosis available relation may have the form

{
  "@context": "http://schema.org",
  "@type": ["DataRecord"],
  "additionalProperty": [
    {
      "@type": "PropertyValue",
      "name": "diagnosis_available",
      "value": "urn:miriam:icd:C00-C97",
      "valueReference": [
        {
          "@type": "CategoryCode",
          "name": "Malignant neoplasms",
          "url": "http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
          "codeValue": "C00-C97.9"
        }
      ]
    }
  ]
}
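To give an idea of what reading this form involves on the consuming side, here is a small sketch that groups PropertyValue entries by name and keeps the labels from any valueReference. The helper name additional_properties and the shape of its return value are choices made for this example only.

```python
def additional_properties(node):
    """Group PropertyValue entries by name (hypothetical helper)."""
    grouped = {}
    entries = node.get("additionalProperty", [])
    if isinstance(entries, dict):  # a single entry may not be wrapped in a list
        entries = [entries]
    for entry in entries:
        if entry.get("@type") != "PropertyValue":
            continue
        references = entry.get("valueReference", [])
        if isinstance(references, dict):
            references = [references]
        grouped.setdefault(entry.get("name"), []).append({
            "value": entry.get("value"),
            "labels": [ref["name"] for ref in references if "name" in ref],
        })
    return grouped


sample_record = {
    "@context": "http://schema.org",
    "@type": ["DataRecord"],
    "additionalProperty": [
        {
            "@type": "PropertyValue",
            "name": "diagnosis_available",
            "value": "urn:miriam:icd:C00-C97",
            "valueReference": [
                {
                    "@type": "CategoryCode",
                    "name": "Malignant neoplasms",
                    "url": "http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
                    "codeValue": "C00-C97.9",
                }
            ],
        }
    ],
}

print(additional_properties(sample_record))
```

Note that the property names are free text, so a consumer still has to decide which spellings (diagnosis_available, diagnosisAvailable, ...) it treats as the same relation.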

The possible pros of this approach are that:

  • It's the easiest one where biobanks (hosting samples) can publish information. They don't have to agree on any pre-defined terms.
  • All information is in the document itself. Links can be present but we don't rely on them to furnish useful data (unlike the Linked Data approach).

Possible cons:

  • It doesn't allow easy use of existing validation tools to check the relations allowed by a particular Bioschemas BioChemEntity profile (e.g. protein).
  • Information ends up being repeated (e.g. category code name)
  • Adding information in other languages (e.g. translations of category code name) inflates file sizes.

2. Add additional relations directly

One can also add relations from arbitrary schemas directly, either inline or through a mechanism such as an additional JSON-LD context. For instance, an extreme version of the above example would define a SampleDataRecord type to act as a 'profile' of DataRecord and specify a diagnosisAvailable relation. The embedded JSON-LD may become

{
    "@context": ["http://schema.org", "http://bioschemas.org/samples"],
    "@type": ["SampleDataRecord"],
    "diagnosisAvailable": [
        "http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
        "http://purl.bioontology.org/ontology/ICD10/D00-D09.9"
    ]  
}

with a file hosted at http://bioschemas.org/samples containing the following

{
  "@context": {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#"
  },
  "@id": "http://bioschemas.org/samples",
  "@graph": [
    {
      "@id": "http://bioschemas.org/samples/SampleDataRecord",
      "@type": "rdfs:Class",
      "rdfs:subClassOf": { "@id": "http://schema.org/DataRecord" }
    },
    {
      "@id": "http://bioschemas.org/samples/diagnosisAvailable",
      "@type": "rdf:Property",
      "rdfs:label": "Diagnosis available",
      "http://schema.org/domainIncludes": [
        {
          "@id": "http://bioschemas.org/samples/SampleDataRecord"
        }
      ],
      "http://schema.org/rangeIncludes": [
        {
          "@id": "http://schema.org/URL"
        }
      ]
    }
  ]
}

This is an extreme example. It relies on an application visiting URLs such as http://purl.bioontology.org/ontology/ICD10/C00-C97.9 and retrieving structured data stating that the name of that term is Malignant neoplasms, that its code is C00-C97.9, etc.
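The extra step this puts on consumers can be sketched as follows. The lookup argument is a hypothetical callable that takes a term URL and returns whatever structured data could be retrieved for it (or None); in a real application it would issue an HTTP request and parse the response, whereas here it is stubbed with a dictionary so the example stays self-contained.

```python
def resolve_diagnoses(record, lookup):
    """Pair each diagnosisAvailable URL with the data retrieved for it."""
    urls = record.get("diagnosisAvailable", [])
    if isinstance(urls, str):
        urls = [urls]
    return [(url, lookup(url)) for url in urls]


record = {
    "@context": ["http://schema.org", "http://bioschemas.org/samples"],
    "@type": ["SampleDataRecord"],
    "diagnosisAvailable": [
        "http://purl.bioontology.org/ontology/ICD10/C00-C97.9",
        "http://purl.bioontology.org/ontology/ICD10/D00-D09.9",
    ],
}

# Stand-in for a remote lookup: only the first term can be "retrieved".
stub_terms = {
    "http://purl.bioontology.org/ontology/ICD10/C00-C97.9": {
        "name": "Malignant neoplasms",
        "codeValue": "C00-C97.9",
    }
}

for url, term in resolve_diagnoses(record, stub_terms.get):
    print(url, term["name"] if term else "(could not be resolved)")
```

The second URL shows the failure mode: when a term cannot be retrieved, the consumer is left with only the bare URL.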

Possible pros of this approach:

  • Using existing validation tools should be easier, for example that SampleDataRecord is a recognized bioschemas profile and that diagnosisAvailable is a recognized relation in it, rather than having to code something custom if additionalProperty entries need validation.
  • Information such as name and codeValue can be retrieved from a single canonical location rather than repeated in the text, possibly with mistakes.
  • Easier to put different language translations in a central file such as http://bioschemas.org/samples

Possible cons:

  • More complex for people doing Bioschemas markup, who are far more numerous, and on average have less technical capacity, than those writing consuming applications.
  • Not so easy to add arbitrary properties not already defined upfront in Bioschemas. It is possible through direct inlining of vocabulary or additional contexts, but this is adding complexity and possible dependencies on further files in addition to http://bioschemas.org/samples
  • http://bioschemas.org/samples needs to be permanently and reliably available (though possibly could be served out of a github location instead).

Discussion

(This discussion is currently from my (justinccdev) pov, so subjective)

It's an important point that direct specification of relations looks more complicated and a bit less flexible. An important goal of schema.org and hence Bioschemas, as discussed in Schema.org: Evolution of Structured Data on the Web, is that markup is easy for database publishers to create and developers often do it by adapting examples. The onus is on data consumers to do more legwork in cleaning things up and putting data together, as opposed to the Linked Data approach which puts more burden on the publishers. Hence, it may be better in certain situations (?) such as samples to use the additionalProperty mechanism which is relatively simple for data publishers to implement.

However, there's no reason that both approaches can't be used simultaneously in Bioschemas. It may be better to use a direct specification approach where relations are well-known (such as amino acid sequence in proteins), to allow easier validation and make published data quality higher.

Arguably, it may not even be that important to get the absolutely best structure initially, as long as data publishers broadly agree on how they structure the embedded data, so that data consumers have something better to work with than scraping the webpage (even just having the data in a <script type="application/ld+json"> element may be a significant benefit). When there is some data and people are using it, that provides the justification to spend more time on improving the representation if necessary.
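As a rough indication of how little a consumer needs in order to benefit from embedded markup, the sketch below pulls every JSON-LD block out of a page using only the Python standard library. The class and function names are made up for this example, and it assumes each script element contains well-formed JSON (json.loads raises an error otherwise).

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every application/ld+json script."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


page = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "BioChemEntity", "name": "ABL1"}
</script>
</head><body><p>ABL1</p></body></html>"""

print(extract_jsonld(page))
```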

References