More material about schema design

monarch-initiative · Oct 9, 2023 · 51cebbc
1 parent 3547dfb
Showing 1 changed file with 73 additions and 43 deletions.
diff --git a/docs/custom.md b/docs/custom.md
@@ -50,6 +50,36 @@ imports:
 
 The classes in the schema define the "things" you are interested in extracting. LinkML doesn't make many assumptions about the difference between a class and a relationship, a node and an edge, or a relation and a property. It's designed to be flexible enough to handle a variety of data models.
 
+The start of this section is indicated by `classes:`.
+
+A minimal class may look like this:
+
+```yaml
+  ClassName:
+    is_a: NamedEntity
+    attributes:
+      entity:
+        range: string
+        description: >- 
+          A named entity.
+```
+
+In practice, this class won't do much: it gives OntoGPT little to work with and few instructions from which to build an LLM prompt. That's fine, because a class can carry more detail.
+
+These fields may be used in classes:
+
+* `is_a`: This describes a hierarchical structure, so the value of this slot is the name of a LinkML class. `NamedEntity` is defined in OntoGPT's core schema and will ensure extracted objects of this class have both unique identifiers and human-readable labels.
+* `tree_root`: If `true`, this class will be treated as the root of the data hierarchy. If you're planning to extract specific objects from a full text document, for example, it may be useful to define a class for the document to contain its metadata. This parent class could then be the `tree_root`.
+* `attributes`: This slot defines all class attributes, and in OntoGPT, that means each will be included in a prompt for the LLM. Each attribute should have a unique, lowercased name. Attributes have their own slots:
+  * `description`: The attribute description to be *passed as part of the prompt*. This should describe the attribute and how it should be formatted in the generated output. Do not include references to specific identifiers here.
+  * `multivalued`: If `true`, any value for this attribute will be interpreted as a list. This is crucial if you expect multiple values in the extracted output and should be reflected in the description by indicating how each value should be separated. OntoGPT prefers semicolons.
+  * `range`: The class to restrict the object to. This may be an abstract data type like `string` or another class defined elsewhere in your schema, like `Gene` in the example below.
+* `id_prefixes`: A list of identifier prefixes to ground values of this class to. These are usually specific to a class rather than an attribute. Use capitalized forms and omit the colon. If you want to ground to MeSH terms, for example, include the prefix `MESH`.
+* `annotations`: This slot contains specific instructions for OntoGPT in its annotation and grounding operations. The heading `annotators`, placed under this slot, must contain a comma-separated list of value annotators provided by the Ontology Access Kit (OAK). [In OAK these are called *implementations* or *adapters* and there are many of them available](https://incatools.github.io/ontology-access-kit/packages/implementations/index.html). Annotators are responsible for bridging the gap between raw text and unique identifiers, though that process may involve searching a combination of term lists along with their synonyms and equivalents.
+  * OBO Foundry ontologies make great annotators. To use CHEBI for chemical names, for example, use the annotator `sqlite:obo:chebi` and include `CHEBI` in the `id_prefixes` list.
+  * Ontologies in BioPortal work well, too. They may be specified with the BioPortal ID. To use the EnvThes ecological thesaurus, for example, use the annotator `bioportal:ENVTHES` and the prefix `ENVTHES`.
+* `slot_usage`: This slot can contain rules about how another slot may be restricted. In the example below, `GeneLocation` has values for its `id` slot restricted to values within two different *enums*. See the next section for more information on how to use enums.
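+
+As a rough sketch of how the fields listed above fit together, a container class and a grounded class might be written as follows. The class names `Document` and `Chemical` and the attribute name `chemicals` are hypothetical and chosen only for illustration; the CHEBI prefix and annotator are the ones mentioned in the list.
+
+```yaml
+classes:
+
+  # Hypothetical root class representing the whole input text
+  Document:
+    tree_root: true
+    attributes:
+      chemicals:
+        description: >-
+          A semicolon-separated list of chemicals named in the text.
+        multivalued: true
+        range: Chemical
+
+  # Hypothetical class whose values are grounded to CHEBI identifiers
+  Chemical:
+    is_a: NamedEntity
+    id_prefixes:
+      - CHEBI
+    annotations:
+      annotators: sqlite:obo:chebi
+```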
+
 An example, continuing from where the header left off:
 
 ```yaml
@@ -171,12 +201,6 @@ classes:
         range: MolecularActivity
         annotations:
           prompt: the name of the molecular function in the pair. This comes second. May be a GO term.
-    annotations:
-      prompt.example: |-
-        TODO
-        
-        gene: HGNC:1234
-        molecular_activity: GO:0003674
 
   GeneMolecularActivityRelationship2:
     is_a:   CompoundExpression
@@ -209,7 +233,15 @@ classes:
         range: Gene
       gene2:
         range: Gene
+```
+
+### Enums
+
+LinkML supports defining *enums*, or sets of values. In OntoGPT this allows schemas to work with subsets of identifiers. Enums have their own hierarchy. In the example below, the `reachable_from` slot is used to define sets of values: in `GOCellComponentType` these are all children of the GO term with the ID `GO:0005575` (cellular component), so restricting a set of identifiers based on this enum will ensure they all correspond to cellular components.
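+
+A minimal sketch of the `GOCellComponentType` enum described here, assuming the Gene Ontology is referenced as `obo:go` (that `source_ontology` value is an assumption; adjust it to the ontology source you actually use):
+
+```yaml
+enums:
+
+  GOCellComponentType:
+    reachable_from:
+      # Assumed ontology source for GO
+      source_ontology: obo:go
+      # All descendants of this term are members of the enum
+      source_nodes:
+        - GO:0005575 ## cellular component
+```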
 
+Example, starting where the classes left off above:
+
+```yaml
 enums:
 
   GeneLocationEnum:
@@ -229,6 +261,10 @@ enums:
         - CL:0000000 ## cell
 ```
 
+### Schema design tips
+
+It helps to have an understanding of the [LinkML](https://linkml.io) schema language, but it should be possible to define your own schemas using the examples in [src/ontogpt/templates](src/ontogpt/templates/) as a guide.
+
 * Prompt hints can be specified using the `prompt` annotation (otherwise description is used)
 * Multivalued fields are supported
 * The default range is string — these are not grounded. Ex.: disease name, synonyms
@@ -237,8 +273,6 @@ enums:
 
 We recommend following an established schema like [BioLink Model](https://github.com/biolink/biolink-model), but you can define your own.
 
-Next step is to compile the schema. For that, you should place the schema YAML in the directory [src/ontogpt/templates/](src/ontogpt/templates/). Then, run the `make` command at the top level. This will compile the schema to Python (Pydantic classes).
-
 Once you have defined your own schema / data model and placed it in the correct directory, you can run the `extract` command.
 
 Ex.:
@@ -247,41 +281,6 @@ Ex.:
 ontogpt extract -t mendelian_disease.MendelianDisease -i marfan-wikipedia.txt
 ```
 
-### Multiple levels of nesting
-
-Currently no more than two levels of nesting are recommended.
-
-If a field has a range which is itself a class and not a primitive, it will attempt to nest.
-
-Ex. the `gocam` schema has an attribute:
-
-```yaml
-  attributes:
-      ...
-      gene_functions:
-        description: semicolon-separated list of gene to molecular activity relationships
-        multivalued: true
-        range: GeneMolecularActivityRelationship
-```
-
-The range `GeneMolecularActivityRelationship` has been specified _inline_, so it will nest.
-
-The generated prompt is:
-
-```bash
-gene_functions : <semicolon-separated list of gene to molecular activities relationships>
-```
-
-The output of this is then passed through further SPIRES iterations.
-
-### Text length limit
-
-LLMs have context sizes limiting the combined length of their inputs and outputs. The `gpt-3.5-turbo` model, for example, has a 4,096 token limit (prompt + completion), while the `gpt-3.5-turbo-16k` model has a larger context of 16,384 tokens.
-
-### Schema tips
-
-It helps to have an understanding of the [LinkML](https://linkml.io) schema language, but it should be possible to define your own schemas using the examples in [src/ontogpt/templates](src/ontogpt/templates/) as a guide.
-
 OntoGPT-specific extensions are specified as _annotations_.
 
 You can specify a set of annotators for a field using the `annotators` annotation.
@@ -337,6 +336,37 @@ enums:
         - CL:0000000 ## cell
 ```
 
+#### Multiple levels of nesting
+
+Currently no more than two levels of nesting are recommended.
+
+If a field has a range which is itself a class and not a primitive, it will attempt to nest.
+
+Ex. the `gocam` schema has an attribute:
+
+```yaml
+  attributes:
+      ...
+      gene_functions:
+        description: semicolon-separated list of gene to molecular activity relationships
+        multivalued: true
+        range: GeneMolecularActivityRelationship
+```
+
+The range `GeneMolecularActivityRelationship` has been specified _inline_, so it will nest.
+
+The generated prompt is:
+
+```bash
+gene_functions : <semicolon-separated list of gene to molecular activities relationships>
+```
+
+The output of this is then passed through further SPIRES iterations.
+
+#### Text length limit
+
+LLMs have context sizes limiting the combined length of their inputs and outputs. The `gpt-3.5-turbo` model, for example, has a 4,096 token limit (prompt + completion), while the `gpt-3.5-turbo-16k` model has a larger context of 16,384 tokens.
+
 ## Install a custom schema
 
 If you have installed OntoGPT directly from its GitHub repository, then you may install a custom schema like this: