More material about schema design
caufieldjh committed Oct 9, 2023
1 parent 3547dfb commit 51cebbc
Showing 1 changed file with 73 additions and 43 deletions.
116 changes: 73 additions & 43 deletions docs/custom.md
@@ -50,6 +50,36 @@ imports:
The classes in the schema define the "things" you are interested in extracting. LinkML doesn't make many assumptions about the difference between a class and a relationship, a node and an edge, or a relation and a property. It's designed to be flexible enough to handle a variety of data models.
The start of this section is indicated by `classes:`.

A minimal class may look like this:

```yaml
ClassName:
  is_a: NamedEntity
  attributes:
    entity:
      range: string
      description: >-
        A named entity.
```

In practice, this class won't do much: it gives OntoGPT little to work with and few instructions from which to build an LLM prompt. That's fine, because we can add more.

These fields may be used in classes:

* `is_a`: This describes a hierarchical structure, so the value of this slot is the name of a LinkML class. `NamedEntity` is defined in OntoGPT's core schema and will ensure extracted objects of this class have both unique identifiers and human-readable labels.
* `tree_root`: If `true`, this class will be treated as the root of the data hierarchy. If you're planning to extract specific objects from a full text document, for example, it may be useful to define a class for the document to contain its metadata. This parent class could then be the `tree_root`.
* `attributes`: This slot defines all class attributes, and in OntoGPT, that means each will be included in a prompt for the LLM. Each attribute should have a unique, lowercased name. Attributes have their own slots:
    * `description`: The attribute description to be *passed as part of the prompt*. This should describe the attribute and how it should be formatted in the generated output. Do not include references to specific identifiers here.
    * `multivalued`: If `true`, any value for this attribute will be interpreted as a list. This is crucial if you expect multiple values in the extracted output and should be reflected in the description by indicating how each value should be separated. OntoGPT prefers semicolons.
    * `range`: The class to restrict the object to. This may be an abstract data type like `string` or another class defined elsewhere in your schema, like `Gene` in the example below.
* `id_prefixes`: A list of identifiers to ground values of this class to. Usually specific to a class rather than an attribute. Use capitalized forms and omit the colon. If you want to ground to MeSH terms, for example, include the prefix `MESH`.
* `annotations`: This slot contains specific instructions for OntoGPT in its annotation and grounding operations. The heading `annotators`, placed under this slot, must contain a comma-separated list of value annotators provided by the Ontology Access Kit (OAK). In OAK these are called *implementations* or *adapters*, and [many of them are available](https://incatools.github.io/ontology-access-kit/packages/implementations/index.html). Annotators are responsible for bridging the gap between raw text and unique identifier, though that process may involve searching a combination of term lists along with their synonyms and equivalents.
    * OBO Foundry ontologies make great annotators. To use CHEBI for chemical names, for example, use the annotator `sqlite:obo:chebi` and include `CHEBI` in the `id_prefixes` list.
    * Ontologies in BioPortal work well, too. They may be specified with the BioPortal ID. To use the EnvThes ecological thesaurus, for example, use the annotator `bioportal:ENVTHES` and the prefix `ENVTHES`.
* `slot_usage`: This slot can contain rules about how another slot may be restricted. In the example below, `GeneLocation` has values for its `id` slot restricted to values within two different *enums*. See the next section for more information on how to use enums.
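
To illustrate how the fields above might fit together, here is a minimal sketch of a schema fragment that grounds chemical names to CHEBI. The class names `Document` and `Chemical` and the attribute name `chemicals` are hypothetical and chosen only for illustration; the `sqlite:obo:chebi` annotator and the `CHEBI` prefix come from the guidance above.

```yaml
classes:
  # Hypothetical container class; tree_root marks it as the top of the hierarchy.
  Document:
    tree_root: true
    attributes:
      chemicals:
        description: >-
          A semicolon-separated list of chemicals named in the text.
        multivalued: true
        range: Chemical

  # Hypothetical grounded class; NamedEntity supplies the identifier and label.
  Chemical:
    is_a: NamedEntity
    id_prefixes:
      - CHEBI
    annotations:
      annotators: sqlite:obo:chebi
```

Keeping the grounding details (`id_prefixes`, `annotators`) on the class rather than on the attribute means any attribute whose `range` is `Chemical` would be grounded the same way.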

An example, continuing from where the header left off:

```yaml
@@ -171,12 +201,6 @@ classes:
range: MolecularActivity
annotations:
prompt: the name of the molecular function in the pair. This comes second. May be a GO term.
annotations:
prompt.example: |-
TODO
gene: HGNC:1234
molecular_activity: GO:0003674
GeneMolecularActivityRelationship2:
is_a: CompoundExpression
@@ -209,7 +233,15 @@ classes:
range: Gene
gene2:
range: Gene
```

### Enums

LinkML supports defining *enums*, or sets of values. In OntoGPT this allows schemas to work with subsets of identifiers. Enums have their own hierarchy. In the example below, the `reachable_from` slot is used to define sets of values: in `GOCellComponentType` these are all children of the GO term with the ID `GO:0005575` (cellular component), so restricting a set of identifiers based on this enum will ensure they all correspond to cellular components.

Example, starting where the classes left off above:

```yaml
enums:
GeneLocationEnum:
@@ -229,6 +261,10 @@ enums:
- CL:0000000 ## cell
```
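
As a rough sketch of how an enum might connect to a class, the fragment below defines one enum with `reachable_from` and applies it through `slot_usage`. `reachable_from`, `source_ontology`, `source_nodes`, and `values_from` are LinkML keywords; the class name `CellularComponent` is hypothetical, and `obo:go` as the ontology handle and `sqlite:obo:go` as the annotator are assumptions based on the conventions described in this document.

```yaml
classes:
  CellularComponent:            # hypothetical class name
    is_a: NamedEntity
    id_prefixes:
      - GO
    annotations:
      annotators: sqlite:obo:go # assumed annotator, following the OBO pattern
    slot_usage:
      id:
        # Restrict grounded identifiers to members of the enum below.
        values_from:
          - GOCellComponentType

enums:
  GOCellComponentType:
    reachable_from:
      source_ontology: obo:go   # assumed ontology handle
      source_nodes:
        - GO:0005575 ## cellular_component
```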

### Schema design tips

It helps to have an understanding of the [LinkML](https://linkml.io) schema language, but it should be possible to define your own schemas using the examples in [src/ontogpt/templates](src/ontogpt/templates/) as a guide.

* Prompt hints can be specified using the `prompt` annotation (otherwise description is used)
* Multivalued fields are supported
* The default range is string — these are not grounded. Ex.: disease name, synonyms
@@ -237,8 +273,6 @@ enums:

We recommend following an established schema like [BioLink Model](https://github.com/biolink/biolink-model), but you can define your own.
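
The first few tips in the list above can be sketched as follows. The attribute names `disease_name` and `synonyms` are hypothetical, borrowed from the examples in the list; neither attribute sets a `range`, so both default to ungrounded strings.

```yaml
attributes:
  disease_name:
    # No prompt annotation here, so the description is used in the prompt.
    description: the name of the disease
  synonyms:
    description: alternative names for the disease
    multivalued: true
    annotations:
      # When present, this hint is used in place of the description.
      prompt: semicolon-separated list of synonyms for the disease
```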

Next step is to compile the schema. For that, you should place the schema YAML in the directory [src/ontogpt/templates/](src/ontogpt/templates/). Then, run the `make` command at the top level. This will compile the schema to Python (Pydantic classes).

Once you have defined your own schema / data model and placed it in the correct directory, you can run the `extract` command.

Ex.:
@@ -247,41 +281,6 @@ Ex.:
ontogpt extract -t mendelian_disease.MendelianDisease -i marfan-wikipedia.txt
```

### Multiple levels of nesting

Currently no more than two levels of nesting are recommended.

If a field has a range which is itself a class and not a primitive, it will attempt to nest.

Ex. the `gocam` schema has an attribute:

```yaml
attributes:
  ...
  gene_functions:
    description: semicolon-separated list of gene to molecular activity relationships
    multivalued: true
    range: GeneMolecularActivityRelationship
```

The range `GeneMolecularActivityRelationship` has been specified _inline_, so it will nest.

The generated prompt is:

```bash
gene_functions : <semicolon-separated list of gene to molecular activities relationships>
```

The output of this is then passed through further SPIRES iterations.

### Text length limit

LLMs have context sizes limiting the combined length of their inputs and outputs. The `gpt-3.5-turbo` model, for example, has a 4,096 token limit (prompt + completion), while the `gpt-3.5-turbo-16k` model has a larger context of 16,384 tokens.

### Schema tips

It helps to have an understanding of the [LinkML](https://linkml.io) schema language, but it should be possible to define your own schemas using the examples in [src/ontogpt/templates](src/ontogpt/templates/) as a guide.

OntoGPT-specific extensions are specified as _annotations_.

You can specify a set of annotators for a field using the `annotators` annotation.
@@ -337,6 +336,37 @@ enums:
- CL:0000000 ## cell
```

#### Multiple levels of nesting

Currently no more than two levels of nesting are recommended.

If a field has a range which is itself a class and not a primitive, it will attempt to nest.

Ex. the `gocam` schema has an attribute:

```yaml
attributes:
  ...
  gene_functions:
    description: semicolon-separated list of gene to molecular activity relationships
    multivalued: true
    range: GeneMolecularActivityRelationship
```

The range `GeneMolecularActivityRelationship` has been specified _inline_, so it will nest.

The generated prompt is:

```bash
gene_functions : <semicolon-separated list of gene to molecular activities relationships>
```

The output of this is then passed through further SPIRES iterations.
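
For context, a sketch of what the nested class might look like is given below. The `molecular_activity` attribute and its prompt follow the class example earlier in this document. The `is_a: CompoundExpression` parent and the wording of the `gene` prompt are assumptions added for illustration, so check the `gocam` template for the actual definition.

```yaml
classes:
  GeneMolecularActivityRelationship:
    is_a: CompoundExpression    # assumed parent class
    attributes:
      gene:
        range: Gene
        annotations:
          # Assumed wording, mirroring the molecular_activity prompt.
          prompt: the name of the gene in the pair. This comes first.
      molecular_activity:
        range: MolecularActivity
        annotations:
          prompt: the name of the molecular function in the pair. This comes second. May be a GO term.
```

Each semicolon-separated item extracted for `gene_functions` would then be parsed against this class in a further pass, which is why keeping nesting shallow is recommended.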

#### Text length limit

LLMs have context sizes limiting the combined length of their inputs and outputs. The `gpt-3.5-turbo` model, for example, has a 4,096 token limit (prompt + completion), while the `gpt-3.5-turbo-16k` model has a larger context of 16,384 tokens.

## Install a custom schema

If you have installed OntoGPT directly from its GitHub repository, then you may install a custom schema like this: