bugs in airr-schema.yaml / airr-schema-openapi3.yaml #813

Open
8 of 15 tasks
LonnekeScheffer opened this issue Nov 12, 2024 · 6 comments · May be fixed by #773, #814 or #815

Comments

@LonnekeScheffer

LonnekeScheffer commented Nov 12, 2024

Some issues encountered when auto-generating LinkML from airr-schema.yaml / airr-schema-openapi3.yaml
These are all relatively easy/small fixes.

Issues for both AIRR version 1.5 and 2.0

  • Property release_version has inconsistent type. In AlleleDescription, type=integer; in GermlineSet, type=number. Should be integer in both cases.
  • Ontology species is missing source node in some places. This ontology is correctly defined in Subject, but in AlleleDescription and GermlineSet, the expected field x-airr/ontology/top_node/id is missing.
  • (does not need to be handled here, see comments) Some enums are defined several times, and differ in that they contain 'null' in some places, but not in others: locus (contains 'null' in Rearrangement, but not in AlleleDescription, GermlineSet, Genotype); mhc_class (contains 'null' in Reactivity, not in MHCGenotype)
  • For Ontology 'Property' in class 'CellExpression' (v1.5) / 'Expression' (v2.0) the expected field x-airr/ontology/top_node/id is missing.
  • For ontology antigen in class 'ReceptorReactivity' (v1.5) / 'Reactivity' (v2.0) the expected field x-airr/ontology/top_node/id is missing.
  • (does not need to be handled here, see comments) Property nodes of class Tree should have $ref: '#/Node' directly under nodes, now it is placed under a field called additionalProperties which is not used anywhere else
  • RepertoireGroup contains property repertoires. This is not simply an array of Repertoire objects, but rather, an array of items which are unnamed (!) objects, where each object contains a repertoire_id, repertoire_description, time_point. There are no other comparable occurrences in the schema files where new objects are defined in such a way. Since this object is listed under 'items', I cannot infer a name for this object when auto-generating LinkML. Therefore, I would suggest creating a new named class, e.g., RepertoireReference, which contains these three fields, and is used as a 'type' for this array.
  • New issue identified: the field 'name' of 'Acknowledgement' clashes with the AKC (it means 'Full name of individual' in AIRR but 'name for a thing' in AKC). I would recommend changing 'name' to something like 'person_name'.
  • For class TimePoint, field 'unit' is an Ontology (UO:0000033 time unit). This name is too generic and clashes with the AKC. Should be renamed time_unit instead.
  • Similarly to the above point, TimePoint has a field called 'value' which clashes in meaning with the AKC. Should be renamed time_value instead.
  • Field 'sequencing_run_date' of class SequencingRun has format: date, whereas all other dates have format: date-time. While this does not cause direct issues, I would recommend changing this format to date-time for consistency.
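
A check like the release_version one above can be scripted. The following is a minimal sketch, assuming the schema has been loaded into a plain dict of class name to definition; the `schema` excerpt is a toy stand-in (not the real airr-schema.yaml) and `find_type_conflicts` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy excerpt, NOT the real airr-schema.yaml: only the classes and the
# property named in the first bullet are reproduced (Subject is filler).
schema = {
    "AlleleDescription": {"properties": {"release_version": {"type": "integer"}}},
    "GermlineSet": {"properties": {"release_version": {"type": "number"}}},
    "Subject": {"properties": {"subject_id": {"type": "string"}}},
}


def find_type_conflicts(schema):
    """Return {property: {type: [classes]}} for properties typed differently across classes."""
    seen = defaultdict(lambda: defaultdict(list))
    for class_name, class_def in schema.items():
        for prop_name, prop_def in class_def.get("properties", {}).items():
            seen[prop_name][prop_def.get("type")].append(class_name)
    # Keep only properties that appear with more than one distinct type.
    return {
        prop: dict(by_type)
        for prop, by_type in seen.items()
        if len(by_type) > 1
    }


conflicts = find_type_conflicts(schema)
print(conflicts)
# {'release_version': {'integer': ['AlleleDescription'], 'number': ['GermlineSet']}}
```

Properties that are defined via $ref rather than type would show up with type None here, so a real run would probably need to compare the $ref target as well.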

Issues for AIRR version 2.0 only

  • Property unit occurs as an ontology across different classes. This ontology does not have the same meaning. E.g.: TimePoint (unit=UO:0000003), TimeInterval (unit=UO:0000033), PhysicalQuantity (unit=UO:0000024), TimeQuantity (unit=UO:0000033). The field 'unit' should be renamed to something unique (time_unit, time_interval_unit, physical_quantity_unit)
  • For ontology orcid_id in class Contributor the expected field x-airr/ontology/top_node/id is null.
  • For ontology affiliation in class Contributor the expected field x-airr/ontology/top_node/id is null.
  • Property reactivity_ref is defined with two different types. Inside class Reactivity, the type is array. Inside class Rearrangement, the type is string, but the description explicitly states that this string should be a comma-separated list. I would propose changing the type to array. (Rearrangement also contains reactivity_id, which is likewise a comma-separated list stored as a string; its type should probably change as well.)
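
The missing/null top_node and the conflicting unit ontologies can be found with the same kind of walk over the schema. This is only a sketch under assumptions: ontology properties are recognised here by `$ref: '#/Ontology'` (an assumption about how the schema marks them), the `schema` dict is a toy excerpt using the IDs quoted in the bullets above, and `top_node_id` / `audit_ontologies` are hypothetical helper names.

```python
def top_node_id(prop_def):
    """Follow x-airr/ontology/top_node/id; return None if any step is missing or null."""
    node = prop_def
    for key in ("x-airr", "ontology", "top_node", "id"):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    return node


def ontology_prop(node_id=None, with_top_node=True):
    """Build a toy ontology property definition (illustrative helper only)."""
    prop = {"$ref": "#/Ontology"}
    if with_top_node:
        prop["x-airr"] = {"ontology": {"top_node": {"id": node_id}}}
    return prop


# Toy excerpt, NOT the real schema; IDs are the ones quoted in the issue.
schema = {
    "TimePoint": {"properties": {"unit": ontology_prop("UO:0000003")}},
    "PhysicalQuantity": {"properties": {"unit": ontology_prop("UO:0000024")}},
    "Contributor": {"properties": {"orcid_id": ontology_prop(None)}},
    "GermlineSet": {"properties": {"species": ontology_prop(with_top_node=False)}},
}


def audit_ontologies(schema):
    """Return (missing, conflicting): props without a top node id, and props whose id differs by class."""
    missing = []
    ids_by_prop = {}
    for class_name, class_def in schema.items():
        for prop_name, prop_def in class_def.get("properties", {}).items():
            if prop_def.get("$ref") != "#/Ontology":
                continue
            node_id = top_node_id(prop_def)
            if node_id is None:
                missing.append((class_name, prop_name))
            else:
                ids_by_prop.setdefault(prop_name, {})[class_name] = node_id
    conflicting = {
        prop: by_class
        for prop, by_class in ids_by_prop.items()
        if len(set(by_class.values())) > 1
    }
    return missing, conflicting


missing, conflicting = audit_ontologies(schema)
print(missing)      # [('Contributor', 'orcid_id'), ('GermlineSet', 'species')]
print(conflicting)  # {'unit': {'TimePoint': 'UO:0000003', 'PhysicalQuantity': 'UO:0000024'}}
```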
@schristley
Member

  • RepertoireGroup contains property repertoires. This is not simply an array of Repertoire objects, but rather, an array of items which are unnamed (!) objects, where each object contains a repertoire_id, repertoire_description, time_point. There are no other comparable occurrences in the schema files where new objects are defined in such a way. Since this object is listed under 'items', I cannot infer a name for this object when auto-generating LinkML. Therefore, I would suggest creating a new named class, e.g., RepertoireReference, which contains these three fields, and is used as a 'type' for this array.

Yes, this is a good point that we haven't addressed in #773 yet, and it is a criterion we applied to the germline gene set schema. Adding a new class is the solution, though I'm not sure of the best name. Maybe RepertoireDetail, as it's details about how the repertoire is organized in the group.
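
To make the proposed change concrete, here is a rough sketch of hoisting the anonymous items object into a named class. The field types in the toy `schema` are assumptions (only the three field names come from the issue), the `#/ClassName` reference style is borrowed from the `$ref: '#/Node'` example elsewhere in this thread, and `hoist_inline_items` is a hypothetical helper, not part of any AIRR tooling.

```python
import copy

# Toy excerpt of RepertoireGroup; field types are assumed for illustration.
schema = {
    "RepertoireGroup": {
        "type": "object",
        "properties": {
            "repertoires": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "repertoire_id": {"type": "string"},
                        "repertoire_description": {"type": "string"},
                        "time_point": {"$ref": "#/TimePoint"},
                    },
                },
            }
        },
    }
}


def hoist_inline_items(schema, class_name, prop_name, new_class_name):
    """Return a copy of schema where the inline items object becomes a named top-level class."""
    result = copy.deepcopy(schema)
    prop = result[class_name]["properties"][prop_name]
    inline = prop["items"]
    if "$ref" in inline:
        return result  # already refers to a named class, nothing to do
    if new_class_name in result:
        raise ValueError(f"class {new_class_name!r} already exists")
    result[new_class_name] = inline
    prop["items"] = {"$ref": f"#/{new_class_name}"}
    return result


fixed = hoist_inline_items(schema, "RepertoireGroup", "repertoires", "RepertoireDetail")
print(fixed["RepertoireGroup"]["properties"]["repertoires"]["items"])
# {'$ref': '#/RepertoireDetail'}
print(sorted(fixed["RepertoireDetail"]["properties"]))
# ['repertoire_description', 'repertoire_id', 'time_point']
```

Whichever name is chosen (RepertoireReference, RepertoireDetail), the point is that the array items then have a name a generator can use.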

@schristley
Member

Hi @LonnekeScheffer, we need to differentiate these issues between v1.5 and v2.0. If v2.0 ends up being broken for a while, that's ok; all the data is currently in v1.5, so that's the priority. Maybe some of these issues are only present in v2.0, so we can ignore them for now.

@LonnekeScheffer
Author

Hi @schristley, I have now updated the list to separate the AIRR v1.5 issues (which also occur in v2.0) from the v2.0-only issues.

This was linked to pull requests Nov 26, 2024
@schristley
Member

  • Some enums are defined several times, and differ in that they contain 'null' in some places, but not in others: locus (contains 'null' in Rearrangement, but not in AlleleDescription, GermlineSet, Genotype); mhc_class (contains 'null' in Reactivity, not in MHCGenotype)

For locus, this is the difference between being required or not, as it is required for the Germline classes but not for Rearrangement. We will need to handle this as an exception in the AKC convert script.

Likewise for mhc_class
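
One possible way to handle that exception in a convert script is to compare enums with null stripped out and carry the nullability separately. This is a sketch only: the locus values shown are a shortened illustrative subset, and `split_nullable` / `same_enum_modulo_null` are hypothetical names, not functions from the AKC convert script.

```python
def split_nullable(enum_values):
    """Return (values without null, whether null was present)."""
    values = [v for v in enum_values if v is not None]
    return values, len(values) != len(enum_values)


def same_enum_modulo_null(a, b):
    """True if two enums list the same values once null is ignored."""
    return set(split_nullable(a)[0]) == set(split_nullable(b)[0])


# YAML null loads as Python None. Shortened, illustrative value lists.
locus_rearrangement = ["IGH", "IGK", "IGL", None]  # optional field, null allowed
locus_germline = ["IGH", "IGK", "IGL"]             # required field, no null

print(same_enum_modulo_null(locus_rearrangement, locus_germline))  # True
print(split_nullable(locus_rearrangement))  # (['IGH', 'IGK', 'IGL'], True)
print(split_nullable(locus_germline))       # (['IGH', 'IGK', 'IGL'], False)
```

With this, a single shared enum could be emitted, and the null information would be expressed as required versus optional on each slot.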

@schristley
Member

schristley commented Nov 26, 2024

  • For Ontology 'Property' in class 'CellExpression' (v1.5) / 'Expression' (v2.0) the expected field x-airr/ontology/top_node/id is missing.
  • For ontology antigen in class 'ReceptorReactivity' (v1.5) / 'Reactivity' (v2.0) the expected field x-airr/ontology/top_node/id is missing.

@bussec @bcorrie This seems to be incorrect usage by AIRR standards. Gene names are not ontology-based exactly, i.e. each individual gene is not an ontology term. This is more like a controlled vocabulary from an external source. I don't think we have a similar situation elsewhere in the AIRR schema, so I think we need to define this differently.

Likewise for antigen, each antigen is not an ontology term.

@schristley
Member

  • Property nodes of class Tree should have $ref: '#/Node' directly under nodes, now it is placed under a field called additionalProperties which is not used anywhere else

@LonnekeScheffer this is actually correct; it is just uncommon. What this means is that instead of the keys/properties of the dictionary being pre-defined, they are dynamic. The additionalProperties then indicates the schema for those dynamic properties.

For LinkML, this gets translated into a multivalued range.
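
As a rough illustration of the additionalProperties semantics: the keys under nodes are arbitrary, and every value has to match the Node schema. In the sketch below, `check_node` is a stand-in for real Node validation and the `sequence_alignment` field is an assumption about what a Node contains; the node identifiers are made up.

```python
# Shape of the definition being discussed: no fixed keys under "nodes",
# but every value must conform to Node.
tree_schema = {
    "type": "object",
    "properties": {
        "nodes": {
            "type": "object",
            "additionalProperties": {"$ref": "#/Node"},
        }
    },
}


def check_node(node):
    """Stand-in for Node validation; assumes a string 'sequence_alignment' field."""
    return isinstance(node, dict) and isinstance(node.get("sequence_alignment"), str)


def invalid_node_keys(tree):
    """Return the dynamic keys under 'nodes' whose value is not a valid Node."""
    return [key for key, node in tree.get("nodes", {}).items() if not check_node(node)]


# Keys are free-form node identifiers, not names declared in the schema.
tree = {
    "nodes": {
        "n1": {"sequence_alignment": "ACGT"},
        "n2": {"sequence_alignment": "ACGA"},
        "root": {},
    }
}

print(invalid_node_keys(tree))  # ['root']
```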
