import AIRR Data Model into AKC LinkML #28
Hi @bcorrie, so when I mentioned I'd like you to work with LinkML, I was really thinking about this. We played with that
@schristley where is the
@bcorrie when I was playing around with it, I was just manually installing in the ak-schema docker. If I remember correctly, the pip install doesn't install all dependencies; there was one that was missing. I haven't put it in the ak-schema docker yet because I'm not sure whether it conflicts with the linkml stuff or not.
@bcorrie here it is, needed to also do
@schristley the above install downgrades urllib3 from 2.2.1 to 1.26.18. poetry.lock states that it requires urllib3 = ">=1.21.1,<3", so this should be fine. Should we add this to the docker file? I have a patch that adds the following to the end of the Dockerfile after the poetry update:
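For illustration only, a hypothetical sketch of what such a Dockerfile tail might look like; the actual patch content was not included above, and the `airr` package name is my assumption based on the surrounding discussion:

```dockerfile
# Hypothetical addition after the poetry update step.
# Installing the AIRR python library pulls in urllib3 1.26.x, which
# still satisfies poetry.lock's constraint urllib3 = ">=1.21.1,<3".
RUN pip install airr
```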
@bcorrie Walking through the objects in the AIRR Schema to understand what we can auto-generate into LinkML and what we cannot, here's my assessment of a few things. Can you please review?
So based upon this, my thought is to do a quick "bootstrap" conversion of a few of the AIRR objects into LinkML: Germline, Genotype, Rearrangements and (maybe) Cell? That is, we won't worry about automation right now. However, as part of #44, we will need to consider how to manage AIRR schema changes and come up with an automation mechanism.
I think my fundamental comment is that mapping any AIRR field to a LinkML slot definition is probably pretty straightforward. In all cases, it is the relationships between the LinkML classes that are going to be challenging. For example,
Yes, I think so. Like most of the objects below, the field-to-slot mapping between AIRR and AKC is pretty straightforward; it is the relationships that are messy. The question we were discussing around this is whether there is some sort of automated tool that we might be able to use to help. The complicated part is going to be mapping the relationships (the de-normalization and re-structuring), and I am not aware of anything that would help with this; if you know of anything, let me know. The reason this is hard is that these objects are at the core of the AKC CDM AND the relationships in the AIRR Standard do not map particularly well to the AKC CDM. We can think of things like

As we consider complex use cases, I would not be surprised if the requirements for complex relationships blow up. That is why I advocate for keeping the relationships in the AKC CDM as simple and basic as possible, with the anticipation that specific use cases are going to need much more complicated "knowledge graphs" overlaid on top of this simple set of relationships. If we try to capture all relationships for all things in the AKC CDM we will literally go insane 8-) [Stuff Deleted]
As I mentioned, I was thinking of using the AIRRMap python class (https://github.com/sfu-ireceptor/dataloading-mongo/blob/master/dataload/airr_map.py) from the iReceptor data loader combined with the AIRR Spec Flatten tool in the iReceptor sandbox (https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten) as a first attempt at this. I think it should be pretty easy to combine these to traverse any AIRR JSON object (e.g. Subject, Genotype, Rearrangement, ...) and generate a bunch of LinkML slots based on the AIRR definition. For me the question is: are we talking about generating LinkML definitions for these objects, or actually generating LinkML-compliant data from the equivalent definitions? I think generating LinkML definitions for slots should be pretty simple.
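A minimal sketch of that slot-generation idea, assuming a flattened AIRR `properties` dict as input. The function and the type mapping here are illustrative, not the actual airr-spec-flatten code:

```python
# Sketch: map an AIRR object's JSON-schema 'properties' to LinkML slot
# definitions, ignoring relationships and nested objects entirely.

# JSON-schema primitive type -> LinkML range (assumed mapping).
AIRR_TO_LINKML_RANGE = {
    "string": "string",
    "integer": "integer",
    "number": "float",
    "boolean": "boolean",
}

def airr_properties_to_slots(properties):
    """Generate a LinkML-style slot dict per AIRR field."""
    slots = {}
    for field, spec in properties.items():
        slots[field] = {
            "description": spec.get("description", ""),
            "range": AIRR_TO_LINKML_RANGE.get(spec.get("type"), "string"),
        }
    return slots

# A fragment loosely resembling the AIRR Subject object.
subject_properties = {
    "subject_id": {"type": "string", "description": "Subject ID assigned by submitter"},
    "age_min": {"type": "number", "description": "Minimum age of the subject"},
}

for name, slot in airr_properties_to_slots(subject_properties).items():
    print(name, slot)
```

The real tool would read the spec from the installed AIRR library rather than an inline dict, but the field-to-slot translation is this mechanical.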
I would argue that, like all of the other AIRR schema objects, converting the fields for

As I say above, the reason this is "complicated" is that these objects have complex relationships not only within the AIRR Standard but across the other repositories as well (IEDB, iRAD). But it is the relationships we don't understand; I think we understand the actual field names pretty well.
Yes, I agree, thanks; I should be more clear. Let's not worry about trying to bring the relationships forward, just the slots and the classes (where a class is the same as the AIRR JSON object).
Again, the relationships between these objects and other objects in the AKC CDM may be less well defined, but mapping the fields, I would suggest, is pretty straightforward.
The point is well taken. With a "data model", we have well-defined relationships, often translated into a static database design/schema, which then constrains how queries are done. A "knowledge model" needs to be more flexible to handle the complex use cases. LinkML is for data models, so we don't want to overload it and try to make it do too much. So in essence I'm agreeing with you: we'll keep the relationships in the CDM to the simple, basic and "obvious" ones.
I understand this as doing the actual data integration, versus the schema. I agree that a mapping approach like this should work very well for us.
The approach we have taken with the AIRR Config file and the use of AIRR Flatten should go a fairly long way toward making this work. We can essentially configure an iReceptor Turnkey repository or an iReceptor Gateway to support different versions of the AIRR Standard just by changing the AIRR Config file.
There are of course always special cases (e.g. a string changed to an Ontology term) which can't be handled by a mapping, but this has gotten us a long way to making schema changes "relatively painless".
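The mapping idea could be sketched as a per-field override table applied on top of the auto-generated defaults. The names and the override format here are hypothetical, not the actual AIRR Config file format:

```python
# Hypothetical per-field overrides for the generated LinkML slot.
# Fields not listed keep their auto-generated defaults.
FIELD_OVERRIDES = {
    # special case: a plain string promoted to an ontology term
    "organism": {"range": "OntologyTerm"},
}

def apply_overrides(field_name, default_slot):
    """Merge any per-field overrides into the default slot definition."""
    slot = dict(default_slot)
    slot.update(FIELD_OVERRIDES.get(field_name, {}))
    return slot

default_slot = {"range": "string", "description": "Species of the subject"}
print(apply_overrides("organism", default_slot))    # range becomes OntologyTerm
print(apply_overrides("subject_id", default_slot))  # unchanged
```

The special cases that a pure mapping can't express (structural changes, de-normalization) would still need hand-written code, as noted above.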
Just the definitions, the slots and the classes.
@bcorrie Great! Can I give you the task to take an initial stab at writing this?
Yep, no problem...
Initial version implemented. Code is here: https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/scripts/airr2akc

Initial exported schemas (a subset, although a pretty decent subset): https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/ak_schema/schema/airr

I essentially reused https://github.com/sfu-ireceptor/sandbox/tree/master/airr-spec-flatten and mostly just changed the output generation. I disabled recursion as well, so it doesn't process objects within objects. It isn't handling arrays correctly (although I'm not sure how we do that in LinkML). It also needs a mapping step so we have control when a field's attributes (e.g. name, type, range) are not the defaults that would be generated from the AIRR spec. I should be able to reuse the iReceptor data loader's AIRR Map capability to implement that pretty easily.
It looks like we should be able to use this to generate the Enums for fields as well; you will notice in the export I have created LinkML fields that capture either the AIRR ontology root node (for ontologies) or the enum values (for controlled-vocabulary fields). See the Ontology and Enum fields in the subject export: https://github.com/airr-knowledge/ak-schema/blob/airr-export/src/ak_schema/schema/airr/ak_airr_subject.yaml
I have changed the code so you can ask it to generate either the LinkML slots or the LinkML enums for the AIRR Schema object of choice. For Ontology terms it just outputs the expected root node of the enum; we would still need a way to generate all of the child nodes for that enum. For example, for
If you ask for LinkML slots, it generates this, referring to the correct Enums above in the
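For context, a generated slot that refers to such an enum might look roughly like this in LinkML YAML. This is an illustrative sketch, not the actual tool output:

```yaml
# Hypothetical LinkML fragment: a controlled-vocabulary field mapped
# to a slot whose range is a generated enum.
enums:
  SexEnum:
    permissible_values:
      male: {}
      female: {}
slots:
  sex:
    description: Biological sex of the subject
    range: SexEnum
```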
Files generated for most (all?) AIRR schema objects of importance to AKC here: https://github.com/airr-knowledge/ak-schema/tree/airr-export/src/ak_schema/schema/airr

Note some enum files are empty because there are no enums/ontologies in that particular class.
@schristley I think the following x-airr attributes are relevant:
Is there anything else that we need to worry about? My conversion tool now takes these into account and generates them as part of the AIRR field LinkML specification.
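As a sketch, the x-airr handling might look like this. The attribute names listed (`miairr`, `identifier`, `nullable`, `deprecated`) are my assumption about which x-airr properties matter; the actual list discussed above was lost in extraction:

```python
# x-airr attributes to propagate into the generated LinkML slot
# (assumed set -- adjust to whatever the conversion actually needs).
XAIRR_KEYS = ("miairr", "identifier", "nullable", "deprecated")

def extract_xairr(field_spec):
    """Return the relevant x-airr attributes present on this field."""
    xairr = field_spec.get("x-airr", {})
    return {key: xairr[key] for key in XAIRR_KEYS if key in xairr}

# Example field spec with an x-airr extension block (values invented).
field_spec = {
    "type": "string",
    "x-airr": {"miairr": "essential", "identifier": True, "other": 1},
}
print(extract_xairr(field_spec))
```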
I have created a separate issue, #54, for figuring out what, if anything, from our LinkML experience should be moved back into the AIRR Standard.
I think we can mark this as Done? |
@schristley any objections to closing this issue? |
Lonneke is going to continue work on this. |
@LonnekeScheffer I am doing a refactor/cleanup of the code. I reused code from an iReceptor tool we had, and some of the old, no-longer-used code had not yet been deleted. I should have this done by the end of the week...
@LonnekeScheffer clean-up done on the airr-export branch (https://github.com/airr-knowledge/ak-schema/tree/airr-export)
@bcorrie would it make sense for me to start working on (a subbranch of) your airr-export branch in that case? |
Hi both, I've been playing around a bit with the airr2akc.py script, for now branching off of Brian's airr-export branch. Just wanted to check with you: I see that the generated files under "src/ak_schema/schema/airr" that are stored on this (and the master) branch are not the same as the files generated by airr2akc.py when running Makefile.AIRR; I have for now presumed the files in the folder are an 'old format' whereas the script generates the 'new format'. I also seem to notice a few bugs in the script output:
I understand the first bug must have been introduced at the moment it was decided to add the new 'enums' base level. While it can be fixed easily, I do think the current code, with its explicit hardcoded print statements with fixed spaces inside highly nested loops/if statements, is prone to such errors in the future. In immuneML we used YAML input/output a lot: you can build nested dictionaries/lists and export them directly using yaml.dump(). No need to keep track of the number of spaces, only of how to nest the dictionaries and lists. This furthermore ensures the output file is valid YAML, which I think will greatly improve the readability and maintainability of the code. If it's ok with you, I can rewrite (or make a second version of) this script according to these suggestions. I think it'll also help me generally with 'getting into' the project and familiarizing myself with this code.
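The yaml.dump approach suggested above could look like this. A minimal sketch using PyYAML; the enum content is invented for illustration:

```python
import yaml  # PyYAML, as used in immuneML

# Build the LinkML fragment as nested dicts/lists rather than printing
# hand-indented lines; yaml.dump then guarantees valid YAML output.
enum_block = {
    "enums": {
        "SexEnum": {
            "permissible_values": {"male": {}, "female": {}},
        }
    }
}

text = yaml.dump(enum_block, sort_keys=False, default_flow_style=False)
print(text)
```

Because the round trip goes through a data structure, indentation bugs like the one described become impossible: `yaml.safe_load(text)` always returns the original dict.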
I agree, that sounds like a good place to start.
@bcorrie Are you using this branch for your Repertoire conversion work? @LonnekeScheffer that's fine for now, though I had already merged this branch to main, so I'd prefer if you branch off of main. Shall I merge Brian's recent changes and then you can switch over?
Yeah, that seemed odd to me when I saw LinkML YAML with that.
Would be great if you could merge it into main then, Scott! I just want to make sure I'm working with the latest version of airr2akc.py before I rewrite it, and that I'm not getting in @bcorrie's way.
@LonnekeScheffer a couple of comments:
No problem. Once Scott merges to master, you can go ahead and create a branch and the code is all yours. I am going to be working on akc_convert, so we shouldn't conflict, but I will continue to do that on the airr-export branch.
@LonnekeScheffer ok, merged, you should be good to go! I've added some TODO items in the first comment so you can think more strategically when planning code changes.
Thanks Scott! Finished the "writing output with the yaml library" part. Where can I find the AIRR schema input files?
The airr-standards repository; here is the v1.5 schema
FYI, the use of the AIRR Schema from the installed library was intentional, in that it is simple to change generation by running a different docker container. Presumably, when dealing with the AIRR schema, there are benefits to using the AIRR library. I have no strong objection to removing the tight coupling between the generated schema and the AIRR python version (basically, provide the schema file as input), but there may be some benefits lost in doing so (not that I can think of any huge ones off the top of my head).
Yes, I understand, but I don't really want to deal with multiple docker images, and this will explode as we deal with schema changes/versioning from the other repositories (if I need an image for each version combination, plus the complexity of running them all and then merging the results into a single LinkML schema). So I prefer that we are able to generate everything from a single docker. We will use git submodules.
@LonnekeScheffer are you familiar with git submodule?
No worries, just wanted to point out the rationale. On the iReceptor side we only have one standard to deal with, so this isn't a big issue; we have one container per AIRR Standard release...
I haven't used git submodule before, but I'll read up on it! So, having an "AIRR versioned" LinkML schema, would this mean one can specify different AIRR .yaml input files (for v1.5 or v2.0), and it will always generate the same LinkML output yaml? Or would there be anything different about these LinkML YAMLs for different AIRR versions?
Ok, that's good. I'll do the initial setup then...
The LinkML output would be different, to reflect the differences between the AIRR schema versions. By "AIRR versioned" LinkML schema, I mean the ability for (say) the v1.5 Repertoire and the v2.0 Repertoire definitions to co-exist in the schema. LinkML has been discussing namespace support, which would be a perfect use case for this, I think. With both versions available, we could perform various inferences and algorithms based upon the changes/differences. This goes towards Aim 2.6 of the grant. Though I'm saying "AIRR schema", technically the AIRR schema is written in OpenAPI3, which is a superset of JSON schema. So even though we are currently using it for AIRR, we will also use it for the repositories that provide an OpenAPI3 service, which currently is all repositories except IEDB.
Deleted my previous comment because most of it was raising an issue I resolved in the meantime. Current status of airr2akc conversion:
AKC will extend and integrate many of the classes/objects in the AIRR Data Model. LinkML has an importer for JSON schema. We want to automate the import/translation process so that we can run it when the AIRR Data Model changes.
- Map `x-airr` properties to LinkML. For example, `identifier` is related to LinkML's `identifier`.
- `x-airr` properties need to be added to AIRR Standards for LinkML. #54
- `keywords_study` in the `Study` object does not generate a LinkML enum.