Update main pipeline output to produce usable GPAD/GPI 2.0 #2043
Comments
Location of @sierra-moxon's files |
first pass of merged GPAD (all noctua MGI annotations from current.geneontology.org + all preprocessed/upstream annotations produced in this mini-pipeline in a GPAD 2.0 file): I already see two issues: taxon isn't coming through the conversion for some of the rows, and some of the rows were labeled as provided_by MGI -- I think both of these issues come from the GAF->GPAD step and not as a result of the underlying GAF generation, but I am confirming. |
I am seeing !gpa-version: 1.2 at the end of the merged_gpad_11_08_2023.txt. thanks. |
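A stray version header like the one reported above can be caught with a quick scan before the file goes out for review. This is only a sketch: it assumes that header lines start with `!` and that every header belongs before the first data row. The function name `misplaced_headers` and the sample rows are made up for illustration.

```python
from typing import Iterable, List, Tuple


def misplaced_headers(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for '!' header lines found after the first data row."""
    seen_data = False
    found = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("!"):
            if seen_data:
                found.append((number, line))
        else:
            seen_data = True
    return found


# Illustrative input: a 2.0 header, one data row, then a leftover 1.2 header.
sample = [
    "!gpa-version: 2.0\n",
    "MGI:MGI:108212\t\tRO:0002331\tGO:0042327\n",
    "!gpa-version: 1.2\n",
]
print(misplaced_headers(sample))
```

Running the same check on an open file handle works too, since the function only needs an iterable of lines.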
I am looking at the errors that Lori's load threw:
Today I will also just do a sanity check on @sierra-moxon's file. |
Notes from yesterday's group call.
|
Hi @sierra-moxon. I have a couple of questions to reassure myself that I didn't just think things without actually putting them into the requirements.
|
After a bit of investigation:
|
@leemdi and @sierra-moxon |
@ukemi Could the nature of this immunoglobulin UniProt be causing its own specific problem? UniProt has separate instances for the constant region and the variable region of what occurs in the body, mouse or human, as a single polypeptide encoded by a gene that is not present in the germline but created somatically. The annotation is thus a hack in two ways (both unavoidable, as far as I can tell). First, it represents the full length immunoglobulin chain as a complex of a UniProt C protein and a separate UniProt V protein. Second, that V protein is an arbitrarily chosen single instance because there's no way to represent the diversity of possible V regions in this annotation. |
Yep! I think you are spot on and this may be the case with the 54 annotations in that report! When I look at the report now that you point this out, all the genes are things like this (immunoglobulin regions, histones). So the one above might be a red herring wrt why all the annotations are failing our load. I also think that we are filtering on our end because we don't have Reactome reactions/pathways as references in MGI. It was in that report that I first detected this. eg: Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-1008243 |
when we process the GOA/Mouse, we save all of the Reactome rows (for which we do not have the reference in MGI) to a goamouse.gaf file. Then we append the goamouse.gaf to the end of our mgi/gaf. |
@sierra-moxon |
Looks like the ones in the 1.2 format, which I am skipping, are the Noctua ones. I think you have mentioned this earlier. example: Shh GO:0000122 |
@sierra-moxon and @leemdi Looks like the mapping needs updating on the EMAPA/UBERON end. I have emailed Terry about this and sent her the list. Terry says she will look at the list and open tickets for new mappings at UBERON. Since she has a much better background in all things anatomy than I do, this is a good plan. |
The lines with multiple entries in field 7 contain '"'. This line is OK: This line has '"' in field 7: Is the '"' surrounding the multiple UniProtKB terms expected, or is this something you want to fix? |
Hi @leemdi, I believe comma and pipe separated values in the 'with' field are allowed. |
@ukemi @sierra-moxon |
Sorry! I missed that point. I don't think there should be a ". |
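For anyone loading the interim files before the quoting is fixed, a tolerant reader for the with/from field might look like the sketch below. It assumes the quotes only ever wrap the whole field value, and that pipes separate groups while commas separate IDs within a group. `split_with_field` is a hypothetical helper, not part of ontobio or the MGI loader, and the IDs in the example are borrowed from elsewhere in this thread purely for illustration.

```python
from typing import List


def split_with_field(value: str) -> List[List[str]]:
    """Split a with/from value into pipe-separated groups of comma-separated IDs.

    Double quotes wrapping the whole value are dropped before splitting.
    """
    cleaned = value.strip().strip('"')
    if not cleaned:
        return []
    groups = []
    for group in cleaned.split("|"):
        ids = [item.strip() for item in group.split(",") if item.strip()]
        if ids:
            groups.append(ids)
    return groups


# A quoted, comma-separated value and an unquoted single value.
print(split_with_field('"UniProtKB:Q14186,UniProtKB:P39748"'))
print(split_with_field("RGD:620975"))
```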
yes, I can do that, but I don't have it done yet.
Yes, I include IEA annotations from GOA, but not from the orthology transformation loads (ISO loads) |
definitely I want to fix this! :) thank you for finding it. |
gotcha - I am updating. |
@sierra-moxon 133 annotations examples:
GO:0000083 MGI:101934 J:164563 ISO UniProtKB:Q14186 GO_MGI 2023-03-22 MGI go_qualifier_id&=&RO:0002331&==&go_qualifier_term&=&involved_in&==&evidence&=&ECO:0000266
GO:0090305 MGI:102779 J:164563 ISO UniProtKB:P39748 GO_MGI 2016-09-14 MGI go_qualifier_id&=&RO:0002331&==&go_qualifier_term&=&involved_in&==&evidence&=&ECO:0000266 |
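The last column in these examples packs several key/value pairs into one string. Judging only from the two rows shown, `&==&` appears to separate pairs and `&=&` appears to separate a key from its value; that reading is an assumption, and `parse_mgi_properties` is a hypothetical name used here for illustration.

```python
from typing import Dict


def parse_mgi_properties(text: str) -> Dict[str, str]:
    """Split 'key&=&value&==&key&=&value...' into a dict (delimiters inferred from the examples)."""
    pairs = {}
    for chunk in text.split("&==&"):
        key, separator, value = chunk.partition("&=&")
        if separator:  # skip chunks that carry no key/value delimiter
            pairs[key] = value
    return pairs


example = (
    "go_qualifier_id&=&RO:0002331&==&go_qualifier_term&=&involved_in"
    "&==&evidence&=&ECO:0000266"
)
print(parse_mgi_properties(example))
```

Unpacking the string this way makes it easy to compare the qualifier and evidence code of a row against what the GPAD 2.0 output carries for the same annotation.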
absolutely! I will rerun with new files. |
@sierra-moxon |
in field 11/properties I am used to seeing things like this: now I am seeing this:
2018-07-23 GO_Central RO:0002233(RNAcentral:URS000075DA6B_10090),RO:0002233(RNAcentral:URS000075A5B2_10090),RO:0002233(RNAcentral:URS000075E1B6_10090),BFO:0000066(UBERON:0002107)
David, so, I assume that I need to map the RO or BFO to terms? But not sure what to do with the rest of the info in this new Property.
David: I have found a couple of duplicates in MGI/GO Property vocabulary.
occurs_at | BFO:0000066
happens_during | RO:0002092
has_input | RO:0002233
has_target_end_location | RO:0002339
exists_during | RO:0002491 |
@sierra-moxon
MGI:MGI:108212 RO:0002331 GO:0042327 PMID:21052097 ECO:0000315 2014-07-25 GO_Central RO:0002092(GO:0060546)|RO:0002233(UniProtKB:Q63844)
MGI:MGI:108212 RO:0002331 GO:0070301 PMID:27258785 ECO:0000315 2018-11-27 GO_Central "BFO:0000066(CL:0000746),BFO:0000050(GO:0070301)" |
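One way to handle extension strings like the two above is to split them into `relation(target)` units and look up the relation ID. The sketch below uses only the five ID/label pairs listed earlier in this thread, and any other relation ID (such as BFO:0000050) is passed through unchanged. It assumes pipes separate alternative groups, commas join units within a group, and stray double quotes can be dropped. `parse_extensions` is a hypothetical helper, not the loader's actual code.

```python
import re
from typing import List, Tuple

# Relation IDs and labels exactly as listed in this thread.
RELATION_LABELS = {
    "BFO:0000066": "occurs_at",
    "RO:0002092": "happens_during",
    "RO:0002233": "has_input",
    "RO:0002339": "has_target_end_location",
    "RO:0002491": "exists_during",
}

# One unit looks like RELATION_ID(TARGET_ID).
UNIT = re.compile(r"^([^()\s]+)\(([^()]+)\)$")


def parse_extensions(field: str) -> List[List[Tuple[str, str]]]:
    """Return a list of groups, each a list of (relation label or ID, target ID)."""
    cleaned = field.strip().strip('"')
    if not cleaned:
        return []
    groups = []
    for group in cleaned.split("|"):
        units = []
        for unit in group.split(","):
            match = UNIT.match(unit.strip())
            if match is None:
                raise ValueError(f"unrecognised extension unit: {unit!r}")
            relation, target = match.groups()
            # Unknown relation IDs are kept as-is rather than guessed.
            units.append((RELATION_LABELS.get(relation, relation), target))
        groups.append(units)
    return groups


print(parse_extensions("RO:0002092(GO:0060546)|RO:0002233(UniProtKB:Q63844)"))
print(parse_extensions('"BFO:0000066(CL:0000746),BFO:0000050(GO:0070301)"'))
```

Raising on a unit that does not match keeps malformed rows visible instead of silently loading partial extensions.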
@sierra-moxon In MGI, for the human to mouse and rat to mouse GO annotations we use the date of data loaded via orthology. |
@LiNiMGI - the original annotation in GOA for the first IKR example is:
it has an I think my script is behaving as designed, so I guess the next question is: if GO_Central provides this, we need to figure out where? (maybe it's in a different form or to a different ID or included in some other ingest? - asking around on my side - do you have any insight?) |
How many annotations are we talking here? |
only 5 annotations, just wondering why they were not in the file... |
@kltm @LiNiMGI - easy enough to skip. What worries me about the IKR issue is that the GOA annotation records the provider as "GO_Central", yet the final GPAD and GAF produced by this test pipeline are missing these annotations. Tracing the annotation, it seems like it's something like this: So where is the original location of this annotation? (e.g. noctua, some sort of external ingest to noctua, etc.) |
@sierra-moxon |
Here are some isoform annotations from the mouse Noctua GPAD: PR Q60636-2 located_in GO:0005737 PMID:18845144 ECO:0000314 20101013 MGI contributor=https://orcid.org/0000-0003-2689-5511|model-state=production|noctua-model-id=gomodel:MGI_MGI_99655 |
Some relations in the Noctua GPAD are not being converted from labels to text correctly geneontology/go-annotation#5092 |
will the new mgi.gpad, mgi.gaf files still be here: |
Once we are completely done with the project down the line, the files will be appearing at that location. (The test files are, naturally, elsewhere, in the interim.) |
This is not blocking for the current project (MGI remainders) |
@kltm official: snapshot: |
@leemdi Both pipelines are in a bit of a state right now, but we have What you are using for any specific use case will be up to you. |
I don't really know what use case I would use to make this decision. From my perspective, I should use "current". @LiNiMGI @ukemi , what do you suggest that we use going forward? "snapshot" or "current"? |
Thanks!!! @sierra-moxon I will do more tests later today and tomorrow. |
Please see: geneontology/gopreprocess#58 for the remaining issues. Noting that the human and rat orthology loads use the same code that sets the provided_by to "GO_Central" in the preprocessing pipeline. And, when I look at the GAF file for the human and rat outputs of the preprocessing pipeline here: http://skyhook.berkeleybop.org/silver-issue-325-gopreprocess/products/upstream_and_raw_data/preprocessed_GAF_output/mgi-human-ortho.gaf and http://skyhook.berkeleybop.org/silver-issue-325-gopreprocess/products/upstream_and_raw_data/preprocessed_GAF_output/mgi-rgd-ortho.gaf they both show ONLY GO_Central as the provider (as expected). |
here's an example: one row has "MGI", one row has "GO_Central". MGI:MGI:99961 RO:0002327 GO:0003700 GO_REF:0000096 ECO:0000266 RGD:620975 2024-03-18 MGI |
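Pairs like this can be found across the whole file by grouping rows on every column except the provider. The sketch below assumes a tab-separated GPAD 2.0 layout with the provider (assigned_by) in column 10. The second sample row is a reconstruction of the "GO_Central" twin described above, not a line copied from the file, and `provider_conflicts` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

ASSIGNED_BY = 9  # zero-based index of column 10; assumed GPAD 2.0 layout


def provider_conflicts(rows: Iterable[str]) -> Dict[Tuple[str, ...], Set[str]]:
    """Map each annotation (all columns except the provider) to its providers,
    keeping only annotations that appear with more than one provider."""
    providers = defaultdict(set)
    for row in rows:
        if row.startswith("!"):
            continue  # header line
        cols = row.rstrip("\n").split("\t")
        if len(cols) <= ASSIGNED_BY:
            continue  # too short to carry a provider column
        key = tuple(cols[:ASSIGNED_BY] + cols[ASSIGNED_BY + 1:])
        providers[key].add(cols[ASSIGNED_BY])
    return {key: names for key, names in providers.items() if len(names) > 1}


def make_row(provider: str) -> str:
    # Columns: id, negation, relation, GO term, reference, evidence,
    # with/from, taxon, date, assigned_by, extensions, properties.
    return "\t".join([
        "MGI:MGI:99961", "", "RO:0002327", "GO:0003700", "GO_REF:0000096",
        "ECO:0000266", "RGD:620975", "", "2024-03-18", provider, "", "",
    ])


sample = ["!gpa-version: 2.0", make_row("MGI"), make_row("GO_Central")]
for key, names in provider_conflicts(sample).items():
    print(key[0], key[3], sorted(names))
```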
right - I am tracking here: geneontology/gopreprocess#58 |
noctua issue tracking here: geneontology/gopreprocess#59 When I run the validate.py produce command in ontobio locally, I do not see the duplicate noctua annotations being generated in the GPAD file:
there must be another step in the pipeline... |
Next round of files are out for review: http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/mgi.gpad.gz |
thanks, Sierra. Processed on MGi-Scrum, and sent to Li for review |
The annotation date for the annotations below should not be changed to the loading date; we should keep the original annotation date: I think these are all from the GOA mouse file. We should only change the date for the rat-to-mouse and human-to-mouse annotations. |
tracking date change issue for protein to GO files here: geneontology/gopreprocess#58 |
noting that the provided_by and noctua duplicates issues have been resolved here: geneontology/gopreprocess#53 and geneontology/gopreprocess#59 respectively |
@sierra-moxon If all tasks here are done or moved, can we then close this ticket? |
yes, I think so. |
This was created from a conversation between @ukemi and @sierra-moxon, making explicit an implicit task.
geneontology/gopreprocess#9