coordination and curation of studies in the ADC #431
We try to do everything at once (not have metadata loaded without rearrangements) as an iReceptor repository policy. Personally, I think it is confusing to have repertoire metadata without rearrangements - in particular from the iReceptor Gateway perspective. Our gateway searches AIRR-seq data, and it ain't AIRR-seq data without rearrangements 8-) I think it is potentially even more confusing to the user/consumer when data "trickles in" if data is slowly loaded for a study. From a data provenance perspective that seems bad to me... It seems the most consistent for a study and its data to "come on line" all at once. Now whether we want to make that a policy or not, I am not sure, as it seems a bit draconian to me... For example, we recently had our latest COVID-19 paper added to the AIRR COVID-19 repository. The paper was released as a pre-print and it stated the data was on line in iReceptor, but we didn't have the data loaded yet. So we broke our own policy, as we didn't want people to go to the iReceptor Gateway and not find the study (even though there were no rearrangements). I think something like that is OK, as long as it doesn't sit that way for a long period of time. We were in the process of loading the rearrangements and we enabled the study with the rearrangements a couple of days later... I think having repertoire metadata with no rearrangements for a long period of time is probably bad. How we create a recommendation and policy around that I am not too sure... |
Yes, this is cumbersome, and we spend a fair bit of time keeping this up to date. It would be nice if we had a better mechanism... |
Yeah, me too, and have a dozen studies in various phases of being processed... |
Yep, that's exactly the type of benefit I was thinking of... |
Right, but it seems you don't have control over that, as repositories will be doing whatever. Then your only choice from a Gateway perspective is either turn off the whole repository, or live with the "partial-ness".
I agree, and I feel that could have serious ramifications: somebody downloads partial data, does the analysis, publishes, etc., without ever realizing they didn't have the full data. In this case, I feel it's important from a scientific perspective to avoid this. We may not be able to enforce it, but it's worth putting in as a recommendation so that data repositories understand why it is a "bad thing".
The easiest solution to me seems to be adding a flag into the repertoire. This would allow clients like the Gateway to include or exclude those repertoires, have a setting the user can toggle to include or not, have a warning message "study in process of being loaded", or something like that. It won't handle all scenarios but can cover a lot. |
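A minimal sketch of how a client such as the Gateway might act on a flag like this. It assumes a hypothetical boolean field, `study_loading_in_progress`, inside each repertoire's `study` object; that name is invented for illustration and is not part of the AIRR schema.

```python
from typing import Any

# Hypothetical flag name, invented for this sketch; NOT part of the AIRR schema.
LOADING_FLAG = "study_loading_in_progress"


def split_by_loading_flag(
    repertoires: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split repertoires into (complete, still_loading) using the hypothetical flag."""
    complete, loading = [], []
    for rep in repertoires:
        study = rep.get("study", {})
        # A missing flag is treated as "fully loaded".
        if study.get(LOADING_FLAG, False):
            loading.append(rep)
        else:
            complete.append(rep)
    return complete, loading


# Toy repertoire metadata; the IDs are made up.
repertoires = [
    {"repertoire_id": "r1", "study": {"study_id": "S1"}},
    {"repertoire_id": "r2", "study": {"study_id": "S2", LOADING_FLAG: True}},
]

complete, loading = split_by_loading_flag(repertoires)
for rep in loading:
    print(f"Warning: study {rep['study']['study_id']} is in process of being loaded")
```

A client could hide the `loading` list by default and expose a user setting to include it, which covers the include/exclude and warning-message cases above.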
Yes, I agree, a strong recommendation perhaps, stressing scientific reproducibility and data provenance as the drivers for this recommendation. |
Not sure about this one. It relies on adherence to the recommendation, just like not having partial data does 8-) And it might not help that much, because you simply raise the level at which there is confusion. You might have a study that is incomplete with only 1 out of 10 repertoires loaded, but there is no way of knowing that either. So you have the same problem. I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it. |
@schristley this doesn't seem to be required for a v1.4 release, and I won't have time to get to it in the next few days, do you? If not, should we remove it from the v1.4 milestone? |
@schristley I suggest adding this to the documentation for v2.0 as a recommendation for data stewards/data curators - although it can't really be enforced. |
I think we are starting to get to a solution that will allow some level of partialness. When this issue was opened it was mainly We have added study keywords like Now that we have agreed upon an extension capability for the API, let's use it to formalize some behaviors we'd like. |
@schristley I added some basic "Running a repository" docs here: |
I think in a way, the docs I just added kind of address this. Maybe we can make it a bit more explicit. I agree, it makes sense for a data steward to load "part of a study" if part of a study means
Later they might add Clone data. Or maybe Cell + Expression. We do this all the time. I think what we want to suggest is that it is not a good idea to be loading the data for one of the Schema objects (Rearrangement, Clone, Cell, Expression etc) a bit at a time. Basically, if you are loading Rearrangements for a Study then only make that data available in the ADC once all of the Rearrangements are loaded. |
I think we should come up with some docs now for the v2.0 release - maybe what I have now is sufficient? Maybe a bit more detail. We can then consider the extensions that you are talking about, which I think are beyond v2.0? |
We currently use the study keywords to indicate that there is data in the repository for that schema object for that study. So if a Repertoire in our repository has This is an interesting subtle difference you bring up, which makes me think the more detailed discussion is definitely v2.1. |
Made changes to docs to clarify loading schema objects in their entirety (e.g. Rearrangements) as opposed to loading a study in its entirety (which may include Rearrangement, Clone, Cell, CellExpression, etc). Basically we are saying it is OK to load Rearrangements, put a repository into production, then load Cells later. It is frowned on to partially load the data for a single schema object (e.g. Rearrangement) from a study on a production repository (e.g. having a study with only half the Rearrangements loaded in production). |
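The recommendation above can be sketched as a simple pre-production check. This is only an illustration: it assumes the data steward knows the expected record count per schema object for a study, and the function name and counts are hypothetical, not part of any ADC API.

```python
def publishable_objects(expected: dict[str, int], loaded: dict[str, int]) -> list[str]:
    """Return the schema objects that are fully loaded and so may be exposed in production.

    expected: records the steward intends to load per schema object (hypothetical bookkeeping).
    loaded:   records actually loaded so far per schema object.
    """
    return [
        obj
        for obj, n_expected in expected.items()
        if n_expected > 0 and loaded.get(obj, 0) == n_expected
    ]


# Rearrangements are complete; Cells are only partially loaded.
expected = {"Rearrangement": 1000, "Cell": 200}
loaded = {"Rearrangement": 1000, "Cell": 80}

print(publishable_objects(expected, loaded))
```

Here only Rearrangement would be made available; Cell data would stay out of production until all of it is loaded.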
Created #756 to capture new issue, closing this issue. |
* Update facet docs As per #617
* Removal/deprecation of is and not operators
* New release notes file for ADC API Added deprecation of is and not.
* Error codes, repository loading changes As per #431 and #487
* Add 408 and 413 errors
* Added 408 and 413 errors
* Add docs for AA/nt case discussion As per #528
* Update data loading recommendation
* Remove docs about deprecated not operator
* Update to array query docs.
* Typo fix
This has come up in a number of contexts. With ADC V1 out, there is more activity to curate historical studies and put them into data repositories. It would be good to coordinate so there isn't too much duplication of effort. Also, it might be helpful to have a common resource about curation questions, how to code things properly into MiAIRR, and so forth. Some things we might want to consider:
FAQ for common curation questions. There are a few of these scattered around in issues (e.g. "PBMC for cell_subset ontology" #242 and "pcr_target_locus VS cell_subset - general question" #413). There's also a start of some info in the docs. Shall we make this more formal?
How to report and handle curation errors. Each repository may or may not have a formal mechanism to ask questions about curation, how to get them fixed, who's responsible for fixing, and so forth. Some questions may involve ambiguity in the AIRR standards and lead us to document or make them more precise. Do we have a central place where curators and users can go?
How to signal that a repository is in the process of curating a study? There's the informal method of using the lists up on the b-t.cr forums, but one thing I've noticed is there is a size limit to posts on b-t.cr so these lists invariably will need to span multiple posts, and they will need to be re-adjusted all the time as new papers are added.
Related to the last point, curating a study has multiple steps: creating the AIRR study metadata is one, but the sequences also need to be processed and loaded, which can be more time-consuming. Do we consider it ok for repositories to put up repertoires even though the rearrangements are not available yet? Do we want to require that a study must appear "all at once" in a repository so partial data isn't queried? Introduce a flag to warn the user that a study is "in process of loading"? I can see a certain benefit for repertoire metadata to be available immediately.
other ideas?