coordination and curation of studies in the ADC #431

Closed
schristley opened this issue Jul 2, 2020 · 16 comments · Fixed by #752
Labels
ADC API V2 AIRR Data Commons API V2 documentation MiAIRR release discussion Needs discussion on whether to include in v2.0 release.
Milestone

Comments

@schristley
Member

This has come up in a number of contexts. With ADC V1 out, there is more activity to curate historical studies and put them into data repositories. It would be good to coordinate so there isn't too much duplication of effort. Also, it might be helpful to have a common resource about curation questions, how to code things properly into MiAIRR, and so forth. Some things we might want to consider:

  • FAQ for common curation questions. There are a few of these scattered around in issues (e.g. PBMC for cell_subset ontology #242, pcr_target_locus VS cell_subset - general question #413). There's also a start of some info in the docs. Shall we make this more formal?

  • How to report and handle curation errors. Each repository may or may not have a formal mechanism to ask questions about curation, how to get them fixed, who's responsible for fixing, and so forth. Some questions may involve ambiguity in the AIRR standards and lead us to document or make them more precise. Do we have a central place where curators and users can go?

  • How to signal that a repository is in the process of curating a study? There's the informal method of using the lists up on the b-t.cr forums, but one thing I've noticed is there is a size limit to posts on b-t.cr so these lists invariably will need to span multiple posts, and they will need to be re-adjusted all the time as new papers are added.

  • Related to the last point, curating a study has multiple steps: creating the AIRR study metadata is one, but the sequences also need to be processed and loaded, which can be more time-consuming. Do we consider it ok for repositories to put up repertoires even though the rearrangements are not available yet? Do we want to require that a study must appear "all at once" in a repository so partial data isn't queried? Or introduce a flag to warn the user that a study is "in process of loading"? I can see a certain benefit for repertoire metadata to be available immediately.

other ideas?

@schristley schristley added ADC API V1 AIRR Data Commons API V1 MiAIRR labels Jul 2, 2020
@bcorrie
Contributor

bcorrie commented Jul 3, 2020

  • Related to the last point, curating a study has multiple steps: creating the AIRR study metadata is one, but the sequences also need to be processed and loaded, which can be more time-consuming. Do we consider it ok for repositories to put up repertoires even though the rearrangements are not available yet? Do we want to require that a study must appear "all at once" in a repository so partial data isn't queried? Or introduce a flag to warn the user that a study is "in process of loading"? I can see a certain benefit for repertoire metadata to be available immediately.

We try to do everything at once (not have metadata loaded without rearrangements) as an iReceptor repository policy. Personally, I think it is confusing to have repertoire metadata without rearrangements - in particular from the iReceptor Gateway perspective. Our gateway searches AIRR-seq data, and it ain't AIRR-seq data without rearrangements 8-) I think it is potentially even more confusing to the user/consumer when data "trickles in" if data is slowly loaded for a study. From a data provenance perspective that seems bad to me... It seems the most consistent for a study and its data to "come on line" all at once.

Now whether we want to make that a policy or not, I am not sure, as it seems a bit draconian to me...

For example, we recently had our latest COVID-19 paper added to the AIRR COVID-19 repository. The paper was released as a pre-print and it stated the data was on line in iReceptor, but we didn't have the data loaded yet. So we broke our own policy as we didn't want people to go to the iReceptor Gateway and not find the study (even though there were no rearrangements). I think something like that is OK, as long as it doesn't sit that way for a long period of time. We were in the process of loading the rearrangements and we enabled the study with the rearrangements a couple of days later...

I think having repertoire metadata with no rearrangements for a long period of time is probably bad. How we create a recommendation and policy around that I am not too sure...

@bcorrie
Contributor

bcorrie commented Jul 3, 2020

  • How to signal that a repository is in the process of curating a study? There's the informal method of using the lists up on the b-t.cr forums, but one thing I've noticed is there is a size limit to posts on b-t.cr so these lists invariably will need to span multiple posts, and they will need to be re-adjusted all the time as new papers are added.

Yes, this is cumbersome, and we spend a fair bit of time keeping this up to date. It would be nice if we had a better mechanism...

@schristley
Member Author

Yes, this is cumbersome, and we spend a fair bit of time keeping this up to date. It would be nice if we had a better mechanism...

Yeah, me too, and have a dozen studies in various phases of being processed...

@schristley
Member Author

So we broke our own policy as we didn't want people to go to the iReceptor Gateway and not find the study

Yep, that's exactly the type of benefit I was thinking of...

@schristley
Member Author

Personally, I think it is confusing to have repertoire metadata without rearrangements - in particular from the iReceptor Gateway perspective.

Right, but it seems you don't have control over that, as repositories will be doing whatever. Then your only choice from a Gateway perspective is either turn off the whole repository, or live with the "partial-ness".

even more confusing to the user/consumer when data "trickles in" if data is slowly loaded for a study.

I agree, and I feel that could have serious ramifications: somebody downloading partial data, doing analysis, publishing, etc., without ever realizing they didn't have the full data. In this case, I feel it's important from a scientific perspective to avoid this. We may not be able to enforce it, but it's worth putting in as a recommendation so that data repositories understand why it is a "bad thing".

How we create a recommendation and policy around that I am not too sure...

The easiest solution to me seems to be adding a flag into the repertoire. This would allow clients like the Gateway to include or exclude those repertoires, have a setting the user can toggle to include or not, have a warning message "study in process of being loaded", or something like that. It won't handle all scenarios but can cover a lot.
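As a rough illustration of how a client could act on such a flag, here is a minimal Python sketch. The flag name `loading_in_progress` is purely hypothetical (the thread proposes a flag but does not name one), and the repertoires are assumed to be plain dictionaries that carry a `repertoire_id`.

```python
# Hypothetical flag name: the discussion proposes a flag but does not define it.
LOADING_FLAG = "loading_in_progress"


def split_by_loading_state(repertoires, include_partial=False):
    """Return (visible, warnings) for a list of repertoire dicts.

    Repertoires without the flag (or with it set to False) are always visible.
    Flagged repertoires are only visible when include_partial is True, and
    each one then produces a warning message for the user.
    """
    visible = []
    warnings = []
    for rep in repertoires:
        if rep.get(LOADING_FLAG, False):
            if include_partial:
                visible.append(rep)
                warnings.append(
                    f"{rep['repertoire_id']}: study in process of being loaded"
                )
        else:
            visible.append(rep)
    return visible, warnings
```

A gateway could expose `include_partial` as the user-facing toggle mentioned above, defaulting to excluding flagged repertoires.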

@bcorrie
Contributor

bcorrie commented Sep 10, 2020

Personally, I think it is confusing to have repertoire metadata without rearrangements - in particular from the iReceptor Gateway perspective.

Right, but it seems you don't have control over that, as repositories will be doing whatever. Then your only choice from a Gateway perspective is either turn off the whole repository, or live with the "partial-ness".

even more confusing to the user/consumer when data "trickles in" if data is slowly loaded for a study.

I agree, and I feel that could have serious ramifications: somebody downloading partial data, doing analysis, publishing, etc., without ever realizing they didn't have the full data. In this case, I feel it's important from a scientific perspective to avoid this. We may not be able to enforce it, but it's worth putting in as a recommendation so that data repositories understand why it is a "bad thing".

Yes, I agree, a strong recommendation perhaps, stressing scientific reproducibility and data provenance as the drivers for this recommendation.

@bcorrie
Contributor

bcorrie commented Sep 10, 2020

The easiest solution to me seems to be adding a flag into the repertoire. This would allow clients like the Gateway to include or exclude those repertoires, have a setting the user can toggle to include or not, have a warning message "study in process of being loaded", or something like that. It won't handle all scenarios but can cover a lot.

Not sure about this one. It relies on adherence to the recommendation, just like not having partial data does 8-)

And it might not help that much, because you simply raise the level at which there is confusion. You might have a study that is incomplete with only 1 out of 10 repertoires loaded, but there is no way of knowing that either. So you have the same problem.

I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it.

@schristley schristley added this to the AIRR v1.4.0 milestone Jan 17, 2022
@bcorrie
Contributor

bcorrie commented May 9, 2022

@schristley this doesn't seem to be required for a v1.4 release, and I won't have time to get to it in the next few days, do you? If not, should we remove it from the v1.4 milestone?

@bcorrie bcorrie removed this from the AIRR v1.4.0 milestone May 15, 2022
@scharch scharch added documentation ADC API V2 AIRR Data Commons API V2 release discussion Needs discussion on whether to include in v2.0 release. and removed ADC API V1 AIRR Data Commons API V1 labels Jul 10, 2023
@scharch scharch added this to the ADC V2 milestone Jul 10, 2023
@bcorrie
Contributor

bcorrie commented Feb 16, 2024

I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it.

@schristley I suggest adding this to the documentation for v2.0 as a recommendation for data stewards/data curators - although it can't really be enforced.

@schristley
Member Author

schristley commented Feb 16, 2024

I would lean toward recommending that an entire study be made "available" at once in its entirety. If a repository doesn't follow the recommendation then one can't do much about it.

@schristley I suggest adding this to the documentation for v2.0 as a recommendation for data stewards/data curators - although it can't really be enforced.

I think we are starting to get to a solution that will allow some level of partialness. When this issue was opened it was mainly Repertoire and Rearrangement, but now the notion of "entire study" has broken down with the new objects and API endpoints like Clone, etc.

We have added study keywords like contains_schema_rearrangement which can be used as a flag like I suggested above. The question is how to interpret those keywords. If you take them to mean "the authors generated rearrangement data as part of the study", then that has a different meaning from "the ADC repository has rearrangement data loaded for this study". If you think the former interpretation is what is meant, then my suggestion is we extend the Study object in the API with a field like adc_keywords (maybe you have a better suggestion?) to represent the latter interpretation.

Now that we have agreed upon an extension capability for the API, let's use it to formalize some behaviors we'd like.
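To make the two interpretations concrete, here is a small Python sketch that keeps them apart. It assumes repertoire metadata as a dictionary with a nested `study` object, that the study keywords live in a list field named `keywords_study`, and that the proposed extension is a list field named `adc_keywords`; the latter is only a suggestion in this thread and is not part of any released schema.

```python
REARRANGEMENT_KEYWORD = "contains_schema_rearrangement"


def study_generated(repertoire, keyword=REARRANGEMENT_KEYWORD):
    """Interpretation 1: the authors generated this kind of data in the study.

    Assumes the study keywords are stored in study.keywords_study.
    """
    study = repertoire.get("study") or {}
    return keyword in (study.get("keywords_study") or [])


def repository_has_loaded(repertoire, keyword=REARRANGEMENT_KEYWORD):
    """Interpretation 2: this repository has the data loaded for the study.

    Reads the hypothetical adc_keywords extension field proposed above.
    """
    study = repertoire.get("study") or {}
    return keyword in (study.get("adc_keywords") or [])
```

With both checks available, a client could distinguish "data exists but is not loaded here yet" from "data is queryable in this repository".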

bcorrie added a commit that referenced this issue Feb 16, 2024
@bcorrie
Contributor

bcorrie commented Feb 16, 2024

@schristley I added some basic "Running a repository" docs here:

ab2b8ce

@bcorrie
Contributor

bcorrie commented Feb 16, 2024

I think we are starting to get to a solution that will allow some level of partialness. When this issue was opened it was mainly Repertoire and Rearrangement, but now the notion of "entire study" has broken down with the new objects and API endpoints like Clone, etc.

I think in a way, the docs I just added kind of address this. Maybe we can make it a bit more explicit.

I agree, it makes sense for a data steward to load "part of a study" if part of a study means

  • Repertoire
  • Rearrangements

Later they might add Clone data. Or maybe Cell + Expression. We do this all the time.

I think what we want to suggest is that it is not a good idea to be loading the data for one of the Schema objects (Rearrangement, Clone, Cell, Expression etc) a bit at a time. Basically, if you are loading Rearrangements for a Study then only make that data available in the ADC once all of the Rearrangements are loaded.

@bcorrie
Contributor

bcorrie commented Feb 16, 2024

I think we should come up with some docs now for the v2.0 release - maybe what I have now is sufficient? Maybe a bit more detail.

We can then consider the extensions that you are talking about, which I think are beyond v2.0?

@bcorrie
Contributor

bcorrie commented Feb 16, 2024

We have added study keywords like contains_schema_rearrangement which can be used as a flag like I suggested above. The question is how to interpret those keywords. If you take them to mean "the authors generated rearrangement data as part of the study", then that has a different meaning from "the ADC repository has rearrangement data loaded for this study". If you think the former interpretation is what is meant, then my suggestion is we extend the Study object in the API with a field like adc_keywords (maybe you have a better suggestion?) to represent the latter interpretation.

We currently use the study keywords to indicate that there is data in the repository for that schema object for that study. So if a Repertoire in our repository has contains_schema_rearrangement set, then there will be Rearrangement data in the repository.

This is an interesting subtle difference you bring up, which makes me think the more detailed discussion is definitely v2.1.

@bcorrie
Contributor

bcorrie commented Feb 19, 2024

Made changes to docs to clarify loading schema objects in their entirety (e.g. Rearrangements) as opposed to loading a study in its entirety (which may include Rearrangement, Clone, Cell, CellExpression, etc).

Basically we are saying it is OK to load Rearrangements, put a repository into production, then load Cells later. It is frowned on to partially load the data for a single schema object (e.g. Rearrangement) from a study on a production repository (e.g. having a study with only half the Rearrangements loaded in production).
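One way a data steward might apply this recommendation is sketched below in Python. It is only an illustration under stated assumptions: the per-object expected and loaded record counts are hypothetical bookkeeping kept by the loader, not something the ADC API provides.

```python
def publishable_objects(expected_counts, loaded_counts):
    """Return the schema object names that are safe to expose in production.

    expected_counts and loaded_counts are hypothetical loader bookkeeping:
    dicts mapping a schema object name (e.g. "Rearrangement") to a record
    count for one study. An object qualifies only when every expected record
    is loaded, so a half-loaded object is never published, while a fully
    loaded object can go live before the other objects are started.
    """
    return sorted(
        name
        for name, expected in expected_counts.items()
        if expected > 0 and loaded_counts.get(name, 0) == expected
    )
```

For example, with all Rearrangements loaded and Cells only partly loaded, only Rearrangement would be returned, and Cell data would stay out of production until its load completes.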

@bcorrie
Contributor

bcorrie commented Feb 19, 2024

Created #756 to capture new issue, closing this issue.

@bcorrie bcorrie closed this as completed Feb 19, 2024
bcorrie added a commit that referenced this issue Feb 20, 2024
* Update facet docs

As per #617

* Removal/deprecation of is and not operators

* New release notes file for ADC API

Added deprecation of is and not.

* Error codes, repository loading changes

As per #431 and #487

* Add 408 and 413 errors

* Added 408 and 413 errors

* Add docs for AA/nt case discussion

As per #528

* Update data loading recommendation

* Remove docs about deprecated not operator

* Update to array query docs.

* Typo fix
@github-project-automation github-project-automation bot moved this to Done in ADC API Aug 28, 2024
3 participants