Enhance/rework update routines #566

Open
jsstevenson opened this issue Jan 22, 2025 · 0 comments

This was a pretty helpful discussion in the Google Doc, and I wanted to preserve some of the brainstorming/solution mining:

- Don’t throw away the whole database on import. Instead, update existing claims as needed, and mark claims as deprecated if they have been removed or if the source data has marked them as rejected/deprecated.
- Re-group these existing claims but keep track of which group they were assigned to before. Display a history of the assigned groups over time
- Regroup only when a claim has changed? Or is regrouping also necessary because the groupers themselves might assign a different group?
- Some known problems:
  - How do we do this in real time? Right now we basically take the DB offline when importing/grouping.
    - Probably two options here: 1) Do we change the importers/normalizers/groupers to be robust enough to run inline on the prod instance (wrap everything in a transaction? Will that use too much memory?), or 2) do we develop some sort of data format that the import pipeline can output and that can be loaded into the running instance?
  - How do we link existing claims to their entries in the source document? E.g. if it’s a TSV, there is often no unique identifier to reference them.
    - TSVs shouldn’t change over time, so there wouldn’t be a re-import.
    - Do a review of sources that actually change over time and confirm that there are unique identifiers for at least the genes and drugs.
    - Some source databases that change over time do have unique IDs for the interaction (e.g. CIViC with EIDs). If there isn’t one, how do we make sure we can identify an interaction claim and link it to the same interaction from the source?
  - Do we need to rethink how we import interaction claims? Currently we use a first_or_create, causing multiple source entries with the same gene-drug combination to be imported as one claim. E.g. should multiple CIViC EIDs with the same gene-drug combination be imported as one claim, even if they are a mix of support/does not support?
  - What should our unique identifier be for interaction claims and interaction groups?
    - James proposed some combination of grouped gene + grouped drug concept as a unique ID. Could maybe hash that along with the source name to create some sort of stable ID?
    - Do we want to use stable IDs from external sources when present, or rely on internal IDs for consistency?
      - (This only matters at the claim level.)
      - External IDs seem good if available. This is also a potentially useful UI element (link back to primary source).
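
To make the stable-ID and update-in-place ideas above a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `stable_claim_id` and `reconcile` names, the `Claim` fields, the choice of SHA-256 truncated to 16 hex characters, and the placeholder concept IDs are assumptions for illustration only, not DGIdb's actual schema or importer code.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def stable_claim_id(source: str, gene_concept: str, drug_concept: str,
                    external_id: Optional[str] = None) -> str:
    """Build a stable claim ID (hypothetical scheme, not an existing DGIdb function)."""
    src = source.strip().lower()
    # Prefer the source's own ID (e.g. a CIViC EID) when one is available.
    if external_id is not None:
        return f"{src}:{external_id.strip()}"
    # Otherwise hash source name + grouped gene + grouped drug concept.
    parts = [src, gene_concept.strip().lower(), drug_concept.strip().lower()]
    key = "\x1f".join(parts)  # unit separator, unlikely to occur in the parts
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


@dataclass
class Claim:
    claim_id: str
    payload: dict
    deprecated: bool = False


def reconcile(existing: dict, incoming: dict) -> dict:
    """Update claims in place rather than rebuilding everything.

    existing maps claim ID -> Claim; incoming maps claim ID -> payload dict.
    Claims missing from the new import are kept but flagged as deprecated.
    """
    result = {}
    for cid, claim in existing.items():
        if cid in incoming:
            # Still present in the source: refresh the payload.
            result[cid] = Claim(cid, incoming[cid], deprecated=False)
        else:
            # Gone from the source: keep the record, mark it deprecated.
            result[cid] = Claim(cid, claim.payload, deprecated=True)
    for cid, payload in incoming.items():
        if cid not in result:
            # New in this import.
            result[cid] = Claim(cid, payload)
    return result
```

For example, `stable_claim_id("ExampleSource", "gene:EXAMPLE1", "drug:EXAMPLE1")` yields the same ID on every import as long as the grouped concepts do not change. That is also the catch: if regrouping assigns a different concept, the hashed ID changes, which is one reason external IDs may be preferable at the claim level.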

Advantages:
- Persistent URLs/identifiers for DGIdb claims and groups
- Easier linking from external resources to DGIdb data
- Enable leaving comments/change suggestions (Aim 1.4)

Ideal outputs:
- Data dumps of the DGIdb database + normalizer DBs (for data provenance purposes), and data TSVs. These are currently generated manually, which is suboptimal.
- Diffs of input data (e.g. for summarization in release notes)
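
A diff of input data could be as simple as comparing two snapshots keyed by a stable identifier. The sketch below assumes each snapshot is a dict mapping a claim ID to its record; the `diff_snapshots` name and the record shape are hypothetical and only meant to show what a release-notes summary might be built from.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two snapshots keyed by stable ID (hypothetical helper)."""
    old_ids, new_ids = set(old), set(new)
    return {
        # IDs only present in the new import.
        "added": sorted(new_ids - old_ids),
        # IDs that disappeared from the source.
        "removed": sorted(old_ids - new_ids),
        # IDs present in both whose record content differs.
        "changed": sorted(k for k in old_ids & new_ids if old[k] != new[k]),
    }


def summarize(diff: dict) -> str:
    """Render a one-line summary suitable for release notes."""
    return ", ".join(f"{len(ids)} {kind}" for kind, ids in diff.items())
```

This only works if the IDs are stable across imports, which ties the diff output back to the identifier question raised earlier in the list.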