Enhance/rework update routines #566

jsstevenson · 2025-01-22T20:20:13Z

This was a pretty helpful discussion on the Google doc, and I wanted to preserve some of the brainstorming/solution mining:

- Don’t throw away the whole database on import. Instead, update existing claims as needed and mark claims as deprecated if they have been removed or the source data has marked them as rejected/deprecated
- Re-group these existing claims but keep track of which group they were assigned to before. Display a history of the assigned groups over time
- Regroup only when a claim has changed? Or is regrouping also necessary because the groupers themselves might assign a different group?
- Some known problems:
  - How to do this in real time? Right now we basically take the db offline when importing/grouping.
    - Probably two options here: 1) change the importers/normalizers/groupers to be robust enough to run inline on the prod instance (wrap everything in a transaction? Will that use too much memory?), or 2) develop some sort of data format that the import pipeline can output and that can be loaded into the running instance.
  - How do we link existing claims to their entry in the source document? E.g. if it’s a TSV, there is often no unique identifier to reference them
    - TSVs shouldn’t change over time so there wouldn’t be a re-import
    - Do a review of sources that actually change over time and confirm that there are unique identifiers for at least the genes and drugs.
    - Some source databases that change over time do have unique IDs for the interaction (e.g. CIViC with EIDs). If there isn’t one, how do we make sure we can identify an interaction claim and link it to the same interaction from the source?
  - Do we need to rethink how we import interaction claims? Currently we use a first_or_create, causing multiple source entries with the same gene-drug combination to be imported as one claim. E.g. should multiple CIViC EIDs with the same gene-drug combination be imported as one claim? Even if they are a mix of support/does not support?
  - What should our unique identifier be for interaction claims and interaction groups?
    - James proposed some combination of grouped gene + grouped drug concept as a unique id. Could maybe hash that along with the source name to create some sort of stable id?
    - Do we want to use stable ids from external sources when present, or rely on internal ids for consistency?
      - (This only matters at the claim level)
      - External IDs seem good if available. This is also a potentially useful UI element (link back to primary source).

Advantages:
- Persistent URLs/identifiers for DGIdb claims and groups
- Easier linking from external resources to DGIdb data
- Enable leaving comments/change suggestions (Aim 1.4)

Ideal outputs:
- Data dumps of the DGIdb database + normalizer DBs (for data provenance purposes), plus data TSVs. These are currently generated manually, which is suboptimal
- Diffs of input data (e.g. for summarization in release notes)
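
As a rough illustration of what a diff of input data could look like, the sketch below compares two snapshots of claims keyed by a stable id. The `ClaimRecord` fields, the `diff_claims` function, and the `eid-*` ids are all hypothetical and not taken from the DGIdb codebase; the point is only that with stable ids, "added", "removed" (candidates for deprecation rather than deletion), and "changed" (candidates for regrouping) fall out of simple set operations.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ClaimRecord:
    """Hypothetical minimal view of one interaction claim from a source."""
    claim_id: str          # stable id (external id if the source provides one)
    gene: str
    drug: str
    interaction_type: str


def diff_claims(
    previous: Iterable[ClaimRecord], current: Iterable[ClaimRecord]
) -> Dict[str, List[str]]:
    """Compare two import snapshots and report claim ids by kind of change."""
    prev = {c.claim_id: c for c in previous}
    curr = {c.claim_id: c for c in current}
    return {
        # In the new snapshot only: create these claims.
        "added": sorted(curr.keys() - prev.keys()),
        # In the old snapshot only: mark as deprecated instead of deleting.
        "removed": sorted(prev.keys() - curr.keys()),
        # In both, but some field differs: update and consider regrouping.
        "changed": sorted(k for k in prev.keys() & curr.keys() if prev[k] != curr[k]),
    }


old_snapshot = [
    ClaimRecord("eid-1", "BRAF", "vemurafenib", "inhibitor"),
    ClaimRecord("eid-2", "EGFR", "erlotinib", "inhibitor"),
    ClaimRecord("eid-3", "KIT", "imatinib", "inhibitor"),
]
new_snapshot = [
    ClaimRecord("eid-1", "BRAF", "vemurafenib", "inhibitor"),
    ClaimRecord("eid-2", "EGFR", "erlotinib", "antagonist"),
    ClaimRecord("eid-4", "ALK", "crizotinib", "inhibitor"),
]

report = diff_claims(old_snapshot, new_snapshot)
# report == {"added": ["eid-4"], "removed": ["eid-3"], "changed": ["eid-2"]}
```

The same report could feed both the release-notes summary and the update routine itself, though whether the diff is computed on raw source rows or on normalized claims is an open design choice.
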
