This was a pretty helpful discussion on the Google Doc, and I wanted to preserve some of the brainstorming/solution mining.
- Don’t throw away the whole database on import. Instead, update existing claims as needed and mark claims as deprecated if they have been removed or the source data has marked them as rejected/deprecated
- Re-group these existing claims but keep track of which group they were assigned to before. Display a history of the assigned groups over time
- Regroup only when a claim has changed? Or is regrouping also necessary because the groupers themselves might assign a different group?
- Some known problems:
- How to do this in real time? Right now we basically take the db offline when importing/grouping.
- Probably two options here: 1) Change the importers/normalizers/groupers to be robust enough to run inline on the prod instance (wrap everything in a transaction? Will that use too much memory?). 2) Develop some sort of data format that the import pipeline can output and that can be loaded into the running instance.
- How do we link existing claims to their entry in the source document? E.g., if it’s a TSV, there is often no unique identifier to reference them.
- TSVs shouldn’t change over time so there wouldn’t be a re-import
- Do a review of sources that actually change over time and confirm that there are unique identifiers for at least the genes and drugs.
- Some source databases that change over time do have unique IDs for the interaction (e.g. CIViC with EIDs). If there isn’t one, how do we make sure we can identify an interaction claim and link it to the same interaction from the source?
- Do we need to rethink how we import interaction claims? Currently we use a first_or_create, causing multiple source entries with the same gene-drug combination to be imported as one claim. E.g., should multiple CIViC EIDs with the same gene-drug combination be imported as one claim, even if they are a mix of support/does not support?
- What should our unique identifier be for interaction claims and interaction groups?
- James proposed some combination of grouped gene + grouped drug concept as a unique id. Could maybe hash that along with the source name to create some sort of stable id?
- Do we want to use stable ids from external sources when present, or rely on internal ids for consistency?
- (This only matters at the claim level)
- External IDs seem good if available. This is also a potentially useful UI element (link back to primary source).
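As a rough illustration of the stable ID idea above (hash the source name together with the grouped gene and drug concepts, and prefer an external ID when the source provides one), here is a minimal sketch. The function name, the `dgidb.claim:` prefix, the 16-character truncation, and the lowercase normalization are all assumptions made for illustration, not existing DGIdb code.

```python
import hashlib
from typing import Optional


def stable_claim_id(
    source_name: str,
    gene_concept_id: str,
    drug_concept_id: str,
    external_id: Optional[str] = None,
) -> str:
    """Return a deterministic identifier for an interaction claim.

    Hypothetical helper: the prefix and digest length are arbitrary choices.
    """
    if external_id:
        # Prefer the source's own identifier (e.g. a CIViC EID) when present.
        key_parts = [source_name, external_id]
    else:
        # Fall back to source name + grouped gene concept + grouped drug concept.
        key_parts = [source_name, gene_concept_id, drug_concept_id]

    # Normalize whitespace/case and join with a separator that should not
    # appear inside any part, so ("ab", "c") and ("a", "bc") cannot collide.
    key = "\x1f".join(part.strip().lower() for part in key_parts)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"dgidb.claim:{digest[:16]}"
```

Two things follow from this design. Under the fallback, multiple source entries with the same gene-drug combination still collapse into one ID, which is the same behavior the first_or_create question above is about. The fallback ID also changes whenever a grouper assigns a different concept, so it is only as stable as the grouping itself.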
Advantages:
- Persistent URLs/identifiers for DGIdb claims and groups
- easier linking from external resources to DGIdb data
- Enable leaving comments/change suggestions (Aim 1.4)
Ideal outputs:
- Data dumps of the DGIdb database + normalizer DBs (for data provenance purposes), plus data TSVs. These are currently generated manually, which is suboptimal
- Diffs of input data (e.g. for summarization in release notes)
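Assuming each imported claim can be keyed by some stable ID, a diff of input data could be computed roughly as below. This is only a sketch: `ImportDiff`, `diff_imports`, and the record shape (a flat mapping of field name to value) are hypothetical. The same three buckets could also drive the update strategy discussed above: removed claims get marked as deprecated rather than deleted, and changed claims are candidates for regrouping.

```python
from dataclasses import dataclass, field
from typing import List, Mapping

# Hypothetical record shape: field name -> value for one imported claim.
Record = Mapping[str, str]


@dataclass
class ImportDiff:
    added: List[str] = field(default_factory=list)
    removed: List[str] = field(default_factory=list)
    changed: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line summary, e.g. for release notes."""
        return (
            f"{len(self.added)} added, "
            f"{len(self.removed)} removed, "
            f"{len(self.changed)} changed"
        )


def diff_imports(
    previous: Mapping[str, Record], current: Mapping[str, Record]
) -> ImportDiff:
    """Compare two import snapshots keyed by stable claim ID."""
    diff = ImportDiff()
    # IDs only in the new snapshot are new claims.
    diff.added = sorted(current.keys() - previous.keys())
    # IDs only in the old snapshot would be deprecated, not deleted.
    diff.removed = sorted(previous.keys() - current.keys())
    # IDs in both snapshots count as changed if any field differs.
    diff.changed = sorted(
        claim_id
        for claim_id in current.keys() & previous.keys()
        if dict(current[claim_id]) != dict(previous[claim_id])
    )
    return diff
```

For example, `diff_imports(old_snapshot, new_snapshot).summary()` would give a short string that could be pasted into release notes, where `old_snapshot` and `new_snapshot` are dictionaries built from two runs of the import pipeline.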