Explore gene-gene interaction ingests. Potentially verify StringDB ingest doesn't include text mined associations #947

Open
DnlRKorn opened this issue Jan 30, 2025 · 0 comments
@cmungall
Based on discussion during the Jan 30th, 2025 Monarch Data call. We should discuss where we want to get gene-gene interaction data. Is StringDB meeting our needs, or do we want to move toward more upstream data sources? At a minimum, I believe we want to ensure we are pruning out text-mined information.

Here are some notes

What files are currently coming from STRING, and what data do they have?

From my notes, we are ingesting the following 14 STRING files.

  • stringdb-downloads.org/download/protein.links.detailed.v12.0/10090.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/10116.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/227321.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/284812.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/44689.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/4932.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/6239.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/7227.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/7955.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/8364.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/9031.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/9606.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/9615.protein.links.detailed.v12.0.txt.gz
  • stringdb-downloads.org/download/protein.links.detailed.v12.0/9913.protein.links.detailed.v12.0.txt.gz

Looking at the format of the "protein.links.detailed" files from StringDB, they have columns for "experimental", "database", "textmining", and "combined" scores.
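To make the column layout concrete, here is a minimal sketch of reading one of these files. It assumes the file is space-delimited with a single header row and the column names shown in `SAMPLE`; that layout should be checked against an actual download. The protein IDs and scores below are made up for illustration, and `read_links` is a hypothetical helper, not part of the ingest.

```python
import csv
import io

# Assumed layout of a "protein.links.detailed" file: space-delimited, one
# header row. The rows are invented placeholders, not real STRING data.
SAMPLE = """\
protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
9606.PROT_A 9606.PROT_B 0 0 0 45 134 0 81 173
9606.PROT_A 9606.PROT_C 0 0 0 0 0 0 420 420
"""

SCORE_COLUMNS = ("experimental", "database", "textmining", "combined_score")


def read_links(handle):
    """Yield one dict per row, converting the score columns of interest to int."""
    reader = csv.DictReader(handle, delimiter=" ")
    for row in reader:
        for col in SCORE_COLUMNS:
            row[col] = int(row[col])
        yield row


rows = list(read_links(io.StringIO(SAMPLE)))
for row in rows:
    print(row["protein1"], row["protein2"], {c: row[c] for c in SCORE_COLUMNS})
```

For a real download, the same function could be fed `gzip.open(path, "rt")` instead of the in-memory sample.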

How is the combined score calculated by STRING?

Based upon the information discussed here, http://version10.string-db.org/help/faq/, each score column in STRING has a prior, which is used to adjust the individual scores before they are combined into a single value. The text-mining score is one of the inputs to that combined value.
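As a rough sketch of what that combination looks like, the snippet below follows my understanding of the scheme STRING describes: remove the prior from each channel, combine the channels probabilistically, then add the prior back. The prior value of 0.041 and the exact formula are assumptions that should be confirmed against the STRING documentation, and the homology correction STRING applies to some channels is omitted here. Scores are on a 0 to 1 scale (file values divided by 1000), and the function names are hypothetical.

```python
from math import prod

PRIOR = 0.041  # assumed value of the STRING prior; verify before relying on it


def remove_prior(score, prior=PRIOR):
    """Strip the prior out of a single channel score (0..1 scale)."""
    score = max(score, prior)
    return (score - prior) / (1 - prior)


def combine(scores, prior=PRIOR):
    """Combine prior-corrected channel scores, then add the prior back once."""
    no_prior = [remove_prior(s, prior) for s in scores]
    combined_no_prior = 1 - prod(1 - s for s in no_prior)
    return combined_no_prior + prior * (1 - combined_no_prior)


# Made-up example: experimental = 0.134, textmining = 0.420
with_textmining = combine([0.134, 0.420])
without_textmining = combine([0.134])
print(round(with_textmining, 3), round(without_textmining, 3))
```

Under these assumptions, a combined score recomputed without the text-mining channel is never higher than the original, which is one possible way to prune text-mined evidence without discarding the other channels.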

What processing is currently happening to StringDB ingests?

We can look at the processing step for STRING here:
https://github.com/monarch-initiative/monarch-ingest/blob/24d9e972b9cbb5263dc6f1f5380afc23cfc32cf3/src/monarch_ingest/ingests/string/protein_links.yaml#L46-L51. Our ingest filters on the combined score, which incorporates the text-mining score. It may be better to require that the "experimental" column be greater than 0, or to adopt some other curation methodology.
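The alternative filter could look something like the sketch below. It is written as plain Python rather than in the ingest's own configuration format, the rows are made-up placeholders, and `has_experimental_support` and `min_experimental` are hypothetical names used only for illustration.

```python
def has_experimental_support(row, min_experimental=1):
    """Keep a link only if its experimental score reaches the minimum."""
    return int(row["experimental"]) >= min_experimental


# Invented rows: the second link is supported by text mining alone.
rows = [
    {"protein1": "9606.PROT_A", "protein2": "9606.PROT_B",
     "experimental": 134, "textmining": 81, "combined_score": 173},
    {"protein1": "9606.PROT_A", "protein2": "9606.PROT_C",
     "experimental": 0, "textmining": 420, "combined_score": 420},
]

kept = [row for row in rows if has_experimental_support(row)]
for row in kept:
    print(row["protein1"], row["protein2"], row["experimental"])
```

Note that the second row has the higher combined score but is dropped, since all of its support comes from text mining. A threshold on combined score alone would rank it above the experimentally supported link.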

@DnlRKorn DnlRKorn self-assigned this Jan 30, 2025
@kevinschaper kevinschaper transferred this issue from monarch-initiative/monarch-ingest Feb 3, 2025