Set up a pipeline to extract SGD gene names only for a list of test papers #326

Open
valearna opened this issue Jan 13, 2025 · 8 comments


valearna commented Jan 13, 2025

Run this pipeline on caltech-curation-dev to evaluate how well the extraction works. Select a list of 20 test papers that are in corpus for both WB and SGD and have both C. elegans and S. cerevisiae species tags. Store the results in a CSV file with the paper ID and the list of SGD gene IDs and names.
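A minimal sketch of the CSV output described above, assuming one row per paper with the gene IDs and names joined by `|`. The column names, the separator, the function name `write_gene_results`, and the identifiers in the usage lines are all hypothetical placeholders, not the pipeline's actual format.

```python
import csv
import io


def write_gene_results(rows, handle):
    """Write one CSV row per paper.

    rows: iterable of (paper_id, [(gene_id, gene_name), ...]).
    Column names and the '|' separator are assumptions for illustration.
    """
    writer = csv.writer(handle)
    writer.writerow(["paper_id", "sgd_gene_ids", "sgd_gene_names"])
    for paper_id, genes in rows:
        gene_ids = "|".join(gene_id for gene_id, _ in genes)
        gene_names = "|".join(gene_name for _, gene_name in genes)
        # Papers with no extracted genes still get a row, so they can be
        # reviewed for missed genes.
        writer.writerow([paper_id, gene_ids, gene_names])


# Usage with placeholder identifiers (not real paper or SGD IDs).
buffer = io.StringIO()
write_gene_results(
    [("paper-1", [("gene-id-1", "PUF4")]), ("paper-2", [])],
    buffer,
)
print(buffer.getvalue())
```

Keeping a row for papers with no hits makes it easier to check those papers for missed genes during review.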

valearna added the SGD label Jan 13, 2025
@valearna

@draciti @vanaukenk Here are the results of the SGD gene extraction on 20 papers that are in corpus for both WB and SGD.


draciti commented Jan 27, 2025

Thanks @valearna!
@vanaukenk and I will check those and get back to you.
@draciti will take the first half, @vanaukenk the second half.
We will evaluate whether the genes extracted for SGD are correct and whether any were missed, including in the papers that did not have any gene extracted.


draciti commented Jan 29, 2025

@valearna -- Are we also applying TF-IDF with a threshold of 10 for the list of extracted genes for yeast?

@valearna

Yes, we are using a TF-IDF threshold of 10 for SGD as well.
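For readers unfamiliar with this kind of filter, here is a rough sketch of applying a TF-IDF threshold of 10 to a list of extracted genes. It assumes the score is the raw number of matches in the paper multiplied by the natural log of (corpus size / number of papers mentioning the gene); the formula actually used by the pipeline may differ, and the function names and counts below are made up for illustration.

```python
import math


def tfidf_score(matches_in_paper, papers_with_gene, corpus_size):
    """Assumed scoring: raw match count times natural-log inverse document frequency."""
    return matches_in_paper * math.log(corpus_size / papers_with_gene)


def filter_genes(match_counts, doc_freqs, corpus_size, threshold=10.0):
    """Keep genes whose score in this paper exceeds the threshold.

    match_counts: {gene_name: matches in this paper}
    doc_freqs:    {gene_name: number of corpus papers mentioning the gene}
    A gene missing from doc_freqs is treated as appearing in one paper.
    """
    return sorted(
        gene
        for gene, count in match_counts.items()
        if tfidf_score(count, doc_freqs.get(gene, 1), corpus_size) > threshold
    )


# Made-up counts: a gene mentioned a few times but rare in the corpus passes,
# while a gene that is common across the corpus is filtered out.
kept = filter_genes(
    match_counts={"PUF4": 3, "COMMON1": 5},
    doc_freqs={"PUF4": 10, "COMMON1": 900},
    corpus_size=1000,
)
print(kept)
```

Under this assumed formula, a single mention of a gene rarely clears a threshold of 10 unless the corpus is very large, so the filter favors genes that are both repeated in the paper and uncommon elsewhere.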


draciti commented Jan 31, 2025

@vanaukenk FYI, I am done with the first half.

@vanaukenk

I've finished checking my S. cerevisiae papers.

Generally, I thought the ACKnowledge pipeline did quite well, but one major thing I saw was that we will need to account for protein names that are different from gene names and decide how we want to find those. We missed some SGD gene associations because the mention was actually a protein, for example Puf4p instead of PUF4.
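One possible way to catch mentions like Puf4p is sketched below. It relies on the common S. cerevisiae convention that a protein is written as the gene name with only the first letter capitalized and a trailing "p" (Puf4p for PUF4). The regular expression and function name are hypothetical, and this is not how the ACKnowledge pipeline currently works. It only handles standard three-letter gene names; protein names that are unrelated to the gene name would still need a synonym lookup.

```python
import re

# Three letters (first one capitalized), one or more digits, trailing "p".
# This pattern is an assumption covering standard names such as Puf4p.
PROTEIN_MENTION = re.compile(r"\b([A-Z][a-z]{2}\d+)p\b")


def gene_names_from_protein_mentions(text):
    """Return candidate gene names (uppercased) for protein-style mentions in text."""
    return sorted({match.group(1).upper() for match in PROTEIN_MENTION.finditer(text)})


print(gene_names_from_protein_mentions("Binding of Puf4p to the transcript was reduced."))
```

The candidates returned this way would still need to be checked against the SGD gene list before being reported, since not every string of this shape is a real gene.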
