Set up a pipeline to extract SGD gene names only for a list of test papers #326
Comments
@draciti @vanaukenk Here are the results of the SGD gene extraction on 20 papers that are in the corpus for both WB and SGD.
Thanks @valearna!
@valearna Are we also applying TF-IDF with a threshold of 10 to the list of extracted genes for yeast?
Yes, we are also using a TF-IDF threshold of 10 for SGD.
@vanaukenk FYI, I am done with the first half.
I've finished checking my S. cerevisiae papers. Generally, I thought the ACKnowledge pipeline did quite well, but one major thing I saw was that we will need to account for protein names that are different from gene names and decide how we want to find those. We missed some SGD gene associations because the mention was actually a protein, for example Puf4p instead of PUF4.
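One possible way to catch protein mentions such as Puf4p is sketched below. It is only an illustration, not what the ACKnowledge pipeline does: it assumes the common yeast convention where the protein for gene PUF4 is written Puf4p, and the function name `find_protein_mentions` and the `known_genes` set are hypothetical. Systematic ORF names and other naming styles are not handled.

```python
import re

# Assumption: proteins follow the common S. cerevisiae convention
# "Xxx<number>p" for a gene written "XXX<number>" (e.g. Puf4p -> PUF4).
_PROTEIN_PATTERN = re.compile(r"\b([A-Z][a-z]{2}[0-9]+)p\b")


def find_protein_mentions(text, known_genes):
    """Return (protein mention, gene name) pairs found in text.

    known_genes is a set of upper-case standard gene names; a candidate is
    kept only if its upper-cased stem is in that set, which limits false
    positives from ordinary words that happen to match the pattern.
    """
    hits = []
    for match in _PROTEIN_PATTERN.finditer(text):
        gene = match.group(1).upper()
        if gene in known_genes:
            hits.append((match.group(0), gene))
    return hits


# Hypothetical usage with a one-gene lexicon.
mentions = find_protein_mentions("Puf4p binds the mRNA.", {"PUF4"})
```

Checking candidates against the existing gene list keeps the rule conservative; whether protein mentions should count toward the TF-IDF threshold is a separate decision.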
Run this pipeline on caltech-curation-dev to evaluate how well the extraction works. Select a list of 20 test papers that are in the corpus for both WB and SGD and have C. elegans and S. cerevisiae species tags. Store the results in a CSV file with the paper ID and the list of SGD gene IDs and names.
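A minimal sketch of the CSV output described above, under stated assumptions: the column names, the `|` separator inside a cell, and the functions `build_rows` and `write_results` are all hypothetical, and the paper and gene IDs in the usage lines are placeholders, not real identifiers.

```python
import csv


def build_rows(results):
    """Turn {paper_id: [(gene_id, gene_name), ...]} into CSV rows.

    Gene IDs and names are joined with "|" so each paper stays on one row.
    """
    rows = [["paper_id", "sgd_gene_ids", "sgd_gene_names"]]
    for paper_id, genes in results.items():
        gene_ids = "|".join(gene_id for gene_id, _ in genes)
        gene_names = "|".join(name for _, name in genes)
        rows.append([paper_id, gene_ids, gene_names])
    return rows


def write_results(path, results):
    """Write the extraction results to a CSV file at path."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(build_rows(results))


# Placeholder identifiers, for illustration only.
example = {"PAPER_1": [("SGD:PLACEHOLDER1", "PUF4")]}
rows = build_rows(example)
```

Keeping one row per paper makes it easy to compare the output against a curator's list for the same paper.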