Initially this is just a pre-processed, Snorkel-format dump of PubTator. We will be adding more soon!
The easiest way to get started is to download a preprocessed Snorkel PostgreSQL database dump. This is a 142 GB file and is ready to use directly with Snorkel.
To reload, just use psql snorkel-biocorpus < snorkel_biocorpus.sql
- PubMed abstracts
XXX PubMed Abstracts
XXX 19XX - 2017
You can rebuild the entire PubTator database from scratch as follows:
run install.sh
This will download the current PubTator snapshot (~10GB compressed; 32GB raw) from ftp.ncbi.nlm.nih.gov
Parsing using 16 cores with the spaCy parser takes around XX hours. Parsing with CoreNLP will take longer.