Snorkel BioCorpus

For now, this is a preprocessed, Snorkel-format dump of PubTator. More sources will be added soon.

Database Snapshot

The easiest way to get started is to download a preprocessed Snorkel PostgreSQL database dump. This is a 142 GB file and is ready to use directly with Snorkel.

To reload the dump, run psql snorkel-biocorpus < snorkel_biocorpus.sql (the snorkel-biocorpus database must already exist).
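
The reload command above can be wrapped in a small helper that guards against a missing dump file. This is a minimal sketch, not part of the repository: the function name reload_biocorpus is hypothetical, the createdb step assumes the database may not exist yet, and ON_ERROR_STOP is added so that psql halts on the first SQL error rather than continuing with a partial load.

```shell
#!/usr/bin/env bash
# Sketch: reload the Snorkel BioCorpus dump into a local PostgreSQL database.
# DB_NAME and DUMP_FILE default to the names used in the reload command;
# reload_biocorpus is a hypothetical helper, not something the repo provides.

DB_NAME="${DB_NAME:-snorkel-biocorpus}"
DUMP_FILE="${DUMP_FILE:-snorkel_biocorpus.sql}"

reload_biocorpus() {
    # Refuse to run if the dump file is missing.
    if [ ! -f "$DUMP_FILE" ]; then
        echo "dump file not found: $DUMP_FILE" >&2
        return 1
    fi
    # Assumed step: create the database; ignore the error if it already exists.
    createdb "$DB_NAME" 2>/dev/null || true
    # Load the dump, stopping at the first SQL error.
    psql -v ON_ERROR_STOP=1 "$DB_NAME" < "$DUMP_FILE"
}
```

After sourcing the snippet, calling reload_biocorpus from the directory that holds the dump performs the load. With a 142 GB file, expect this to run for a long time.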

Sources

PubMed abstracts

Summary Statistics

XXX PubMed Abstracts
XXX 19XX - 2017

Entity Tags

Genes (GNormPlus)
Diseases (DNorm)
Chemicals (tmChem)
Species (SR4GN)
Mutations (tmVar)

Building the Database

Full PubTator Snapshot

You can rebuild the entire PubTator database from scratch as follows:

Run install.sh.

This downloads the current PubTator snapshot (~10 GB compressed; 32 GB raw) from ftp.ncbi.nlm.nih.gov.
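
Because the snapshot is large, it may be worth checking free disk space before starting. The sketch below is an illustration only: the helper has_free_space is hypothetical, and the 42 GB default simply adds the compressed and raw sizes quoted above. It does not account for the space the resulting database will need.

```shell
#!/usr/bin/env bash
# Sketch: check free disk space before running install.sh.
# REQUIRED_GB=42 is an assumption (about 10 GB compressed + 32 GB raw).

REQUIRED_GB="${REQUIRED_GB:-42}"

has_free_space() {
    # $1: directory to check, $2: required free space in GB
    local dir="$1" need_gb="$2" avail_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    # Succeed only if the available space covers the requirement.
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

A possible use is has_free_space . "$REQUIRED_GB" && bash install.sh, which runs the installer only if the current directory's filesystem has enough room. Invoking the script with bash is also an assumption.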

Parsing using 16 cores with the spaCy parser takes around XX hours. Parsing with CoreNLP will take longer.
