Snorkel BioCorpus

Initially, this is just a preprocessed, Snorkel-format dump of PubTator. More sources will be added soon.

Database Snapshot

The easiest way to get started is to download a preprocessed Snorkel PostgreSQL database dump. This is a 142 GB file and is ready to use directly with Snorkel.

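To reload the dump, run the following command. Note that psql connects to an existing database, so a database named snorkel-biocorpus must already exist:

    psql snorkel-biocorpus < snorkel_biocorpus.sql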
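The reload step can also be scripted. The following is a minimal sketch, not part of this repository: it assumes the dump is a plain SQL file, that the target database does not exist yet, and that the PostgreSQL client tools (createdb, psql) are on the PATH. The function name reload_commands and its default arguments are hypothetical. It only builds and prints the commands, so nothing is executed until you pass them to a runner yourself.

```python
import shlex


def reload_commands(db_name="snorkel-biocorpus",
                    dump_path="snorkel_biocorpus.sql"):
    """Return the commands that create the database and load the dump.

    Hypothetical helper: the names and defaults are assumptions taken
    from the command shown in this README.
    """
    return [
        # Create an empty database to receive the dump.
        ["createdb", db_name],
        # Load the dump; ON_ERROR_STOP makes psql abort on the first error
        # instead of continuing through a very large file.
        ["psql", "--set=ON_ERROR_STOP=1",
         "--dbname", db_name, "--file", dump_path],
    ]


# Print the commands for review rather than running them.
for cmd in reload_commands():
    print(shlex.join(cmd))
```

To actually run the commands, each list can be passed to subprocess.run(cmd, check=True) from the standard library.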

Sources

PubMed abstracts

Summary Statistics

XXX PubMed Abstracts
XXX 19XX - 2017

Entity Tags

Genes (GNormPlus)
Diseases (DNorm)
Chemicals (tmChem)
Species (SR4GN)
Mutations (tmVar)
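
For orientation, here is a minimal sketch of how these tags appear in the raw PubTator text format, assuming the usual layout: a title line (PMID|t|...), an abstract line (PMID|a|...), then one tab-separated line per mention (PMID, start, end, mention, type, identifier), with a blank line between documents. This is not the parser in parse_pubtator.py; the function name parse_pubtator, the Annotation tuple, and the sample record (including its PMID) are made up for illustration.

```python
from collections import namedtuple

Annotation = namedtuple(
    "Annotation", "pmid start end mention entity_type identifier")


def parse_pubtator(lines):
    """Yield (pmid, title, abstract, annotations) for each document."""
    pmid, title, abstract, annotations = None, "", "", []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current document.
            if pmid is not None:
                yield pmid, title, abstract, annotations
            pmid, title, abstract, annotations = None, "", "", []
            continue
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            pmid = head[0]
            if head[1] == "t":
                title = head[2]
            else:
                abstract = head[2]
            continue
        fields = line.split("\t")
        # Keep only mention lines; skip anything without numeric offsets.
        if len(fields) >= 5 and fields[1].isdigit() and fields[2].isdigit():
            identifier = fields[5] if len(fields) > 5 else ""
            annotations.append(Annotation(
                fields[0], int(fields[1]), int(fields[2]),
                fields[3], fields[4], identifier))
    if pmid is not None:
        yield pmid, title, abstract, annotations


# Illustrative record only; the PMID and text are invented.
SAMPLE = (
    "12345|t|BRCA1 mutations in breast cancer\n"
    "12345|a|We study BRCA1 in patients.\n"
    "12345\t0\t5\tBRCA1\tGene\t672\n"
    "12345\t19\t32\tbreast cancer\tDisease\tMESH:D001943\n"
)

for pmid, title, abstract, annotations in parse_pubtator(SAMPLE.splitlines()):
    for ann in annotations:
        print(pmid, ann.entity_type, ann.mention, ann.identifier)
```

Because the function is a generator that keeps only one document in memory at a time, the same loop can be pointed at an open file handle for the full snapshot.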

Building the Database

Full PubTator Snapshot

You can rebuild the entire PubTator database from scratch as follows:

run install.sh

This will download the current PubTator snapshot (~10 GB compressed; 32 GB raw) from ftp.ncbi.nlm.nih.gov.

Parsing using 16 cores with the spaCy parser takes around XX hours. Parsing with CoreNLP will take longer.
