
Quick Start


Requirements

Usage Instructions

1. Clone the repository or download the scripts.

The repository contains the following directories:

  • common - the knowledge resource class, used by the other models to store the path data extracted from the corpus.
  • corpus - code for parsing the corpus and extracting paths.
  • dataset - the datasets used in the paper.
  • train - code for training and testing the LexNET model, and pre-trained models for the datasets.

2. Get a parsed corpus, by either:

  • Downloading the corpus used in the paper: we used the English Wikipedia dump from May 2015 as the corpus, and computed the paths between the most frequent unigrams, bigrams, and trigrams in Wikipedia (unigrams from the GloVe vocabulary, plus the 100k most frequent bigrams and trigrams). The files for the Wiki corpus are available here.
  • Creating a custom parsed corpus: run the parse_wikipedia script and then create_resource_from_corpus. The first script creates a triplet file of the paths, formatted as x\ty\tpath; the second creates the .db files under the provided directory (a short sketch of the triplet format follows this list). See the Detailed Guide for additional information.
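
As a quick sanity check on a custom corpus, the triplet file produced by the first script can be inspected with a few lines of Python. The snippet below is only an illustrative sketch of the x\ty\tpath format described above; the file name wiki_triplets.txt and the helper function are placeholders, not part of the repository.

```python
# Illustrative sketch only: inspect a triplet file in the x\ty\tpath format
# produced by parse_wikipedia. 'wiki_triplets.txt' is a placeholder name;
# substitute the file your own run actually wrote.

def read_path_triplets(triplet_file):
    """Yield (x, y, path) tuples from a tab-separated triplet file."""
    with open(triplet_file) as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            x, y, path = line.split('\t')
            yield x, y, path

if __name__ == '__main__':
    for i, (x, y, path) in enumerate(read_path_triplets('wiki_triplets.txt')):
        print(x, y, path)
        if i >= 4:  # print only the first few entries
            break
```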

3. Get a dataset:

The datasets used in the paper are available in the datasets directory. Each dataset is split into train, test, and validation sets.

Alternatively, you can provide your own dataset. The directory needs to contain three files whose names end with '_train.tsv', '_val.tsv', and '_test.tsv' for the train, validation, and test sets, respectively. Each line is a separate entry, formatted as x\ty\trelation.
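
A custom dataset directory can be checked against this layout before training. The sketch below is an illustrative example, not repository code: the directory name datasets/custom and the prefix custom are placeholders for your own dataset.

```python
import os

# Illustrative sketch: load a custom dataset that follows the expected layout
# (three TSV files, each line formatted as x\ty\trelation).
# 'datasets/custom' and the 'custom' prefix are placeholder names.

def load_split(path):
    """Return a list of (x, y, relation) tuples from one TSV split file."""
    with open(path) as f:
        return [tuple(line.rstrip('\n').split('\t')) for line in f if line.strip()]

dataset_dir = 'datasets/custom'   # placeholder: your dataset directory
prefix = 'custom'                 # placeholder: your dataset prefix

splits = {}
for split in ('train', 'val', 'test'):
    split_file = os.path.join(dataset_dir, '%s_%s.tsv' % (prefix, split))
    splits[split] = load_split(split_file)
    print('%s: %d examples' % (split, len(splits[split])))
```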

4. Train the model:

Run the script train_integrated.py. The script trains several models, tuning the word dropout rate and the learning rate using the validation set. The best performing model on the validation set is saved and evaluated on the test set. See the Detailed Guide for additional information.
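
For intuition, the tuning procedure described above amounts to a small grid search: train one model per (word dropout rate, learning rate) pair, keep the one that scores best on the validation set, and evaluate only that model on the test set. The sketch below is purely conceptual and is not the code of train_integrated.py; train_model(), evaluate(), the grid values, and the empty splits are hypothetical stand-ins.

```python
import random

def train_model(train_set, word_dropout, learning_rate):
    """Hypothetical stand-in for training a single model with these settings."""
    return {'word_dropout': word_dropout, 'learning_rate': learning_rate}

def evaluate(model, dataset):
    """Hypothetical stand-in for scoring a model (e.g. F1) on a dataset."""
    return random.random()

train_set, val_set, test_set = [], [], []   # placeholders for the real splits

# Grid search over word dropout rate and learning rate (values are assumed);
# keep the model that performs best on the validation set.
best_model, best_val_score = None, -1.0
for dropout in (0.0, 0.2, 0.4):
    for lr in (0.001, 0.01, 0.1):
        model = train_model(train_set, word_dropout=dropout, learning_rate=lr)
        score = evaluate(model, val_set)
        if score > best_val_score:
            best_model, best_val_score = model, score

test_score = evaluate(best_model, test_set)  # only the best model is tested
```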

Since the datasets we used in this work differ from each other, we recommend training the model rather than using pre-trained models. If you prefer using our pre-trained models, some are available here.