## Quick Start
The repository contains the following directories:
- common - the knowledge resource class, which is used by other models to save the path data from the corpus.
- corpus - code for parsing the corpus and extracting paths.
- dataset - the datasets used in the paper.
- train - code for training and testing the LexNET model, and pre-trained models for the datasets.
- Downloading the corpus used in the paper: We used the English Wikipedia dump from May 2015 as the corpus. We computed the paths between the most frequent unigrams, bigrams and trigrams in Wikipedia (based on GloVe vocabulary and the most frequent 100k bigrams and trigrams). The files for the Wiki corpus are available here.
- Creating a custom parsed corpus: run the script `parse_wikipedia` and then `create_resource_from_corpus`. The first script creates a triplet file of the paths, formatted as `x\ty\tpath` (see the sketch after this list). The second script creates the `.db` files under the provided directory. See the Detailed Guide for additional information.
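The exact command-line arguments of both scripts are described in the Detailed Guide. As a rough illustration of the intermediate triplet format only (the file name below is hypothetical), a triplet file could be read and grouped by term pair like this:

```python
from collections import defaultdict

def read_triplet_file(triplet_file):
    """Read a parsed-corpus triplet file, one tab-separated
    (x, y, path) entry per line, and group the dependency paths
    by (x, y) term pair."""
    paths_by_pair = defaultdict(list)
    with open(triplet_file, encoding='utf-8') as f_in:
        for line in f_in:
            # Each line records one corpus occurrence: the two terms
            # and the dependency path that connected them.
            x, y, path = line.rstrip('\n').split('\t')
            paths_by_pair[(x, y)].append(path)
    return paths_by_pair

# Hypothetical file name -- use the triplet file produced by parse_wikipedia.
pair_paths = read_triplet_file('wiki_triplets.txt')
```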
The following datasets, used in the paper, are available in the datasets directory. Each dataset is split into train, test, and validation sets.
- K&H+N (Necsulescu et al., 2015)
- BLESS (Baroni and Lenci, 2011)
- EVALution (Santus et al., 2015)
- ROOT09 (Santus et al., 2016)
Alternatively, you can provide your own dataset. The directory needs to contain 3 files, whose names end with '_train.tsv', '_val.tsv', and '_test.tsv' for the train, validation, and test sets respectively. Each line is a separate entry, formatted as `x\ty\trelation`.
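Assuming the tab-separated format above, a minimal sketch for loading the splits might look like the following (the file prefix `my_dataset` is hypothetical):

```python
def load_dataset(dataset_file):
    """Load one dataset split: one tab-separated (x, y, relation)
    entry per line."""
    with open(dataset_file, encoding='utf-8') as f_in:
        return [tuple(line.rstrip('\n').split('\t')) for line in f_in]

# Hypothetical prefix -- the three split files must share it.
train_set = load_dataset('my_dataset_train.tsv')
val_set = load_dataset('my_dataset_val.tsv')
test_set = load_dataset('my_dataset_test.tsv')
```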
Run the script `train_integrated.py`. The script trains several models, tuning the word dropout rate and the learning rate using the validation set. The best-performing model on the validation set is saved and evaluated on the test set.
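In outline, the tuning amounts to a grid search over the two hyperparameters. The sketch below (reusing the splits loaded above) is an assumption about the control flow, not the script itself: `train_model` and `evaluate` are placeholder stand-ins, and the grids are hypothetical.

```python
import itertools

def train_model(train_set, word_dropout, learning_rate):
    """Placeholder for the repository's training routine."""
    return {'word_dropout': word_dropout, 'learning_rate': learning_rate}

def evaluate(model, dataset):
    """Placeholder for the repository's evaluation routine (e.g. F1)."""
    return 0.0

# Hypothetical search grids -- the actual values tried by
# train_integrated.py may differ.
word_dropout_rates = [0.0, 0.2, 0.4]
learning_rates = [0.001, 0.0001]

best_model, best_val_score = None, float('-inf')
for dropout, lr in itertools.product(word_dropout_rates, learning_rates):
    model = train_model(train_set, word_dropout=dropout, learning_rate=lr)
    val_score = evaluate(model, val_set)
    if val_score > best_val_score:
        best_model, best_val_score = model, val_score

# Only the model that performed best on the validation set is
# saved and evaluated on the held-out test set.
test_score = evaluate(best_model, test_set)
```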
See the Detailed Guide for additional information.
Since the datasets used in this work differ from each other, we recommend training the model on your own data rather than relying on the pre-trained models. If you prefer to use our pre-trained models, some are available here.