A named entity recognition (NER) model (Bi-LSTM + CRF + word embeddings), implemented in TensorFlow, for tagging noisy Turkish text (tweets in particular) without any hand-crafted features or rules.
The model closely follows Lample et al., Gungor et al. and Ma and Hovy. Accordingly, the source code is heavily influenced by Guillaume Genthial's sequence_tagging and Guillaume Lample's tagger projects.
- Python (3 or newer)
- pip, virtualenv, make
- Create an isolated environment with:
virtualenv -p /usr/bin/python3 virtual-env
source virtual-env/bin/activate
pip install -r requirements.txt
Hint: When you are done working, type deactivate to exit the virtual environment.
- Download the word2vec vectors with
make word2vec
Alternatively, you can download them manually here and update the filename_word2vec entry in model/config.py. You can also choose not to load pretrained word vectors by setting the use_pretrained entry to False in model/config.py.
- Build the training data, train and evaluate the model with
make run
Here is a breakdown of the commands executed by make run:
- Build the vocabulary from the data and extract trimmed word2vec vectors according to the config in model/config.py.
python build_data.py
- Train the model with
python train.py
# Or redirect everything into a log file and detach the process by typing:
# python train.py >> out.log 2>&1 & disown
- Evaluate and interact with the model with
python evaluate.py
Data iterators and utils are in model/data_utils.py, and the model with its training/test procedures is in model/ner_model.py.
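The vocabulary-building and vector-trimming step can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the code in build_data.py: the function names `build_vocab` and `trim_vectors` are hypothetical, the vectors are assumed to be in the plain-text word2vec format (one word followed by its values per line), and plain Python lists stand in for whatever array type the project uses.

```python
from typing import Dict, Iterable, List, Set, Tuple

Sentence = List[Tuple[str, str]]


def build_vocab(sentences: Iterable[Sentence]) -> Set[str]:
    """Collect every distinct word that appears in the tagged sentences."""
    return {word for sentence in sentences for word, _ in sentence}


def trim_vectors(
    vocab: Set[str], vector_lines: Iterable[str], dim: int
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only the pretrained vectors whose word occurs in the vocabulary.

    Hypothetical helper: words without a pretrained vector keep an
    all-zero row of length `dim`.
    """
    word_to_id = {word: i for i, word in enumerate(sorted(vocab))}
    rows = [[0.0] * dim for _ in word_to_id]
    for line in vector_lines:
        parts = line.rstrip().split(" ")
        # Skip anything that is not "word v1 ... v_dim", such as the
        # "count dim" header line of the text word2vec format.
        if len(parts) != dim + 1:
            continue
        word = parts[0]
        if word in word_to_id:
            rows[word_to_id[word]] = [float(x) for x in parts[1:]]
    return word_to_id, rows


# Tiny made-up example: three pretrained vectors, two words in the data.
sentences = [[("John", "B-PER"), ("lives", "O")]]
vector_lines = ["3 2", "John 0.1 0.2", "Paris 0.3 0.4", "lives 0.5 0.6"]
word_to_id, rows = trim_vectors(build_vocab(sentences), vector_lines, dim=2)
```

Trimming keeps the embedding matrix as small as the dataset's vocabulary, so the full pretrained file does not need to be loaded at training time.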
The training data must be in the following format (identical to the CoNLL-2003 dataset), with one token and its tag per line, separated by a space.
A default test file is provided to help you get started.
John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
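For illustration, a file in this format could be read with a small generator like the one below. It is a sketch, not the iterator in model/data_utils.py: the name `read_conll` is hypothetical, sentences are assumed to be separated by blank lines (as in CoNLL-2003), and a token with no tag column is assumed to fall back to `O`.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str], default_tag: str = "O") -> Iterator[Sentence]:
    """Yield sentences as lists of (word, tag) pairs from CoNLL-style lines."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        word = parts[0]
        # Assumption: the tag is the last column; a lone token gets default_tag.
        tag = parts[-1] if len(parts) > 1 else default_tag
        sentence.append((word, tag))
    # Emit the final sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


sample = """John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
"""
sentences = list(read_conll(sample.splitlines()))
```

To read from disk instead, pass an open file object, for example `read_conll(open("data/tr.train.iobes", encoding="utf-8"))`.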
Once you have produced your data files, change the parameters in model/config.py, for example:
# dataset
filename_dev = "data/tr.testa.iobes"
filename_test = "data/tr.testb.iobes"
filename_train = "data/tr.train.iobes"
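The .iobes extension in these file names suggests IOBES-tagged data, although this README does not state that explicitly, and the format example uses IOB tags. If your data is in IOB and you need IOBES, a conversion could look like the following sketch. The function name `iob_to_iobes` is an illustrative choice, and the input is assumed to be well-formed IOB tags for a single sentence.

```python
from typing import List


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert one sentence's IOB tags (B-, I-, O) to IOBES (adds S-, E-)."""
    iobes: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A single-token entity becomes S-; otherwise B- is kept.
            iobes.append(tag if continues else "S-" + label)
        elif prefix == "I":
            # The last token of a multi-token entity becomes E-.
            iobes.append(tag if continues else "E-" + label)
        else:
            raise ValueError("unexpected tag: " + tag)
    return iobes


converted = iob_to_iobes(["B-PER", "O", "O", "B-LOC", "I-LOC", "O"])
```

With the example sentence from the format section, the single-token entity John becomes S-PER and the final token of New York becomes E-LOC.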
This project is licensed under the terms of the Apache 2.0 license (as are TensorFlow and its derivatives). If you use it for research, a citation would be appreciated.