blstm-crf-ner

A NER model (B-LSTM + CRF + word embeddings) implemented in TensorFlow. It tags noisy Turkish data (tweets specifically!) without using any hand-crafted features or rules.

The model is very similar to those of Lample et al., Gungor et al., and Ma and Hovy. The source code is heavily influenced by Guillaume Genthial's sequence_tagging and Guillaume Lample's tagger projects.

Prerequisites

Python (3 or newer)
pip, virtualenv, make

Getting started

Create an isolated environment with:

virtualenv -p /usr/bin/python3 virtual-env
source virtual-env/bin/activate
pip install -r requirements.txt

Hint: When you are done working, type deactivate to exit the virtual environment.

Download the word2vec vectors with

make word2vec

Alternatively, you can download them manually and update the filename_word2vec entry in model/config.py. You can also skip loading pretrained word vectors by setting the use_pretrained entry to False in model/config.py.

Build the training data, train and evaluate the model with

make run

Details

Here is the breakdown of the commands executed in make run:

Build vocab from the data and extract trimmed word2vec vectors according to the config in model/config.py.

python build_data.py
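
As a rough illustration of what "build vocab and extract trimmed vectors" means, the sketch below collects the distinct words of a dataset and keeps only the pretrained vectors for those words. The function names build_vocab and trim_embeddings and the toy data are hypothetical; the actual build_data.py may store vectors differently and handle unknown words in its own way.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer id (sorted for reproducibility)."""
    words = sorted({word for sentence in sentences for word, _tag in sentence})
    return {word: idx for idx, word in enumerate(words)}


def trim_embeddings(vocab, pretrained, dim):
    """Build an embedding table with one row per vocab word.

    Words without a pretrained vector keep an all-zero row.
    Pretrained words that are not in the vocab are dropped.
    """
    table = [[0.0] * dim for _ in vocab]
    for word, idx in vocab.items():
        vector = pretrained.get(word)
        if vector is not None:
            table[idx] = list(vector)
    return table


# Toy data (assumed for illustration only).
sentences = [
    [("John", "B-PER"), ("lives", "O")],
    [("New", "B-LOC"), ("York", "I-LOC")],
]
pretrained = {"John": [0.1, 0.2], "York": [0.3, 0.4], "Paris": [0.5, 0.6]}

vocab = build_vocab(sentences)
table = trim_embeddings(vocab, pretrained, dim=2)
print(len(table), table[vocab["John"]])
```

Trimming keeps the embedding table as small as the dataset vocabulary, so the full word2vec file does not need to be loaded at training time.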

Train the model with

python train.py
# Or redirect everything into a log file and detach the process by typing:
# python train.py >> out.log 2>&1 & disown

Evaluate and interact with the model with

python evaluate.py
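
NER output is commonly scored at the entity level: a predicted entity counts as correct only if its label and both boundaries match the gold annotation. The README does not state which metric evaluate.py reports, so the sketch below is only an assumed illustration of that common approach, with hypothetical helper names and IOB-style tags.

```python
def extract_chunks(tags):
    """Return a set of (label, start, end) entity spans, end exclusive."""
    chunks = set()
    label, start = None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        inside = tag.startswith("I-") and tag[2:] == label
        if label is not None and not inside:
            chunks.add((label, start, i))
            label, start = None, None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return chunks


def entity_scores(gold_tags, pred_tags):
    """Compute entity-level precision, recall and F1 for one sentence."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "O", "O"]  # LOC boundary is wrong
print(entity_scores(gold, pred))
```

In this toy case the person entity matches but the location span is cut short, so only one of two entities counts as correct.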

Data iterators and utils are in model/data_utils.py, and the model with its training/test procedures is in model/ner_model.py.

Training Data

A default test file is provided to help you get started.

The training data must be in the following format (identical to the CoNLL2003 dataset): one token and its tag per line, separated by a space, with a blank line between sentences.

John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
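
To make the format concrete, here is a minimal reader for files laid out this way. It is a sketch, not the iterator in model/data_utils.py; the name read_conll is hypothetical, and it assumes whitespace-separated columns with the tag in the last column.

```python
import io


def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs.

    A blank line ends the current sentence.
    """
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError("missing tag on line: %r" % line)
        sentence.append((parts[0], parts[-1]))
    if sentence:  # file may not end with a blank line
        yield sentence


sample = """John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
"""

sentences = list(read_conll(io.StringIO(sample)))
print(len(sentences), sentences[0][3])
```

The same function works on a real file by passing an open file handle instead of the io.StringIO object. A token line without a tag raises an error instead of being skipped silently.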

Once you have produced your data files, update the dataset paths in model/config.py, for example:

# dataset
filename_dev = "data/tr.testa.iobes"
filename_test = "data/tr.testb.iobes"
filename_train = "data/tr.train.iobes"
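
The .iobes suffix in these file names suggests the IOBES tagging scheme, while the sample shown earlier uses IOB tags. Whether the project expects a conversion is not stated here, so the following is only a sketch, under that assumption, of how IOB tags could be turned into IOBES; iob_to_iobes is a hypothetical helper name.

```python
def iob_to_iobes(tags):
    """Convert IOB tags (B-/I-/O) to IOBES (adds S- and E- prefixes)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # entity goes on after this token
        if prefix == "B":
            # A single-token entity becomes S-, otherwise B- is kept.
            converted.append(tag if continues else "S-" + label)
        elif prefix == "I":
            # The last token of an entity becomes E-, otherwise I- is kept.
            converted.append(tag if continues else "E-" + label)
        else:
            raise ValueError("unexpected tag: %r" % tag)
    return converted


iobes = iob_to_iobes(["B-PER", "O", "O", "B-LOC", "I-LOC", "O"])
print(iobes)
```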

License

This project is licensed under the terms of the Apache 2.0 license (the same as TensorFlow and its derivatives). If you use it for research, a citation would be appreciated.
