SemanticComputing/pertti

pertti

About

Named entity recognition built on top of BERT and keras-bert. This guide is an updated version of the original: https://github.com/jouniluoma/keras-bert-ner

The service is compiled from J. Luoma's and S. Pyysalo's original demos.

API

The service also provides an API for testing. The API description is available in Swagger.

Publications

  • Minna Tamper, Arttu Oksanen, Jouni Tuominen, Aki Hietanen and Eero Hyvönen: Automatic Annotation Service APPI: Named Entity Linking in Legal Domain. The Semantic Web: ESWC 2020 Satellite Events (Harth, Andreas, Presutti, Valentina, Troncy, Raphaël, Acosta, Maribel, Polleres, Axel, Fernández, Javier D., Xavier Parreira, Josiane, Hartig, Olaf, Hose, Katja and Cochez, Michael (eds.)), Lecture Notes in Computer Science, vol. 12124, pp. 208-213, Springer-Verlag, 2020.

Dependencies:

Note that this project uses tensorflow 1.11 (also install tensorflow-gpu: pip install tensorflow-gpu==1.14). Either downgrade tensorflow to use this project or run it in its own sandbox. Some of the dependencies are added as part of the installation.

Install the following dependencies preferably in the given order:

Pretrained BERT model, e.g. from:

input data e.g. from:

Input data is expected to be in a CoNLL-like format in which Token and Tag are tab-separated: the first string on each line is the Token and the second is the Tag.
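As a rough illustration of this format, the sketch below reads tab-separated Token/Tag lines into sentences. The function name read_conll is hypothetical and not part of this repository, and treating a blank line as a sentence boundary is an assumption based on common CoNLL conventions.

```python
def read_conll(lines):
    """Parse tab-separated Token/Tag lines into a list of sentences.

    Each sentence is a list of (token, tag) tuples.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected Token<TAB>Tag, got: {line!r}")
        # First field is the Token, second field is the Tag.
        current.append((fields[0], fields[1]))
    if current:
        sentences.append(current)
    return sentences


sample = "Tarja\tB-PER\nHalosen\tI-PER\nelämän\tO\n\nEnglannissa\tB-LOC\n"
sentences = read_conll(sample.splitlines())
print(sentences)
```

A malformed line raises an error rather than being skipped, which makes format problems visible before training starts.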

Quickstart

Get submodules

git submodule init
git submodule update

Get pretrained models and data

./scripts/get-models.sh
./scripts/get-finer.sh
./scripts/get-turku-ner.sh

Experiment on Turku NER corpus data (run-turku-ner.sh trains, predict-turku-ner.sh outputs predictions)

./scripts/run-turku-ner.sh
./scripts/predict-turku-ner.sh
python compare.py data/turku-ner/test.tsv turku-ner-predictions.tsv 

Run an experiment on FiNER news data

./scripts/run-finer-news.sh

Start the pertti service (the --ner_model_dir parameter can be changed to point to another trained model)

python serve.py --ner_model_dir finer-news-model/

All options to run the service are:

  • -h, --help (help)
  • --batch_size (Batch size for training)
  • --output_file (File to write predicted outputs to)
  • --ner_model_dir (Trained NER model directory)

(the first job must finish before running the second.)

Usage

By default, the service runs at http://127.0.0.1:8080.

To perform named entity recognition with the service, send a request with the following parameters:

  • text (required): the text to be annotated with named entities
  • format (optional): the format in which the results are returned. The service currently supports only json and raw output. Set this parameter to 'json' to get JSON output; if it is omitted, the results are returned in raw format.

The service also supports POST requests.
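The requests can also be prepared from Python. The following is a minimal sketch using only the standard library: it builds a percent-encoded GET URL with the text and format parameters, and a POST request that carries the text as a plain-text body with the same Content-type header as this README's curl example. The helper names build_get_url and build_post_request are hypothetical, and nothing is sent until the request is passed to urllib.request.urlopen against a running service.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8080"  # default service address from this README


def build_get_url(text, fmt=None):
    """Return a GET URL with percent-encoded query parameters."""
    params = {"text": text}
    if fmt is not None:
        params["format"] = fmt
    # quote_via=quote encodes spaces as %20 instead of "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def build_post_request(text):
    """Return a POST request carrying the text as a UTF-8 plain-text body."""
    return Request(
        BASE_URL,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


url = build_get_url("Presidentti Tarja Halosen", fmt="json")
req = build_post_request("Presidentti Tarja Halosen")
print(url)
print(req.get_method(), req.full_url)

# Sending requires a running service, for example:
# from urllib.request import urlopen
# print(urlopen(url).read().decode("utf-8"))
```

Encoding the query string matters for Finnish text, since characters such as ä and spaces are not valid in a raw URL.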

Example requests and their outputs

Request where output is given in raw format

Request:

 http://127.0.0.1:8080?text=Presidentti Tarja Halosen elämän ääniraitaan mahtuu muistoja työskentelystä Englannissa, Tapio Rautavaaran Halosen äidille kohdistamista kosiskeluyrityksistä, sekä omista häistään.

Or as a POST request:

curl -H "Content-type: text/plain; charset=utf-8" -d "Presidentti Tarja Halosen elämän ääniraitaan mahtuu muistoja työskentelystä Englannissa, Tapio Rautavaaran Halosen äidille kohdistamista kosiskeluyrityksistä, sekä omista häistään." http://127.0.0.1:8080

Output:

Presidentti	O
Tarja	B-PER
Halosen	I-PER
elämän	O
ääniraitaan	O
mahtuu	O
muistoja	O
työskentelystä	O
Englannissa	B-LOC
,	O
Tapio	B-PER
Rautavaaran	I-PER
Halosen	B-PER
äidille	O
kohdistamista	O
kosiskeluyrityksistä	O
,	O
sekä	O
omista	O
häistään	O
.	O
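The raw output uses B-/I-/O tags, so consecutive tokens have to be grouped to recover whole entities. The sketch below shows one way to do that grouping; the function name bio_to_entities is hypothetical, and joining tokens with a single space is an assumption that may not reproduce the original surface text exactly.

```python
def bio_to_entities(raw_output):
    """Group B-/I- tagged tokens from the raw output into (text, type) pairs."""
    entities = []
    tokens, etype = [], None

    def flush():
        # Close the entity that is currently open, if any.
        nonlocal tokens, etype
        if tokens:
            entities.append((" ".join(tokens), etype))
        tokens, etype = [], None

    for line in raw_output.splitlines():
        if not line.strip():
            flush()
            continue
        token, tag = line.split("\t")
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            # An I- tag of the same type continues the open entity.
            tokens.append(token)
        else:
            # "O", or an I- tag that does not continue an open entity.
            flush()
    flush()
    return entities


raw = (
    "Presidentti\tO\nTarja\tB-PER\nHalosen\tI-PER\nelämän\tO\n"
    "Englannissa\tB-LOC\n,\tO\nTapio\tB-PER\nRautavaaran\tI-PER\n"
    "Halosen\tB-PER\näidille\tO\n"
)
entities = bio_to_entities(raw)
print(entities)
```

Note that "Rautavaaran" followed by "Halosen" yields two separate PER entities because the second token carries a B- tag, which matches the JSON output shown in this README.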

Request where output is given in JSON format

In this output format, the JSON contains the named entities, their types, and their locations in the given text.

Request:

 http://127.0.0.1:8080?text=Presidentti Tarja Halosen elämän ääniraitaan mahtuu muistoja työskentelystä Englannissa, Tapio Rautavaaran Halosen äidille kohdistamista kosiskeluyrityksistä, sekä omista häistään.&format=json

Output:

[
    {
        "end": 25,
        "start": 12,
        "text": "Tarja Halosen",
        "type": "PER"
    },
    {
        "end": 87,
        "start": 76,
        "text": "Englannissa",
        "type": "LOC"
    },
    {
        "end": 106,
        "start": 89,
        "text": "Tapio Rautavaaran",
        "type": "PER"
    },
    {
        "end": 114,
        "start": 107,
        "text": "Halosen",
        "type": "PER"
    }
]
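Judging from the sample output above, start appears to be a 0-based character offset and end appears to be exclusive, so the offsets can be used directly as Python slice indices. The sketch below checks this against the sample and uses the offsets to mark entities inline. The annotate helper and its bracket notation are hypothetical, and the NFC normalization is a precaution added here, not something the service is known to require.

```python
import json
import unicodedata

text = unicodedata.normalize(
    "NFC",
    "Presidentti Tarja Halosen elämän ääniraitaan mahtuu muistoja "
    "työskentelystä Englannissa, Tapio Rautavaaran Halosen äidille "
    "kohdistamista kosiskeluyrityksistä, sekä omista häistään.",
)

# The JSON output shown above, pasted in as a string.
response = """[
    {"end": 25, "start": 12, "text": "Tarja Halosen", "type": "PER"},
    {"end": 87, "start": 76, "text": "Englannissa", "type": "LOC"},
    {"end": 106, "start": 89, "text": "Tapio Rautavaaran", "type": "PER"},
    {"end": 114, "start": 107, "text": "Halosen", "type": "PER"}
]"""
entities = json.loads(response)

# Each offset pair should slice out exactly the reported entity text.
for ent in entities:
    span = text[ent["start"]:ent["end"]]
    print(ent["type"], repr(span), span == ent["text"])


def annotate(text, entities):
    """Wrap each entity span as [TYPE span], editing from right to left."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, end = ent["start"], ent["end"]
        out = out[:start] + "[" + ent["type"] + " " + out[start:end] + "]" + out[end:]
    return out


annotated = annotate(text, entities)
print(annotated)
```

Working from the last entity to the first keeps the earlier offsets valid while the string grows.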

Docker

Before building the Docker container image, make sure the bert submodule has been fetched:

git submodule init
git submodule update

Option 1: self-contained Docker image including language models and the NER model trained on the combined FiNER news and Turku NER corpus

Build: docker build -f Dockerfile.self-contained -t pertti-self-contained .

Run: docker run -it --rm -p 5000:5000 --name pertti pertti-self-contained

Option 2: smaller Docker image without pretrained language and NER models

To run the container, you need to have the language models and a NER model on the host machine and pass them to the container as a bind mount or a volume.

E.g. download and unpack the following model distributions and create the following directory structure for them on mount/volume:

You can run the script ./get-models.sh to download the models (language models and the combined NER model based on FiNER news and Turku NER corpus) into the models directory.

Build: docker build -t pertti .

Run: docker run -it --rm -p 5000:5000 --mount type=bind,source="$(pwd)"/models,target=/app/models -e NER_MODEL_DIR=/app/models/combined-ext-model --name pertti pertti

The service listens on http://localhost:5000

Train NER model

To train a NER model:

Build:

docker build -f Dockerfile.train -t pertti-train .

Run:

E.g. train a model using FiNER news corpus:

mkdir finer-ner-model
docker run -it --rm --cpus=4 --mount type=bind,source="$(pwd)"/finer-ner-model,target=/app/finer-news-model --name pertti-train pertti-train /bin/bash -c "./scripts/get-models.sh && ./scripts/get-finer.sh && ./scripts/run-finer-news.sh"

E.g. train a model using Turku NER corpus:

mkdir ner-models
docker run -it --rm --cpus=4 --mount type=bind,source="$(pwd)"/ner-models,target=/app/ner-models --name pertti-train pertti-train /bin/bash -c "./scripts/get-models.sh && ./scripts/get-turku-ner.sh && ./scripts/run-turku-ner.sh"
mv ner-models/turku-ner-model .

E.g. train a model using combined FiNER news and Turku NER corpus:

mkdir combined-ner-model
docker run -it --rm --cpus=4 --mount type=bind,source="$(pwd)"/combined-ner-model,target=/app/combined-model --name pertti-train pertti-train /bin/bash -c "./scripts/get-models.sh && ./scripts/get-combined.sh && ./scripts/run-combined.sh"

You can also download the language models and NER model training data to your host machine and pass them to the container as a bind mount or a volume. In that case, you only need to run in the container the last command of the docker run examples above, e.g., ./scripts/run-combined.sh.