The KnOwledge graPh-augmented enTity lInking approaCh (OPTIC) addresses the disambiguation step of the Entity Linking (EL) task, based on a deep neural network and knowledge and word embeddings. Unlike other approaches, we train the knowledge and word embeddings simultaneously using the fastText model [2, 3].
The approach was developed in Python 3.7 and also works on Python 3.6. The code depends on the following packages:
- pynif 0.1.4
- flask 1.1.1
- bcolz 1.2.1
- numpy 1.17.2
- torchvision 0.4.0
- pytorch 1.2.0
- fuzzysearch 0.6.2
- nltk 3.4.5
- elasticsearch 7.0.4
Moreover, OPTIC uses version 0.3.2 of the Tweet NLP tool [4, 5]. Apart from Tweet NLP, all remaining packages can be obtained via pip.
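For instance, the pinned dependencies can be installed in a single command (note that PyTorch is distributed on pip under the name torch, not pytorch):
pip install pynif==0.1.4 flask==1.1.1 bcolz==1.2.1 numpy==1.17.2 torchvision==0.4.0 torch==1.2.0 fuzzysearch==0.6.2 nltk==3.4.5 elasticsearch==7.0.4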
This repository contains a script that transforms the output of the fastText model, i.e., the embedding file in .vec format, into the format used by OPTIC. We assume that the knowledge and word embeddings have already been trained with fastText; in [1], we detail the process of training them jointly.
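As a rough illustration only (the actual joint training procedure is detailed in [1]; the corpus and output names below are placeholders), a standard fastText invocation that produces a .vec file looks like:
./fasttext skipgram -input joint_corpus.txt -output embeddings -dim 200
This writes embeddings.bin and embeddings.vec; the latter is the file the script below consumes.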
Before using the script, it is necessary to change the values of the variables EMBEDDING_DIM (line 8) and data_path (line 9) according to your needs. EMBEDDING_DIM is the dimension of the embeddings, while data_path is the path to the embeddings file.
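For illustration, the two variables might look like this (the values are placeholders for your setup):

```python
EMBEDDING_DIM = 200               # line 8: dimension of the trained embeddings
data_path = './embeddings.vec'    # line 9: path to the fastText .vec file
```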
To use the script, execute the command:
python3.7 prepare_embedding_matrix.py
The entity candidate index employed by OPTIC is the same index employed by AGDISTIS/MAG [6]. Please consult their wiki to learn how to recreate the index. Moreover, we provide our ElasticSearch mapping to help with replication.
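As a minimal sketch of how such an index can be queried with the elasticsearch package (the index and field names here are assumptions, not OPTIC's actual schema; see the provided mapping for that):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
response = es.search(index='candidates', body={
    'query': {
        'multi_match': {
            'query': 'Barcelona',
            # '^5' boosts exact label matches over analyzed ones
            'fields': ['label', 'label.exact^5'],
        }
    },
    'size': 100,  # cap on the number of candidate documents returned
})
for hit in response['hits']['hits']:
    print(hit['_id'], hit['_source'])
```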
The model folder contains the code to train our neural network model based on jointly trained knowledge and word embeddings. To train the model, execute the command:
python3.7 train.py
Moreover, it is necessary to declare the following arguments (an example invocation is shown after the list):
--batch Batch size
--output_dim Dimension of the output layer (always 1 for our model)
--embedding Dimension of the embeddings
--dropout Dropout value
--epoch Number of epochs to train
--hidden Number of cells per hidden layer
--layer Number of hidden layers
--datapath Path to the data files
--dataset Dataset name
--ws Size of the context window
--rank Flag to consider the popularity or not (not implemented yet; popularity is always considered)
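For illustration, a training run using values matching those reported in the Results section might look like this (the epoch count, data path, and dataset name are placeholders):
python3.7 train.py --batch 1 --output_dim 1 --embedding 200 --dropout 0.5 --epoch 10 --hidden 200 --layer 2 --datapath ./data --dataset NEEL2014 --ws 3 --rank 1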
Lastly, we use the following naming pattern for our train and test sets: train_DATASET.json and test_DATASET.json, where DATASET is the dataset name specified by the --dataset argument (e.g., train_NEEL2014.json and test_NEEL2014.json).
OPTIC can run either locally or as a Web Service. Example command to run it locally:
python3.7 run.py
Example commands to run it as a Web Service:
export FLASK_APP=run.py
flask run
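As a minimal sketch of calling the running service (the route and content type are assumptions, so check run.py for the actual endpoint OPTIC exposes; the requests package is used purely for illustration):

```python
import requests

# Read a NIF file with the entity mentions already recognized.
with open('document.ttl', 'rb') as f:
    nif_payload = f.read()

# POST it to the Flask service on its default address.
response = requests.post('http://127.0.0.1:5000/',
                         data=nif_payload,
                         headers={'Content-Type': 'application/x-turtle'})
print(response.text)  # the disambiguated NIF document
```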
OPTIC requires several arguments, detailed as follows:
General type arguments:
--mode Type of experiment that OPTIC will execute (a2kb, d2kb). Currently, only d2kb is supported
--input Folder path with the NIF files to be disambiguated
--data Path to other data files (such as the model)
--verbose Show progress messages (yes, no)
Neural Network type arguments:
--embed Dimension of the embeddings
--hidden Number of cells for hidden layer(s)
--layer Number of hidden layers
--dropout Dropout value
--batch Size of batch
--extra Flag for extra attributes in the Neural Network model (0 = None, 1 = popularity)
ElasticSearch type arguments:
--type Type of ElasticSearch query
--max Maximum number of documents returned by ElasticSearch queries
--boost Boost for the exact match in multi-match queries
Disambiguation specific arguments:
--threshold Minimum score threshold for the disambiguation step
--ws Size of the window context
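Putting these together, a local run with the parameter values reported in the Results section might look like the following (the folder paths are placeholders):
python3.7 run.py --mode d2kb --input ./nif_input --data ./data --verbose yes --embed 200 --hidden 200 --layer 2 --dropout 0.5 --batch 1 --extra 1 --type multi --max 100 --boost 5 --threshold 0.7 --ws 3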
These arguments can also be declared in a configuration file. An example of such a file is available in this repository.
The input of OPTIC is a microblog text with the named entity mentions already recognized, provided as a file following the NIF standard. The output is the same microblog text with the named entity mentions disambiguated, also as a file following the NIF standard.
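For illustration, such an input can be built with the pynif package from the dependency list (the URIs, text, and offsets below are placeholders):

```python
from pynif import NIFCollection

# Build a NIF collection holding one microblog text (the context).
collection = NIFCollection(uri='http://example.org/tweets')
context = collection.add_context(
    uri='http://example.org/tweets/42',
    mention='Messi joins Barcelona today!')

# Mark the already-recognized entity mentions by character offsets.
context.add_phrase(beginIndex=0, endIndex=5)    # "Messi"
context.add_phrase(beginIndex=12, endIndex=21)  # "Barcelona"

# Serialize to Turtle, one of the NIF serializations.
with open('document.ttl', 'w') as f:
    f.write(collection.dumps(format='turtle'))
```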
When running locally, OPTIC disambiguates all NIF files contained in the specified folder. When running as a Web Service, each NIF file must be passed individually.
To submit bugs and feature requests, please use the project issues page.
The table below presents the results of OPTIC on the GERBIL benchmark system compared with other EL approaches. The datasets NEEL2014, NEEL2015, and NEEL2016 correspond, respectively, to the datasets microposts2014-Test, microposts2015-Test, and microposts2016-Test available on GERBIL. The ERR value indicates that the annotator caused too many single errors on GERBIL.
F1@Micro | NEEL2014 | NEEL2015 | NEEL2016 |
---|---|---|---|
ADEL | 0.591 | 0.783 | 0.801 |
AGDISTIS/MAG | 0.497 | 0.719 | 0.616 |
AIDA | 0.412 | 0.414 | 0.183 |
Babelfy | 0.475 | 0.341 | 0.157 |
DBpedia Spotlight | 0.452 | ERR | ERR |
FOX | 0.252 | 0.311 | 0.068 |
FREME NER | 0.419 | 0.313 | 0.162 |
OpenTapioca | 0.215 | 0.259 | 0.053 |
OPTIC | 0.2906 | 0.3362 | 0.5089 |
The OPTIC results were obtained by using the following parameter values:
- embed=200
- hidden=200
- layer=2
- dropout=0.5
- batch_size=1
- extra=1
- type=multi
- max=100
- boost=5
- threshold=0.7
- ws=3
The trained OPTIC model can be obtained here.
[1] Italo Lopes Oliveira, Luís Paulo Faina Garcia, Diego Moussallem and Renato Fileto. (2020). OPTIC: KnOwledge graPh-augmented enTity lInking approaCh. To be published.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.
[3] Armand Joulin, Edouard Grave, Piotr Bojanowski, Maximilian Nickel and Tomas Mikolov. (2017). Fast Linear Model for Knowledge Graph Embeddings. arXiv preprint.
[4] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan and Noah A. Smith. (2011). Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. Proceedings of ACL.
[5] Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and Noah A. Smith. (2013). Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters. Proceedings of NAACL.
[6] Diego Moussallem, Ricardo Usbeck, Michael Röder and Axel-Cyrille Ngonga Ngomo. (2017). MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach. Proceedings of the Knowledge Capture Conference.