diff --git a/content/post/project-ned-with-bert.md b/content/post/project-ned-with-bert.md new file mode 100644 index 0000000..fdbf342 --- /dev/null +++ b/content/post/project-ned-with-bert.md @@ -0,0 +1,985 @@ +--- +title: "Named Entity Disambiguation with BERT" +date: 2021-05-12T13:41:30+02:00 +author: "Amund Faller Råheim" +authorAvatar: "img/ada.jpg" +tags: [Named Entity Disambiguation, Entity Linking, NED, NLP, Deep Learning, BERT] +categories: [] +image: "img/project-ned-with-bert/BERT_NED_banner.png" +draft: false +--- + +Large transformer networks such as BERT have led to recent advancements in the NLP field. The contextualized token embeddings that BERT produces should serve as good input to entity disambiguation, which benefits from context. This master project aims to use BERT on the task of Named Entity Disambiguation. + + +# Content +- [Introduction](#intro) +- [BERT](#bert) +- [NER, Candidate Generation and Knowledge Base](#components) +- [BERT NED](#bert-ned) +- [Evaluation](#evaluation) +- [Reproducing the Results](#reproduce) +- [Summary and Future Work](#summary) + + + +# 1. Introduction {#intro} + +## Named Entity Disambiguation (NED) + +Knowledge extraction from natural language texts such as web sites and research articles is an important task in the field of Natural Language Processing (NLP). One aspect of knowledge extraction from documents is Named Entity Disambiguation (NED), which is useful in applications such as search engines. + +For the sake of this article, **NED** [is defined](https://github.com/sebastianruder/NLP-progress/blob/master/english/entity_linking.md) as the task of finding the correct entity in a knowledge base for a mention of an entity in a document. A **named entity**, or **entity** for the sake of this article, is a unique object that can be referred to by a proper name. The **knowledge base** is the database of entities that we use. 
We refer to a **mention** as all the words that belong to a mentioned entity in a document. + +
+ +For example, the sentence "Paris Hilton is visiting Paris this weekend" contains mentions of two entities: "Paris Hilton" and "Paris". + +
+ +Finding these mentions is the task of **Named Entity Recognition (NER)**, and is often done before and independently of NED. + +After NER, it is common to reduce the search for possible entities to a relevant subset of the knowledge base. The list of possible entities for a given mention is referred to as the **candidates** for that mention. The task of finding these candidates is called **candidate generation**. After the candidate generation step, the NED task is to **find the correct candidate** for each mention. + +
+ +These three tasks — NER, candidate generation, and NED — are easily understood with an example. Consider the following two sentences: + +* Paris is the capital and most populous city of France. +* Paris is an American media personality. + +The NER system looks at these sentences and recognizes the mentions "Paris" and "France" in the first sentence, and "Paris" in the second. + +
+ +The candidate generation system takes **only the mention** as input and outputs all the entities in the knowledge base that could be referred to with that name. + + +
+ +Consider the two mentions of "Paris" in the first and second sentence of our example; the candidate generation system takes only the text "Paris" as input, and so generates the same list of candidates for both mentions. The candidate list may be something like this: + +* Paris, capital and largest city of France; +* Paris Hilton, American socialite and media personality; +* Sven Paris, Italian boxer; +* etc. + +
+ +When we look at the example sentences, we see from their contexts that they obviously refer to different entities of "Paris". To successfully find the right candidate in the NED step (i.e. to disambiguate), it is essential that we can use the context effectively. + + + +Natural language processing methods usually send the input text through a process of **word embedding**. Word embeddings are numerical vector representations of words. Embedding methods vary in how they use context to make these representations. + +[Word2vec](https://arxiv.org/abs/1301.3781) is a popular method that does not use context to represent individual words in a document. Both instances of "Paris", the city and the person, will have the same vector regardless of their contexts. + +In contrast to Word2vec, the neural network model "BERT" computes **highly contextualized** word embeddings. With word embeddings from BERT, the vector for the word "Paris" in the first sentence will have rich information about the rest of the sentence. So will the word "Paris" in the second context. This makes it possible to use BERT word vectors to reach our goal of distinguishing "Paris, France" from "Paris, the socialite" in the two example sentences. + + +## Addressing NED with a Neural Network + +In this article, we will arrive at a neural network model for NED which builds on the model known as [BERT](https://arxiv.org/abs/1810.04805v2). We rely on external libraries for NER and candidate generation. That way, the task of our model is reduced to **picking one of the proposed candidates** for a given mention. + +We formulate the problem as a series of binary classifications (i.e. classification with only two target classes). For a mention with multiple candidates, we look at **each candidate separately**. For each mention-candidate-pair, we make a prediction to answer the *True or False* question "is this the correct candidate for this mention?". 
Finally, we choose the one candidate with the highest "True" prediction. + + + + +In the [next section](#bert), we will have a close look at how the BERT model works, which we will later use for the NED task. +In [Section 3](#components), we introduce the components we need to address the NER and candidate generation problems, which come prior to NED. +Then, in [Section 4](#bert-ned), we put the pieces together and show how we propose to solve NED with the model dubbed "BERT NED", along with details on how to train this model. +[Section 5](#evaluation) is dedicated to the evaluation of the model, and its performance on various benchmark datasets. +For the curious reader, [Section 6](#reproduce) details how to reproduce the results from this article. +Finally, [Section 7](#summary) gives a short summary of the achievements presented in this article. + + +# 2. BERT {#bert} + +The neural network architecture called BERT (Devlin et al.) was introduced in 2018, and has become the state-of-the-art in many Natural Language Processing (NLP) tasks. In particular, BERT is interesting because of how well the **pre-trained model generalizes** to a wide array of NLP tasks. BERT also **uses context** very effectively to represent all the individual word tokens. As we have established, using the context of an entity is essential to solving the NED problem. This is exactly the observation that motivates us to use BERT for NED. + +## Pre-trained BERT + +BERT is pre-trained to learn a good **representation of language** before it is applied to any specific language tasks. In fact, BERT is pre-trained on two tasks ("Masked Language Model" and "Next Sentence Prediction"). + +
+ + Further reading: About the two pre-training tasks. + + +The two self-supervised tasks are dubbed "Masked Language Model" and "Next Sentence Prediction". Masked Language Model is a **token-level task**, where BERT predicts missing tokens in the input text. The tokens are replaced by a 'MASK' token. + +The Next Sentence Prediction task is a binary prediction task. Two sequences of text are either sampled sequentially from the same document, or are randomly sampled from different documents. The task is to classify whether they follow each other in the same document. This is a **sequence-level task**, which allows BERT to learn about higher level language contexts across two sentences or sequences of text. + +The two tasks are learned jointly. In practice, that means they are combined in the same loss function (by summation), and trained at the same time. Both the tasks are **"self-supervised"**, which means that they train on unlabelled data by generating their own labels. The masking positions for the Masked Language Model task and the two sentences for the Next Sentence Prediction task are randomly sampled, and the label is known. Because the generation of training data is automated, it is easy to get a lot of training data. + +
+ +The BERT model, after being trained on these two tasks, already performs well on many NLP tasks. We will later see that this includes our application to NED. If we choose to also **"fine-tune"** the model by training parts or all of the BERT model, we can expect even better performance. + +## Tokenization + +BERT uses [WordPiece tokenizers](https://arxiv.org/pdf/1609.08144.pdf) to convert text to digestible input sequences. This particular tokenization scheme is good at dealing with rare words. It has a vocabulary of about 30,000 tokens, some of which are whole words, and some word pieces. + +
+ +For example, the word "gibberish" turns into three tokens: ['gi', '##bber', '##ish']. The latter two tokens are part of the same word as the previous token, as characterized by the prepended '##'. On the other hand, the word "Paris" is simply tokenized as ['paris'], meaning that this word is part of the vocabulary. + +
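The splitting into word pieces can be illustrated with a minimal sketch of greedy longest-match tokenization. The function name `wordpiece_tokenize` and the four-entry `TOY_VOCAB` are made up for this illustration; the real BERT vocabulary is far larger, and the real tokenizer also handles punctuation and whitespace.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    word = word.lower()  # the uncased model lower-cases its input
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the two examples.
TOY_VOCAB = {"paris", "gi", "##bber", "##ish"}

print(wordpiece_tokenize("gibberish", TOY_VOCAB))
print(wordpiece_tokenize("Paris", TOY_VOCAB))
```

With this toy vocabulary, "gibberish" is split into three pieces, while "Paris" is found as a whole word.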
+ +## Architecture + +BERT's architecture has **three distinct components**: the input layer, a stack of "encoders", and output layers. We will look at each of these in turn. + +The figure below shows the typical BERT architecture, with an input embedding layer, a stack of encoders and some output layers. + + + +When passing the sequence of tokens to BERT, the tokens are represented by unique IDs. In the **input layer** of BERT, each of these token IDs are mapped to an initial **embedding vector** representing that token, which is learned during training. The initial vectors are "static embeddings", meaning they do not have any information about the context. Each token is represented by an embedding vector of size 768. This vector size is also used between all the internal ("hidden") layers of the model. + +During pre-training, the BERT model has two **output layers**: one for the "Masked Language Model" task and one for the "Next Sentence Prediction" task. These are omitted after pre-training. Meanwhile, the rest of the network still has a good understanding of language from the pre-training phase. By appending new **task-specific output layers** to this architecture, the BERT model is ready to be trained for new tasks. + +The **encoders** between the input and output layer do all the heavy lifting. The first encoder takes the initial vector embedding from the input layer as input, and the subsequent encoders take the output of the previous encoder as input. If we have 512 input tokens, this is always a matrix of size 512 × 768, where 768 is the length of each token vector. + +Each encoder computes a **new vector representation** of the tokens from the vectors of the previous encoder. The new vector representation of a token is calculated using the vectors of all the other tokens. One can think of this as every token looking at every other token in the sequence to learn more about its own context. + +
+ + Further reading: Bi-directionality puts the 'B' in BERT. + + +The 'B' in BERT stands for "bi-directional" exactly because each token can look both in front of and behind itself. This type of bi-directionality is believed to be a key ingredient in BERT's success at context understanding, and is an important difference to recurrent neural network (RNN) approaches such as "ELMo". + +
+ +Though there are multiple encoders in BERT, they all have the same architecture. Most important is the so-called **"self-attention"** operation, of which there are multiple in parallel in each encoder. The parallel attention operations are called "attention heads". In short, self-attention is a weighted dot-product of the input tensor with itself. This is how each token can be represented by information from all other tokens. + +
+ +Take our two example sentences again: + +* Paris is the capital and most populous city of France. +* Paris is an American media personality. + +With the self-attention operation, the token for "Paris" in the first sentence will pick up information from all the other tokens in the sentence. In particular, the occurrence of "France" may be a strong hint that we are talking about the city rather than the media personality in the first sentence. Now we see why BERT is the master of context. + +
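To make the idea of tokens "looking at" each other concrete, here is a heavily simplified sketch of a single attention head in plain Python. It assumes identity projections: real BERT first multiplies the input by learned query, key and value matrices and runs several heads in parallel, all of which is left out here. The names `softmax` and `self_attention` are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention without learned weights.

    Each output vector is a weighted average of all input vectors, where
    the weights come from the dot products between the token's own vector
    and every vector in the sequence (including itself).
    """
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)  # how much this token attends to each token
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        )
    return outputs


# Three toy "token vectors" of length 2.
contextualized = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(contextualized)
```

Every output row mixes information from all input rows, which is the sense in which a token's new representation depends on its whole context.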
+ +The figure below shows an example of how the word "Paris" may attend to the other words in the sentence when using self-attention. The strength of the line shows how much "Paris" attends to that word. The occurrences of "capital", "city" and "France" in the context seem to be particularly interesting. + + + +In the BERT architecture we use here, there are twelve encoders, and twelve attention heads in parallel in each encoder. The vector length is 768 for each token through the network. This architecture gives the model a total of around **110 million weights**. This exact architecture is commonly known as "BERT Base". We are treating all input to BERT as lower case (uncased), giving us the final architecture name, "BERT Base Uncased". + +
+ + Further reading: Other BERT architectures. + + +The BERT architecture can be expanded to **different sizes**. Most notable are the BERT Base and BERT Large architectures. BERT architectures are defined by **three architectural hyperparameters**: the number of encoders L, the number of attention heads in each encoder A, and the size of the vectors in the hidden layers H. + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | L | A | H | Total parameters |
|---|---|---|---|---|
| BERT Base | 12 | 12 | 768 | 110 M |
| BERT Large | 24 | 16 | 1024 | 340 M |
+ + + +
+ +
+ + Further reading: Special BERT tokens reveal some BERT secrets. + + +There are three **tokens** used in the BERT architecture that are worth remarking. We refer to them as **'CLS', 'SEP' and 'PAD'**, and they each teach us something about the internals of BERT. + +The **'CLS'** ("classification") token is used during pre-training as the only input token to the Next Sentence Prediction classification layer. That means this token needs to store a lot of information on the coherence of the input sentences. That makes this token useful for classification tasks that relate to the whole input, or a contrast between two sequences. + +The **'SEP'** ("separator") token is used during pre-training to separate the two sentences for the Next Sentence Prediction task. It becomes important for any tasks where we have two sequences as input. + + +The **'PAD'** ("padding") token is appended to sequences that are shorter than the input sequence length of 512 tokens. This is simply because BERT always expects inputs to have the same length of 512 tokens. +
+ +
+ + Further reading: Three input vectors for BERT. + + +BERT requires **three input vectors** in total. They all have the same length. First, we have the tokenized input text sequence previously discussed. + +The second vector is important for tasks with two sequences. It is a binary vector with '0's in the position of tokens in the first sequence, and '1's in the position of the second sequence in the tokenized input token. + +The third vector is also a binary vector, and simply contains '1's where there are tokens in the tokenized input sequence, except for '0's where there are padding tokens. +
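The three vectors can be sketched as follows. This is an illustration rather than the code used in the project: the function name `build_bert_inputs` is hypothetical, tokens are kept as strings instead of vocabulary IDs, and giving padding positions the segment value 0 is an assumption.

```python
def build_bert_inputs(tokens_a, tokens_b, max_len=512):
    """Return (tokens, segment_ids, attention_mask), each of length max_len."""
    tokens = ["CLS"] + tokens_a + ["SEP"] + tokens_b + ["SEP"]
    if len(tokens) > max_len:
        raise ValueError("sequences must be truncated before building inputs")

    # 0 for 'CLS', Sequence A and its 'SEP'; 1 for Sequence B and its 'SEP'.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 wherever there is a real token.
    attention_mask = [1] * len(tokens)

    # Pad all three vectors to the fixed input length.
    n_pad = max_len - len(tokens)
    tokens = tokens + ["PAD"] * n_pad
    segment_ids = segment_ids + [0] * n_pad  # assumed value for padding
    attention_mask = attention_mask + [0] * n_pad
    return tokens, segment_ids, attention_mask


# A short example with a small max_len so the result is easy to read.
toks, segs, mask = build_bert_inputs(["paris", "is"], ["paris", "hilton"], max_len=12)
print(toks)
print(segs)
print(mask)
```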
+ + +# 3. NER, Candidate Generation and Knowledge Base {#components} + +To lay the groundwork for entity disambiguation, we need to set up a full system with **Named Entity Recognition** and **candidate generation**. The [spaCy](https://spacy.io/api) library provides us with the means to perform these tasks. + + + +## NER + +The first step of analysing an input document is NER. We use a spaCy [language model](https://spacy.io/api/language) for the task. Specifically, we use the '[en_core_web_lg](https://spacy.io/models/en#en_core_web_lg)' language model to process an input document and tag words as named entities. + +
+ + Further reading: The 'en_core_web_lg' language model. + + +The specific language model we use, 'en_core_web_lg', is around 742 MB in size. It can perform many NLP tasks, but we only need it for NER. On the [models overview](https://spacy.io/models/en), spaCy reports a precision of 86 % and recall of 85 % on named entities for this model. +
+ +
+ +We will use the following sentence as an example to see how it is handled by the different parts of the system. By convention, we will refer to this sentence as the **"input document"**: + +* Paris is the capital and most populous city of France. + +Let us say the spaCy language model recognizes two mentions of named entities in the example input document: + +* Paris is the capital and most populous city of France. + +These two **mentions**, "Paris" and "France", are what we use in the next component of the system. + +
+ +## Candidate Generation and Knowledge Base + +The Named Entity Disambiguation task requires a **knowledge base** as a **source of entities**. In our case, we are using a set of 4,111,690 Wikidata entities as our knowledge base. + +All entities in Wikidata are ascribed a unique and persistent ID called "QID" (a number prepended with a "Q"). An entity in Wikidata may also have a number of aliases for that entity, which are useful when we are searching for candidates. + +
+ +The Wikidata entry for Paris, the French capital, is [Q90](https://www.wikidata.org/wiki/Q90), with aliases such as "City of Light" and "Paris, France". The socialite Paris Hilton has the QID [Q47899](https://www.wikidata.org/wiki/Q47899), and no aliases. + +
+ +We use a [spaCy KnowledgeBase object](https://spacy.io/api/kb) to store our knowledge base. The KnowledgeBase object stores all the entities from Wikidata, along with the aliases of each entity. + +The KnowledgeBase object conveniently has a function for **candidate generation** which generates lists of candidates from the mentions recognized by the language model. The KnowledgeBase object returns all entities where the entity name or alias matches the mention. + +
+ +Consider these two input documents where the NER tagger has found the two underscored mentions: + +* Paris is the capital and most populous city of France. +* Paris is an American media personality. + +When we send the mention "Paris" to the KnowledgeBase object it will return the **same list** of candidates for both these mentions, even if they are in two different contexts. For example, it might give us the following list: + +* ['Q90', 'Q47899', 'Q580498'] + +We see that both "Q90" (the city) and "Q47899" (the American socialite) are among the candidates. This means that the candidate generation has done its job, and it is possible to disambiguate both mentions of "Paris" to their respective correct candidates. + +
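The lookup behaviour described above can be mimicked with a small dictionary-based sketch. This is deliberately not the spaCy API: `ToyKnowledgeBase` and its methods are hypothetical, and the name lists below are invented so that the example reproduces the candidate list shown above.

```python
from collections import defaultdict


class ToyKnowledgeBase:
    """Maps entity names and aliases to lists of Wikidata QIDs."""

    def __init__(self):
        self._name_to_qids = defaultdict(list)

    def add_entity(self, qid, names):
        """Register an entity under each of its names and aliases."""
        for name in names:
            if qid not in self._name_to_qids[name]:
                self._name_to_qids[name].append(qid)

    def get_candidates(self, mention):
        """Return all QIDs whose name or alias exactly matches the mention.

        Only the mention text is used, so the context plays no role here.
        """
        return list(self._name_to_qids.get(mention, []))


kb = ToyKnowledgeBase()
# Invented name lists, for illustration only.
kb.add_entity("Q90", ["Paris", "City of Light", "Paris, France"])
kb.add_entity("Q47899", ["Paris Hilton", "Paris"])
kb.add_entity("Q580498", ["Sven Paris", "Paris"])

print(kb.get_candidates("Paris"))
```

Both example sentences send the same string "Paris" to the lookup, so both mentions receive the same candidate list.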
+ +# 4. BERT NED {#bert-ned} + +After NER and candidate generation, the final step is to disambiguate mentions to one of their candidates. This is where we introduce our own **"BERT NED"** model. + +## Representation of the Mention and the Candidate + +In the input layer, BERT expects two concatenated sequences of tokens with a total of 512 tokens. We tokenize the input document as the first sequence (Sequence A). The input document has the context for the mention. As the second sequence (Sequence B), we want a text that puts the candidate in context. To this end, we use the **Wikipedia abstract** of the candidate. + +The maximum sequence length of 512 tokens is **shared** between the input document and the Wikipedia abstract. We simply take the 256 first tokens of the tokenized Wikipedia abstract. If the abstract is shorter than 256 tokens, we include more of the input document. Longer tokenized input documents are cut to length to fill up the remaining space. We make sure to keep the part where the mention occurs, even if that means missing tokens in the beginning of the input document. If necessary, the sequences are padded with 'PAD' tokens at the end to reach the required length of 512 tokens. + +
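How the 512 positions might be shared between the two sequences is sketched below. Several details are assumptions, not taken from the project code: that three positions are reserved for 'CLS' and the two 'SEP' tokens, that the 256-token limit applies to the abstract tokens alone, and that a too-long document is cut to a window ending at the mention. The function names are hypothetical.

```python
def split_budget(n_doc, n_abstract, max_len=512, max_abstract=256, n_special=3):
    """Return how many tokens Sequence A and Sequence B may contain."""
    n_b = min(n_abstract, max_abstract)          # at most the first 256 abstract tokens
    n_a = min(n_doc, max_len - n_special - n_b)  # the document fills what is left
    return n_a, n_b


def window_around_mention(doc_tokens, mention_end, budget):
    """Cut the document to `budget` tokens while keeping the mention.

    `mention_end` is the index just past the last token of the mention.
    If the mention lies beyond the first `budget` tokens, leading tokens
    are dropped so that the window ends at the mention.
    """
    if len(doc_tokens) <= budget:
        return doc_tokens
    start = 0 if mention_end <= budget else mention_end - budget
    return doc_tokens[start:start + budget]


n_a, n_b = split_budget(n_doc=1000, n_abstract=300)
doc = list(range(1000))  # stand-in for 1000 document tokens
window = window_around_mention(doc, mention_end=602, budget=n_a)
print(n_a, n_b, window[0], window[-1])
```

A short abstract leaves more room for the document: with a 100-token abstract, the same 1000-token document would keep 409 tokens under these assumptions.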
+ +We look at the example input document "Paris is the capital and most populous city of France." +Three candidates were found for the mention "Paris" by the candidate generation: + +* ['Q90', 'Q47899', 'Q580498'] + +Each of the three candidates requires an input sequence of Sequence A + Sequence B. Sequence A, which comes from tokenizing the input document, is **common for all** the three candidate's input sequences: + +* Sequence A: +
'CLS', 'paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', '.', 'SEP' + +The initial 'CLS' token is always prepended to BERT input sequences, and the final 'SEP' token says that this is the end of Sequence A. + +Sequence B, which is a chunk of the candidate's Wikipedia abstract, is **unique for each** of the three candidates: + +* Sequence B: + + 1. Candidate Q90: +
+'paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', ',', 'with', 'an', 'estimated', 'population', 'of', '2', ',', '175', ',', '60', '##1', 'residents', 'SEP' + + + 2. Candidate Q47899: +
+'paris', 'whitney', 'hilton', 'is', 'an', 'american', 'media', 'personality', ',', 'social', '##ite', ',', 'business', '##woman', ',', 'model', ',', 'singer', ',', 'actress', ',', 'and', 'dj', 'SEP' + + + 3. Candidate Q580498: +
+'sven', 'paris', 'is', 'an', 'italian', 'amateur', 'boxer', 'who', 'competed', 'in', 'the', 'light', 'welterweight', 'division', 'at', 'the', '2000', 'summer', 'olympics', 'SEP' + + +
+ + + + +To sum up, we represent the **mention with the input document** and each **candidate with their Wikipedia Abstracts**. After generating the input vectors for each candidate, we are ready to feed them to the BERT NED model for disambiguation. + +## NED with BERT + +The **task of the BERT NED model** is to correctly classify if the input document in Sequence A and the candidate abstract in Sequence B are talking about the same entity. In other words, we have cast the problem as a binary classification task, with one data point for each candidate. + +When we pass an input sequence for a mention-candidate pair to BERT, the encoders sequentially compute new token vector embeddings. We use the embeddings generated in the final BERT encoder as input to a "classification module". The figure below shows the general architecture of BERT NED. + + + +The task of the **classification module** is to predict whether it thinks Sequence B is the Wikipedia abstract of the same entity that is mentioned in Sequence A. Before the classification module, BERT makes a representation of the tokens with rich contextual information to make the classification task easier. + +Note that the model does not consider the input sequences for a list of candidates from the same mention to belong together. For each candidate's input sequence, it simply outputs a classification of that candidate. + +We **rank the candidates** by the output prediction numbers, and if the highest ranked candidate is above a certain threshold, this is our final candidate. If it is below the threshold, we assume that the correct entity was not in the list of candidates, and predict none of the candidates. For now, we set the threshold to zero. In other words, a positive output number predicts the candidate to be correct. A negative number means the model predicts the candidate to be wrong. + +
+ +We return to our example with three candidates for the mention "Paris". We pass each pair of Sequence A (from the input document) and Sequence B (from each of the candidates) to BERT NED, and get **one output prediction for each candidate**. We rank the candidates by the prediction number and get a list like this: + + + + + + + + + + + + + + + + + + + + + + + + +
| Rank | Candidate ID | Prediction |
|---|---|---|
| 1. | 'Q90' | 9.75 |
| 2. | 'Q47899' | 0.64 |
| 3. | 'Q580498' | -9.57 |
+ + + +In this case, the model is most confident about the first candidate. Even if the second candidate gets a positive value and is above the threshold, we pick "Q90" to be the final prediction. Which also turns out to be correct! + +
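The ranking and thresholding step can be written as a short sketch. The function name `pick_candidate` is hypothetical, and treating "above the threshold" as a strict inequality is an assumption.

```python
def pick_candidate(scores, threshold=0.0):
    """Return the QID with the highest prediction, or None.

    `scores` maps candidate QIDs to the model's output number. None is
    returned when there are no candidates or when even the best
    candidate does not exceed the threshold.
    """
    if not scores:
        return None
    best_qid, best_score = max(scores.items(), key=lambda item: item[1])
    return best_qid if best_score > threshold else None


# The three predictions from the example above.
predictions = {"Q90": 9.75, "Q47899": 0.64, "Q580498": -9.57}
print(pick_candidate(predictions))
```

If all candidates had received negative predictions, the function would return `None`, corresponding to the case where none of the candidates is predicted.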
+ +## The CoNLL Dataset + +We are almost ready to train some real models, but first let us have a look at the main dataset used for training and validation. [AIDA CoNLL-YAGO](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/ambiverse-nlu/aida/downloads/) (CoNLL) is a dataset of news articles from Reuters, annotated with entities from the YAGO2 knowledge base and links to Wikipedia articles. As we are using Wikidata as a knowledge base, we use a mapping from Wikipedia to Wikidata to get unique Wikidata IDs. + +
+ +To illustrate the annotation of this dataset, let us consider an example sentence. If a document in the dataset contains the sentence "Paris is the capital of France", the word "Paris" will be linked to the Wikipedia article with the URL http://en.wikipedia.org/wiki/Paris, and the same for "France". Using the mapping from Wikipedia to Wikidata, we end up with the Wikidata QID "Q90". + +
+ +Some entity mentions may not be linked to entities in the knowledge base. We assume that these have not been annotated because the entities they refer to are not in the knowledge base. Because they do not have a label, we ignore them during training. + +
+ + Further reading: Dataset split and mention statistics + + +We use the **official split** of the dataset, with the first 946 documents as training data, the next 216 as a validation set (dubbed "test-a"), and the final 231 documents as a test set (dubbed "test-b"). More details on the dataset can be seen in the table below. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Dataset | Articles | Mentions | Labelled mentions |
|---|---|---|---|
| Training | 946 | 23396 | 18330 |
| Validation | 216 | 5917 | 4752 |
| Testing | 231 | 5616 | 4452 |
| Total | 1393 | 34929 | 27534 |
+ + +
+ + +## Prototyped Models {#models} + +The archetypical BERT NED model consists firstly of a BERT network, which outputs an embedding for each token in the final layer, and secondly a **fully connected classification network** which takes the token embeddings as input. + +In order to **explore different architectures** of the classification layers, we have evaluated multiple architectures on a smaller part of the CoNLL training dataset. For convenience, this smaller dataset was roughly balanced to have at most two candidates for each mention: (1) the correct (ground truth) candidate when available (positive example), and (2) one other candidate (negative example). This gives around 33,000 candidate data points. + +When training the different prototype architectures, the weights of the BERT network (the first part of the model) were frozen and shared between all prototypes. In other words, we did not *fine-tune* any part of the BERT network. It may seem like a big disadvantage to keep such a large part of the network fixed, but because the initial pre-trained BERT already has a good language understanding, we can still [expect good results](https://arxiv.org/pdf/1903.05987.pdf). Furthermore, we are mainly interested in seeing the performance difference between different classification architectures, and by training only a smaller part of the network we can speed up training a lot. + +Each of the prototype architectures was trained for up to five epochs, stopping earlier if the accuracy did not improve from one epoch to the next ("early stopping"). Each epoch took around 18 minutes on Google Colab on a single NVIDIA Tesla T4 GPU. + +### Factors of the Classification Architecture + +The prototyped classification architectures vary in **two distinct ways**: +1. whether they have **one or two classification layers**, and +2. which token embeddings from BERT are used as **input** to these layers. + +Let us take a look at the input we can send to the classification network. 
+ +When BERT is pretrained, the special **'CLS' token** is used as the only input token to the "Next Sentence Prediction" task. This task is a binary classification of the whole input sequence (answering the question "does Sequence B follow Sequence A in a document?"). Our formulation of the NED task is also a binary classification of sequences, so the 'CLS' token may prove useful. Using fewer tokens as input can allow us to have a smaller classification architecture. + +We prototype two models with only the 'CLS' token as input to classification. These are Model 1 and Model 2 in the table below, with one and two classification layers respectively. + +Instead of using only the 'CLS' token as input, we can also envision a model using **all the 512 token embeddings** from BERT. This greatly increases the number of weights in the classification module, particularly when using two layers. With 512 tokens of length 768, this yields around four hundred thousand input values to the classification layers. Compare this with 768 when using only the 'CLS' token. + +In the table below, Model 3 and Model 4 use all the 512 token embeddings as input. Model 4, with two fully connected layers, has around 300 million additional weights! + +A third way is to use the **'CLS' token along with some other hand picked tokens**. Specifically, we look at using one token corresponding to the mention and one token corresponding to the candidate. This gives only a moderate increase in the number of weights. For three tokens, we get around two thousand input values to the classification layers. + +During the forward pass, we need to distinguish these tokens in the sequence of 512 tokens. For the sake of speed and ease of implementation, we make sure the mention and candidate tokens always **appear in the same place**. To achieve this, we place the mention right after the leading 'CLS' token at the start of Sequence A. 
We also place the title of the candidate's Wikipedia article before its Wikipedia abstract at the start of Sequence B. + +
+ +If we look at our example from before, the input tokens for candidate Q90 now look like this: + + +* ['CLS', 'paris', 'paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', '.', 'SEP' +'paris', 'paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city', 'of', 'france', ',', 'with', 'an', 'estimated', 'population', 'of', '2', ',', '175', ',', '60', '##1', 'residents', 'SEP'] + +Note the two extra 'paris' tokens: the mention right after 'CLS', and the candidate title right after the first 'SEP'. + +
+ +Because mentions and candidate titles are frequently represented by multiple tokens, we always take the embedding of the **first of these tokens** as input to the classification layers. + +Model 5 in the table below uses this third option, with three input token embeddings feeding two fully connected classification layers. + + +### Prototype Results + +After training on the smaller dataset, we test the models on the full CoNLL test set. The results of the five prototyped models can be seen in the table below. + +
| Model # | Classif. input | Classif. layers | Classif. parameters (excl. bias) | Accuracy* |
|---|---|---|---|---|
| 1 | 'CLS' token | One output layer | 768 | 59.55 % |
| 2 | 'CLS' token | One hidden layer, one output layer | 768 × 768 + 768 = 590,592 | 86.02 % |
| 3 | All 512 tokens | One output layer | 512 × 768 + 768 = 393,984 | 85.16 % |
| 4 | All 512 tokens | One hidden layer, one output layer | 512 × 768 × 768 + 768 = 301,990,656 | 95.76 % |
| 5 | 'CLS' token + Mention token + Candidate token | One hidden layer, one output layer | 3 × 768 × 768 + 768 = 1,770,240 | 95.36 % |
+ + + +Model 1, with just a few hundred trainable parameters, already performs better than guessing. This indicates that the pre-trained BERT network we are using already encodes a useful representation of the data in the 'CLS' token. This is confirmed by the fact that Model 2 improves the accuracy by 26 percentage points while still using only the 'CLS' token. + +Comparing Model 1 to Model 2 and Model 3 to Model 4, we see that the extra hidden layer gives significant performance boosts. This is not surprising, seeing as it leads to a huge increase in the models' representational capacity. We know from the [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) that a network with one sufficiently wide hidden layer can approximate any continuous function arbitrarily well. + +It is worth noting that Model 3 with around four hundred thousand parameters performs on par with Model 2 with around six hundred thousand parameters and a hidden layer. Clearly, the 'CLS' token embedding from the pre-trained BERT does not contain all the information necessary to perform well on the task. This could of course change if we also fine-tuned the BERT network. + +For these five models, more parameters are consistently better. However, when we compare the enormous Model 4 to the moderate Model 5, we see a limit to increasing model size: Model 4 has around 170 times as many classification parameters as Model 5, but only performs slightly better. + +We conclude that **Model 5 shows the best potential**, and continue with that architecture from now on. To sum up Model 5: + +1. We prepend the mention to Sequence A and the candidate name to Sequence B of the input; +2. we use the final encoder's embedding of the 'CLS' token, the first mention token, and the first candidate token as input to the classification layers; and +3. we use a hidden layer with input size 3 × 768 and output 768, and an output layer with 768 input and one output. 
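The three steps of Model 5 can be sketched in plain Python: where the three input embeddings are read from, and how many weights the two classification layers need. The function names are hypothetical, and locating the candidate token as the position right after the first 'SEP' is an assumption that follows from the input layout described above.

```python
HIDDEN_SIZE = 768  # length of each BERT Base token embedding


def model5_input_positions(tokens):
    """Return the positions of the three embeddings fed to the classifier."""
    cls_pos = 0                               # 'CLS' always comes first
    mention_pos = 1                           # mention is prepended to Sequence A
    candidate_pos = tokens.index("SEP") + 1   # title is prepended to Sequence B
    return cls_pos, mention_pos, candidate_pos


def head_weight_count(n_input_tokens, hidden=HIDDEN_SIZE):
    """Weights (excluding biases) of one hidden layer plus one output layer."""
    hidden_layer = n_input_tokens * hidden * hidden  # e.g. (3 x 768) -> 768
    output_layer = hidden * 1                        # 768 -> one output
    return hidden_layer + output_layer


example = ["CLS", "paris", "paris", "is", "the", "capital", "SEP",
           "paris", "paris", "is", "the", "capital", "SEP"]
print(model5_input_positions(example))
print(head_weight_count(3))
```

The same count with a single input token gives the number listed for Model 2, which is a quick consistency check against the results table.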
+ + +## Fine-tuning + +Continuing with Model 5, which is the model we dub "BERT NED", we can now look at how fine-tuning parts of the BERT model improves our performance. As a reminder, the two classification layers have around 1.8 million trainable weights. Each BERT encoder has around 7.1 million trainable weights. We **unfreeze four encoders**, for a total of around **30 million trainable weights**. + +
+Further reading: Computing the number of weights in an encoder. +The self-attention mechanism in each encoder requires four matrices. They are dubbed "Key", "Query", "Value" and "Output", and are all 768 × 768. Following the attention are two fully connected layers that give the final output vector. The first layer has dimension 768 × 3072 and the final layer 3072 × 768. The total number of trainable parameters (excluding around seven thousand bias parameters) is: + +\\[4 \times 768 \times 768 + 768 \times 3072 + 3072 \times 768 = 7,077,888\\] + +
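The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for the example, the bias count assumes one bias vector per matrix, and layer normalization parameters are ignored.

```python
def encoder_weight_count(hidden=768, feed_forward=3072):
    """Weights in one encoder: four attention matrices and two dense layers."""
    attention = 4 * hidden * hidden  # Key, Query, Value, Output
    dense = hidden * feed_forward + feed_forward * hidden
    return attention + dense


def encoder_bias_count(hidden=768, feed_forward=3072):
    """Bias parameters, assuming one bias vector per matrix above."""
    return 4 * hidden + feed_forward + hidden


HEAD_WEIGHTS = 1_770_240  # classification layers of Model 5
UNFROZEN_ENCODERS = 4

total = UNFROZEN_ENCODERS * encoder_weight_count() + HEAD_WEIGHTS

print(f"{encoder_weight_count():,}")  # 7,077,888
print(f"{encoder_bias_count():,}")    # 6,912
print(f"{total:,}")                   # 30,081,792
```

The last number is where the figure of around 30 million trainable weights comes from: four unfrozen encoders plus the two classification layers.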
+ + + +### Notes on the Training Data + +We now train on the **full CoNLL training set**, with 946 CoNLL documents. These contain 18,330 named entities with annotations. The candidate generator fails to find any candidates for 3,379 mentions, and for another 4,145 mentions the correct candidate is not in the list of candidates. We still train on mentions without a ground truth candidate. From a total of 353,203 candidates, we have 10,815 ground truth candidates and 342,397 incorrect candidates. + +The dataset is quite unbalanced, with around 32 times as many "negative" examples (where the label is "False") as "positive" examples (where the label is "True"). If the model only ever predicted that candidate and mention do not match (i.e. it always predicts the label "False"), it would be right for 96.9 % of the candidates, but would not find a single correct candidate. We want to force the model to prioritize finding the ground truth candidates as well. The solution is to give a higher loss penalty for a wrong prediction when the candidate is correct (the "positive" examples). In practice, we simply multiply the loss of those predictions by the ratio of negative to positive labels: \\(\frac{342397}{10815} \approx 32\\). + + +### Training Parameters +The loss function is a binary cross entropy loss, which is the standard loss function for binary classification. We are using an Adam optimizer, with weight decay for regularization. The learning rate follows a cosine annealing schedule, with an initial learning rate of \\(2\times 10^{-5}\\). We use a batch size of 24. + +In the BERT paper, Devlin et al. suggest three epochs of fine-tuning for most tasks. To make sure we squeeze the potential out of our model, we train for up to five epochs, with early stopping when the accuracy on the validation data deteriorates. + + +# 5. 
Evaluation {#evaluation} + +After training the BERT NED model on CoNLL training data, it makes sense to first look at the model's performance on the CoNLL test data. + + +## Accuracy + +There are multiple ways of measuring the accuracy of the model. For one, we could look at only the mentions where **the ground truth (GT) is among the proposed candidates**. This gives us three types of predictions: + +1. Correct (True Positive): BERT NED picks the ground truth candidate. +2. Wrong (False Positive): BERT NED picks the wrong candidate. +3. Wrong (False Negative): BERT NED picks **no candidate**, but the correct entity **is** in the list of candidates. + +A second option is to also include the cases where **the ground truth is not in the list of candidates**. If we include this second category of mentions, we have two additional types of predictions: + +4. Correct (True Negative): The correct entity **is not** in the list of candidates, and BERT NED correctly picks **no candidate**. +5. Wrong (False Positive): The correct entity **is not** in the list of candidates, but BERT NED wrongly **picks a candidate**. + +The **"confusion matrix"** of a classification model shows how the model performs in these different categories. It is a table with the **true label** in the columns, and the **prediction** of the model in the rows. + +When we look at **all five types of predictions** from above, the confusion matrix has two columns: "the ground truth **is** in the list of candidates" and "the ground truth **is not** in the list of candidates". +Along the rows, we have two cases: "the model picks a candidate", and "the model does not pick a candidate". The case where the model picks a candidate is then split between picking the correct candidate and a wrong candidate. + +The resulting confusion matrix is as follows: +
+| Predicted (rows) / Actual class (columns) | GT in candidates | GT not in candidates |
+| ---- | ---- | ---- |
+| Picks candidate | Picks GT (True Positive) / Picks wrong (False Positive) | False Positive |
+| Picks none | False Negative | True Negative |
+ +If we fill in the confusion matrix of BERT NED with the **results on the CoNLL test set**, we get the following: +
+| Predicted (rows) / Actual class (columns) | GT in candidates | GT not in candidates |
+| ---- | ---- | ---- |
+| Picks candidate | 2129 (TP) / 27 (FP) | 16 (FP) |
+| Picks none | 59 (FN) | 720 (TN) |
+ +We use the following standard formula to calculate the prediction **accuracy** from the numbers in the confusion matrix: + +\\[\frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}\\] +\\[ = \frac{2129 + 720}{2129 + 720 + 27 + 59 + 16} = 96.54\text{ %}\\] + + + +
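The same calculation as a short Python check. The function name `accuracy` is just for illustration; the counts are the CoNLL test numbers quoted above, with the two kinds of False Positive added together.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# CoNLL test counts; the False Positives are 27 (wrong candidate picked)
# plus 16 (candidate picked although the ground truth was not available).
conll_test = {"tp": 2129, "tn": 720, "fp": 27 + 16, "fn": 59}

print(f"{accuracy(**conll_test):.2%}")  # 96.54%
```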
+Further reading: Evaluating different thresholds + +The output of BERT NED generally falls in the range of -15 to 15. If we apply the logistic sigmoid function to the output, we squish it to a value between 0 and 1. If the model outputs 0, the logistic sigmoid function evaluates to 0.5 (50 %). In other words, the model is maximally uncertain between "is the right candidate" and "is not the right candidate". + +Though this seems like an obvious threshold, we may want to **include some results where BERT NED is less certain**. We define a new model with a confidence threshold of 0.25 (25 %). This model gives us the following confusion matrix on CoNLL test: +
+| Predicted (rows) / Actual class (columns) | GT in candidates | GT not in candidates |
+| ---- | ---- | ---- |
+| Picks candidate | 2144 (TP) / 28 (FP) | 62 (FP) |
+| Picks none | 43 (FN) | 674 (TN) |
+ +By comparing this confusion matrix with the confusion matrix of the original model, we see that it has more True Positive predictions (2129 to 2144). This is because this model is **bolder in making a prediction**. We also see that the number of True Negatives decreases (720 to 674). Indeed, the gain in True Positive predictions (15) is smaller than the loss in True Negative predictions (46), and the overall accuracy of the model goes down from the 96.54 % of the previous model: + +\\[\frac{2144 + 674}{2144 + 674 + 28 + 43 + 62} = 95.49\text{ %}\\] + + + +We conclude that the original **threshold of 0.5 confidence performs better** than the second threshold of 0.25, and we keep the original model. + +
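To make the role of the threshold concrete, here is a small Python sketch. It assumes that the model picks the highest-scoring candidate and then rejects it if its confidence falls below the threshold. This selection rule and the name `pick_candidate` are assumptions made for illustration; they are not taken from the project code.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps a raw model output to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the sigmoid: the raw output that gives confidence p."""
    return math.log(p / (1.0 - p))


def pick_candidate(scores, threshold=0.5):
    """Index of the best candidate, or None if it is below the threshold."""
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if sigmoid(scores[best]) >= threshold else None


scores = [-3.0, -0.5, -6.0]          # made-up raw outputs for three candidates
print(pick_candidate(scores))        # None: best confidence is about 0.38
print(pick_candidate(scores, 0.25))  # 1: the bolder model picks a candidate
print(round(logit(0.25), 2))         # -1.1: raw output at 25 % confidence
```

A confidence threshold of 0.5 corresponds to a raw output of 0, and a threshold of 0.25 to a raw output of roughly -1.1. In the range of -15 to 15, the two models therefore only disagree on candidates whose raw output falls in this narrow band.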
+ + +## Results + +Because we have relied on external components for Named Entity Recognition and candidate generation, BERT NED's performance can only get as good as these preceding components. In the results table below, we compare the model to **two baseline models**. Both of these models use the same modules for NER (the spaCy language model) and candidate generation (the spaCy KnowledgeBase object) as BERT NED. The only difference between these models and our system is the NED module, so we can directly compare BERT NED's performance. + +The first model, the **"Prior linker"**, is fairly simple: it always picks the candidate with the highest prior probability. This is akin to picking the candidate that is most frequent in the texts the model has seen. + +The second model, the **"spaCy linker"**, uses the [default entity linking pipeline](https://spacy.io/api/architectures#EntityLinker) from spaCy. The Wikipedia abstracts of entities in the knowledge base are used to make embeddings: all the words in a Wikipedia abstract are embedded by the spaCy language model, 'en_core_web_lg', and the average of the word embeddings gives the final embedding for that knowledge base entity. The "spaCy linker" model has been trained in a self-supervised way on at least 90,000 Wikipedia articles, where hyperlinks to other Wikipedia articles were used as mentions. + +The two benchmark models always pick a candidate. That means they are not able to predict that the ground truth is not among the candidates. For that reason, we only report the accuracy of the model in picking the right candidate **when the ground truth is in the list of candidates**. This is equivalent to the accuracy in the **left column of the confusion matrix**, and not the same accuracy as we calculated for the model above. + +
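As a rough sketch of these two ideas in Python: the "Prior linker" rule, and the accuracy restricted to mentions where the ground truth is among the candidates. The function names, the entity identifiers and the prior values are made up for the example. The counts in the last line are the BERT NED numbers from the left column of the CoNLL test confusion matrix.

```python
def prior_linker(candidates):
    """Pick the entity with the highest prior probability.

    `candidates` is a list of (entity_id, prior_probability) pairs.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


def in_candidates_accuracy(picks_gt, picks_wrong, picks_none):
    """Accuracy over mentions whose ground truth is among the candidates."""
    return picks_gt / (picks_gt + picks_wrong + picks_none)


# Hypothetical candidates for one mention, with invented priors.
print(prior_linker([("entity_a", 0.85), ("entity_b", 0.10)]))  # entity_a

# BERT NED on CoNLL test, left column of the confusion matrix.
print(f"{in_candidates_accuracy(2129, 27, 59):.2%}")  # 96.12%
```

The second number is lower than the 96.54 % computed earlier because the True Negatives no longer count towards the correct predictions.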
+ + Further reading: About the benchmark datasets. + + + +The CoNLL dataset has a validation set for use during training ("test-a"), and a proper test dataset that is only used to check performance after training ("test-b"). However, because we have not used the validation set for any purpose other than early stopping between epochs, we report the results on both these datasets. + +In addition to the two CoNLL datasets, we report the results on **three other datasets**. The model is **not further trained** on any data from these datasets. Using datasets that are distinct from CoNLL can give an idea of how the model generalizes outside this domain. + +The "Wikipedia" dataset is an unpublished hand-annotated dataset of 40 Wikipedia articles with 738 annotated mentions. The smaller "ACE-2004" dataset has 57 documents with 306 mentions. The "MSNBC" dataset has 20 news articles with 756 mentions from MSNBC News. + + +
+ +| Model | CoNLL test | CoNLL dev | Wikipedia | ACE-2004 | MSNBC | +| ---- | ---- | ---- | ---- | ---- | ---- | +| BERT NED | **96.12 %** | **96.55 %** | 89.45 % | **92.64 %** | **90.97 %** | +| Prior linker | 85.64 % | 88.75 % | **89.87 %** | 91.41 % | 74.77 % | +| spaCy linker | 83.75 % | 87.02 % | 87.34 % | 82.21 % | 69.68 % | + +We see that BERT NED outperforms the other two models in all domains except Wikipedia. The spaCy linker is trained on Wikipedia articles, and the prior linker uses priors from Wikipedia. Comparing them on the CoNLL datasets is also unfair, as this is the type of data that BERT NED is trained on. However, none of the models has seen data from "ACE-2004" and "MSNBC" during training, and BERT NED shows a better performance on both datasets. + + + + + + + + + + + + + +# 6. Reproducing the Results {#reproduce} + +With access to a GPU, you can reproduce the results in this article. The code for training a model can be found in [this repository](https://github.com/amundfr/bert_ned). It includes a Dockerfile to **reproduce the environment** for the scripts. To build the Docker image: + +```shell +docker build -t bert_ned . +``` + +If you are running this on a machine from the Algorithms and Data Structures (AD) group at the University of Freiburg, use 'wharfer' instead of 'docker': + +```shell +wharfer build -t bert_ned . +``` + +When running the Docker container, you will need to mount volumes with the **prerequisite data**. This data can be found in '/nfs/students/matthias-hertel/wiki_entity_linker/' on any AD machine. + +If you cannot access the data, you will need to reproduce it. The system requires a spaCy vocabulary, a spaCy KnowledgeBase with Wikidata QIDs, an annotated version of CoNLL with Wikidata IDs, and a mapping of Wikidata QID to Wikipedia abstracts. + +By mounting a directory for the data that the program generates, the system can take shortcuts in later runs. A directory for the trained model is also a good idea. 
This is an example command to run the container: + +```shell +docker run -v /nfs/students/matthias-hertel/wiki_entity_linker:/bert_ned/ex_data \ +-v /some/local/directory/with/data:/bert_ned/data \ +-v /some/local/directory/with/models:/bert_ned/models -it bert_ned +``` + +Note that we are mounting to /bert_ned/ex_data for external data, /bert_ned/data for internal data, and /bert_ned/models for the model. If you want to change the paths used in the container, simply edit the file and directory paths in the 'config.ini' file. + +Inside the container, the script at 'bert_ned_full_pipeline.py' contains all the steps. This script is governed by the settings in the 'config.ini' file. Run the script inside the container with the following command: + +```shell +python bert_ned_full_pipeline.py +``` + +In order to train a model, you will need a CUDA-enabled GPU with sufficient working memory (GPU RAM). The model in this project was trained on an NVIDIA Titan X Pascal GPU with 12 GB memory. Finally, you only need patience: the training takes a few hours per epoch. + + +# 7. Summary and Future Work {#summary} + + +In this article, we have shown that the contextualized token embeddings from BERT form a good basis for Named Entity Disambiguation. With a neural network trained on classifying matching pairs of mentions and candidates, we achieve an accuracy of over 96 % in candidate selection. + +Future work should focus on integrating Named Entity Recognition or candidate generation with BERT NED. By formulating a greater part of the problem as one neural network model, we could draw on the advantages of end-to-end learning (learning all the steps of the pipeline simultaneously as one neural network) and representation learning (learning to extract features from raw data). 
+ + + diff --git a/static/img/project-ned-with-bert/BERT_NED_banner.png b/static/img/project-ned-with-bert/BERT_NED_banner.png new file mode 100644 index 0000000..bd8ca42 Binary files /dev/null and b/static/img/project-ned-with-bert/BERT_NED_banner.png differ diff --git a/static/img/project-ned-with-bert/BERT_NED_illustration.png b/static/img/project-ned-with-bert/BERT_NED_illustration.png new file mode 100644 index 0000000..98c4e1f Binary files /dev/null and b/static/img/project-ned-with-bert/BERT_NED_illustration.png differ diff --git a/static/img/project-ned-with-bert/attention_example_paris.png b/static/img/project-ned-with-bert/attention_example_paris.png new file mode 100644 index 0000000..69d809b Binary files /dev/null and b/static/img/project-ned-with-bert/attention_example_paris.png differ diff --git a/static/img/project-ned-with-bert/general_bert_architecture.png b/static/img/project-ned-with-bert/general_bert_architecture.png new file mode 100644 index 0000000..4501bfd Binary files /dev/null and b/static/img/project-ned-with-bert/general_bert_architecture.png differ diff --git a/static/img/project-ned-with-bert/general_bert_ned_architecture.png b/static/img/project-ned-with-bert/general_bert_ned_architecture.png new file mode 100644 index 0000000..b22c21a Binary files /dev/null and b/static/img/project-ned-with-bert/general_bert_ned_architecture.png differ