This document delves into the encode.py
script, which offers a practical implementation of the Byte Pair Encoding (BPE) algorithm. The script directly implements the BPE algorithm as outlined in the "Neural Machine Translation of Rare Words with Subword Units" paper.
In this section, we embark on a comprehensive exploration of the encode.py
script, dissecting its source code and understanding the inner workings of the Byte Pair Encoding (BPE) algorithm.
import argparse
import collections
import re
def get_stats(vocab):
pairs = collections.defaultdict(int)
for word, freq in vocab.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[symbols[i], symbols[i + 1]] += freq
return pairs
def merge_vocab(pair, v_in):
v_out = {}
bigram = re.escape(" ".join(pair))
p = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
for word in v_in:
w_out = p.sub("".join(pair), word)
v_out[w_out] = v_in[word]
return v_out
def main(args: argparse.Namespace) -> None:
vocab = {
"l o w </w>": 5,
"l o w e r </w>": 2,
"n e w e s t </w>": 6,
"w i d e s t </w>": 3,
}
num_merges = 10
for i in range(num_merges):
pairs = get_stats(vocab)
best = max(pairs, key=pairs.get)
vocab = merge_vocab(best, vocab)
print(best)
Let's begin by dissecting the source code of the encode.py
script, unraveling its core components and functions:
-
Function Purpose:
- This function is responsible for calculating the frequency of each adjacent pair of symbols (bigrams) within the given vocabulary.
-
Vocabulary Structure:
- The vocabulary (
vocab
) is structured as a dictionary, where each key represents a word in the form of a space-separated string of symbols. The corresponding value denotes the frequency of that word.
- The vocabulary (
-
Function Objective:
- The
merge_vocab
function is tasked with merging all occurrences of a specified pair of symbols within the vocabulary.
- The
-
Regular Expression Utilization:
- To achieve this, the function utilizes regular expressions to ensure that the pair is merged only when the symbols are contiguous without any intervening space.
The main
function orchestrates the entire BPE merging process, consisting of the following steps:
-
Initialization:
- It initializes a sample vocabulary, representing words as sequences of characters. Additionally, it introduces a special symbol,
</w>
, to mark the end of a word.
- It initializes a sample vocabulary, representing words as sequences of characters. Additionally, it introduces a special symbol,
-
Merging Iterations:
- The script conducts a fixed number of BPE merges (
num_merges
). In each iteration, it identifies and merges the most frequent pair of symbols found within the current vocabulary.
- The script conducts a fixed number of BPE merges (
With this foundation, let's proceed to analyze the output generated by the script, shedding light on the evolving pairs of symbols during the BPE process.
The script generates output displaying the pairs of symbols merged in each iteration. Here's a sample of the output:
19:27:12 | ~/Local/byte-pair
(.venv) git:(main | Δ) λ python -m byte_pair.encode
('e', 's')
('es', 't')
('est', '</w>')
('l', 'o')
('lo', 'w')
('n', 'e')
('ne', 'w')
('new', 'est</w>')
('low', '</w>')
('w', 'i')
For the given vocabulary and a specified number of merges, the following observations can be made:
-
Initial Merges:
- In the initial iterations, the most frequent pairs are ('e', 's'), ('es', 't'), and ('est', '</w>'). These pairs reflect the high occurrence of these combinations of symbols within your sample vocabulary.
-
Progressive Merges:
- As the iterations continue, other pairs like ('l', 'o'), ('lo', 'w'), and ('n', 'e') are merged. Each merge decision is based on the current vocabulary state, evolving as pairs are progressively combined.
-
Formation of Subword Units:
- Towards the later iterations, you can observe merges like ('new', 'est</w>') and ('low', '</w>'). These signify the creation of larger subword units from previously merged smaller components.
This analysis provides insight into how Byte-Pair Encoding dynamically constructs subword units through iterative merging.
Byte-Pair Encoding (BPE) initiates with characters as its fundamental units and progressively merges the most frequently occurring adjacent pairs. This iterative approach results in the creation of increasingly common subword units.
The presence of the </w>
symbol is pivotal in BPE as it marks the end of a word. This symbol plays a crucial role in enabling the algorithm to differentiate between merging characters within words and those across word boundaries.
BPE's merging decisions are solely driven by frequency. It prioritizes merging pairs based on their occurrence frequency, making it effective in compressing vocabularies and handling rare or out-of-vocabulary words.
Consider enhancing flexibility by enabling the script to build its vocabulary dynamically from an input corpus, rather than relying on a fixed sample vocabulary.
Introduce command-line parameterization, allowing users to customize parameters like the number of merges (n_merges
) or input file paths for greater adaptability.
To facilitate tokenization and further processing, explore options for storing the series of merges or the final vocabulary in a structured format, such as a file.
In conclusion, this implementation provides a clear illustration of how Byte-Pair Encoding operates. By constructing a hierarchical structure of subword units from a basic character-level representation, BPE offers an effective approach for handling extensive vocabularies in diverse natural language processing tasks.