diff --git a/README.md b/README.md
index 90d1c4d..0191e8d 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ Finally, the compression ratio achieved by the tokenizer is important to conside
 
 With all these things in mind, we decided that we want our own tokenizer for the models we will train, that is better optimized for our use cases, such as storytelling.
 
-Tokenizers are trained on data, so we started by extracting small randomized subsets from the various distinct subsets of our model training dataset and used these to evaluate the available tokenizer training approaches. Both Huggingface's tokenizers library and Google's sentencepiece support training tokenizers of different types. A preliminary investigation showed that sentencepiece's trainer is more memory efficient, although a training dataset in the low double digit gibibytes still required a compute node with 1TB of RAM to run successfully. Due to this, we decided to use
+Tokenizers are trained on data, so we started by extracting small randomized subsets from the various distinct subsets of our model training dataset and used these to evaluate the available tokenizer training approaches. Both Huggingface's tokenizers library and Google's sentencepiece support training tokenizers of different types. A preliminary investigation showed that sentencepiece's trainer is more memory efficient, although a training dataset in the low double digit gibibytes still required a compute node with 1TB of RAM to run successfully. Due to this, we decided to use sentencepiece.
 
 We originally decided on a vocabulary size of 32000, but when training Genji V2, we found that modifying an existing tokenizer to support an additional language was not a pleasant experience. As it seems likely that we will want to do similar [language transfer learning](https://blog.novelai.net/data-efficient-language-transfer-with-gpt-j-45daedaaf35a) in the future, we have decided to have our tokenizer accommodate both English and Japanese from the start. For this reason, we decided to double the vocabulary size to 64000, which then was close to filling up the available token ID space of 16 bits, so we went all the way to a vocabulary size of 65535 tokens. During tokenizer training, I carefully balanced the training data in such a way that latin alphabet tokens of a length of at least 2 characters and Japanese language tokens take up approximately the same amount of token space. Bumping the vocabulary size up to 65535 also allows more Unicode character tokens such as emoji. For the Japanese part of tokenizer training data, we used our existing Genji training data and a comparatively smaller amount of Japanese Wikipedia.
 
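
For reference, a minimal sketch of what training a sentencepiece tokenizer at this vocabulary size looks like. The file names, model type, and parameter values below are illustrative assumptions, not the exact settings used for this tokenizer:

```python
# Sketch of sentencepiece training; names and parameter values are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tokenizer_training_data.txt",  # hypothetical balanced English/Japanese sample
    model_prefix="novelai_tokenizer",     # hypothetical output name -> .model/.vocab files
    vocab_size=65535,                     # fills the 16-bit token ID space
    model_type="bpe",                     # assumption; sentencepiece also offers "unigram"
    character_coverage=0.9995,            # high coverage matters for Japanese scripts
    input_sentence_size=10_000_000,       # subsample the input to bound trainer memory use
    shuffle_input_sentence=True,          # randomize which sentences are kept
)
```

Subsampling via `input_sentence_size` is one way to keep the trainer's memory footprint manageable, given how memory-hungry training on tens of gibibytes of text is.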
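A practical consequence of capping the vocabulary at 65535: every token ID fits into an unsigned 16-bit integer, so tokenized corpora can be stored at half the size of 32-bit IDs. A sketch of that, assuming the hypothetical model file from above and numpy for storage (an assumption, not part of the original):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="novelai_tokenizer.model")  # hypothetical path
ids = sp.encode("Some storytelling text to tokenize.")
# All IDs are < 65536, so uint16 is lossless and halves storage versus int32.
packed = np.asarray(ids, dtype=np.uint16)
packed.tofile("tokens.bin")  # hypothetical output file
```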