Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
finetunej authored Apr 20, 2023
1 parent 1d9fa71 commit 7a80803
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ Finally, the compression ratio achieved by the tokenizer is important to conside

With all these things in mind, we decided that we want our own tokenizer for the models we will train, that is better optimized for our use cases, such as storytelling.

Tokenizers are trained on data, so we started by extracting small randomized subsets from the various distinct subsets of our model training dataset and used these to evaluate the available tokenizer training approaches. Both Huggingface's tokenizers library and Google's sentencepiece support training tokenizers of different types. A preliminary investigation showed that sentencepiece's trainer is more memory efficient, although a training dataset in the low double digit gibibytes still required a compute node with 1TB of RAM to run successfully. Due to this, we decided to use
Tokenizers are trained on data, so we started by extracting small randomized subsets from the various distinct subsets of our model training dataset and used these to evaluate the available tokenizer training approaches. Both Huggingface's tokenizers library and Google's sentencepiece support training tokenizers of different types. A preliminary investigation showed that sentencepiece's trainer is more memory efficient, although a training dataset in the low double digit gibibytes still required a compute node with 1TB of RAM to run successfully. Due to this, we decided to use sentencepiece.

We originally decided on a vocabulary size of 32000, but when training Genji V2, we found that modifying an existing tokenizer to support an additional language was not a pleasant experience. As it seems likely that we will want to do similar [language transfer learning](https://blog.novelai.net/data-efficient-language-transfer-with-gpt-j-45daedaaf35a) in the future, we have decided to have our tokenizer accommodate both English and Japanese from the start. For this reason, we decided to double the vocabulary size to 64000, which then was close to filling up the available token ID space of 16 bits, so we went all the way to a vocabulary size of 65535 tokens. During tokenizer training, I carefully balanced the training data in such a way that latin alphabet tokens of a length of at least 2 characters and Japanese language tokens take up approximately the same amount of token space. Bumping the vocabulary size up to 65535 also allows more Unicode character tokens such as emoji. For the Japanese part of tokenizer training data, we used our existing Genji training data and a comparatively smaller amount of Japanese Wikipedia.

Expand Down

0 comments on commit 7a80803

Please sign in to comment.