tokenizer : BPE fixes (#7530)
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
jaime-m-p authored Jun 18, 2024
1 parent 91c188d commit 37bef89
Showing 5 changed files with 1,285 additions and 1,055 deletions.
