
Fixed the issue of being unable to handle transformer added/expanded model tokens #83

Merged: 1 commit, Mar 10, 2024

Conversation

@Qubitium (Contributor) commented on Mar 7, 2024

For transformers, tokenizer.vocab_size excludes all tokens added via vocabulary expansion. The correct usage here is len(tokenizer).

ref: https://stackoverflow.com/questions/67412925/what-is-the-difference-between-lentokenizer-and-tokenizer-vocab-size
ref: huggingface/tokenizers#900 (comment)

Without this PR, any new custom tokens added to the transformer model and subsequently trained will be invisible to the lm-format-enforcer.
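A minimal sketch of that discrepancy, assuming a stock Hugging Face tokenizer (the "gpt2" checkpoint and the custom token below are illustrative, not part of this PR):

```python
# tokenizer.vocab_size reports only the base vocabulary, while len(tokenizer)
# also counts tokens registered via add_tokens / add_special_tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size, len(tokenizer))  # 50257 50257

tokenizer.add_tokens(["<my_custom_token>"])  # hypothetical expansion token
print(tokenizer.vocab_size, len(tokenizer))  # 50257 50258

# Code that iterates over range(tokenizer.vocab_size) never sees the added
# token; iterating over range(len(tokenizer)) does.
```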

@Qubitium (Contributor, Author) commented on Mar 7, 2024

@turboderp My PR only fixes the transformers tokenizer integration, as I am unfamiliar with the exllama tokenizer. The same patch may also be required for the exllama integration, depending on how the exllama tokenizer normalizes "vocab size". I find the current discrepancy in transformers a little strange.

@Qubitium (Contributor, Author) commented:
@noamgat Please review this bug fix. Thanks.

@noamgat merged commit fbcf5af into noamgat:main on Mar 10, 2024
1 check passed
@noamgat (Owner) commented on Mar 10, 2024

Thanks for the contribution!

@JoshC8C7 (Contributor) commented on Apr 11, 2024

Just a warning that this change now prohibits using models whose vocabulary size (base + added tokens) is larger than the model's embedding size, i.e. a model that hasn't been retrained to output the added tokens. Not a common case (although it is mine, since I have extra post-tokenization, pre-inference steps), but worth noting nonetheless.
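A hedged sketch of that mismatch, again using the illustrative "gpt2" checkpoint and a hypothetical custom token (not taken from this thread):

```python
# After expanding the tokenizer, len(tokenizer) can exceed the number of rows
# in the model's embedding table unless the model is resized or retrained,
# which is the situation described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(["<my_custom_token>"])  # hypothetical added token

embedding_rows = model.get_input_embeddings().weight.shape[0]
print(len(tokenizer), embedding_rows)  # 50258 vs 50257: the last id has no row

# If the model is meant to emit the new token, the usual fix is:
# model.resize_token_embeddings(len(tokenizer))
```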
