
Fixed the issue of being unable to handle transformer added/expanded model tokens #83

Merged: 1 commit, Mar 10, 2024

Conversation

@Qubitium (Contributor) commented on Mar 7, 2024

For transformers, tokenizer.vocab_size excludes all tokens added via vocabulary expansion. The correct usage here is len(tokenizer).

ref: https://stackoverflow.com/questions/67412925/what-is-the-difference-between-lentokenizer-and-tokenizer-vocab-size
ref: huggingface/tokenizers#900 (comment)

Without this PR, any new custom tokens added to the transformer model and subsequently trained will be invisible to the lm-format-enforcer.
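A minimal sketch of that discrepancy, assuming a stock Hugging Face tokenizer (the "gpt2" checkpoint and the custom token below are illustrative, not part of this PR):

```python
# tokenizer.vocab_size reports only the base vocabulary, while len(tokenizer)
# also counts tokens registered via add_tokens / add_special_tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size, len(tokenizer))  # 50257 50257

tokenizer.add_tokens(["<my_custom_token>"])  # hypothetical expansion token
print(tokenizer.vocab_size, len(tokenizer))  # 50257 50258

# Code that iterates over range(tokenizer.vocab_size) never sees the added
# token; iterating over range(len(tokenizer)) does.
```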

@Qubitium (Contributor, Author) commented on Mar 7, 2024

@turboderp My PR only fixes the transformers tokenizer integration, as I am unfamiliar with the exllama tokenizer. The same patch may also be required for the exllama integration, depending on how the exllama tokenizer normalizes "vocab size". I find the current discrepancy in transformers a little strange.

@Qubitium (Contributor, Author) commented:
@noamgat Please review this bug fix. Thanks.

@noamgat merged commit fbcf5af into noamgat:main on Mar 10, 2024
1 check passed
@noamgat (Owner) commented on Mar 10, 2024

Thanks for the contribution!

@JoshC8C7 (Contributor) commented on Apr 11, 2024

Just a warning that this change now prohibits using models whose vocabulary size (base + added tokens) is larger than the model's embedding size, i.e. a model that hasn't been retrained to output the added tokens. Not a common case (although it is mine, since I have extra post-tokenization, pre-inference steps), but worth noting nonetheless.
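A hedged sketch of that mismatch, again using the illustrative "gpt2" checkpoint and a hypothetical custom token (not taken from this thread):

```python
# After expanding the tokenizer, len(tokenizer) can exceed the number of rows
# in the model's embedding table unless the model is resized or retrained,
# which is the situation described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(["<my_custom_token>"])  # hypothetical added token

embedding_rows = model.get_input_embeddings().weight.shape[0]
print(len(tokenizer), embedding_rows)  # 50258 vs 50257: the last id has no row

# If the model is meant to emit the new token, the usual fix is:
# model.resize_token_embeddings(len(tokenizer))
```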
