Fixed handling of added/expanded model tokens

For transformers, tokenizer.vocab_size excludes all tokens added via token expansion. The correct usage here is len(tokenizer).
Qubitium committed Mar 7, 2024
1 parent ebde917 commit c930f59
Showing 1 changed file with 1 addition and 1 deletion.
lmformatenforcer/integrations/transformers.py
@@ -55,7 +55,7 @@ def unreplace_logits_warper(self):
 def _build_regular_tokens_list(tokenizer: PreTrainedTokenizerBase) -> List[Tuple[int, str, bool]]:
     token_0 = tokenizer.encode("0")[-1]
     regular_tokens = []
-    for token_idx in range(tokenizer.vocab_size):
+    for token_idx in range(len(tokenizer)):
         if token_idx in tokenizer.all_special_ids:
             continue
         # We prepend token 0 and skip the first letter of the result to get a space if the token is a start word.
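
For context, a minimal sketch of the transformers behavior this change works around (the "gpt2" checkpoint and the added token are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)               # 50257: base vocabulary only
tokenizer.add_tokens(["<my_new_token>"])  # expand the vocabulary by one token
print(tokenizer.vocab_size)               # still 50257: added tokens are not counted
print(len(tokenizer))                     # 50258: base vocabulary plus added tokens

Iterating with range(tokenizer.vocab_size) therefore skips the IDs assigned to added tokens; switching to range(len(tokenizer)) covers the full, expanded vocabulary.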
