convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
compilade committed Aug 22, 2024
1 parent fa358e7 commit 38913dc
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions convert_hf_to_gguf.py
@@ -2801,13 +2801,13 @@ def set_vocab(self):
         vocab_size = -(vocab_size // -pad_vocab) * pad_vocab
         self.hparams["vocab_size"] = vocab_size
 
-        if (self.dir_model / "tokenizer.json").is_file():
-            self._set_vocab_gpt2()
-        elif (self.dir_model / "tokenizer.model").is_file():
+        if (self.dir_model / "tokenizer.model").is_file():
             self._set_vocab_sentencepiece()
         elif (self.dir_model / "tokenizer.model.v3").is_file():
             # mamba-codestral
             raise NotImplementedError(f"Please rename {self.dir_model / 'tokenizer.model.v3'} to {self.dir_model / 'tokenizer.model'}")
+        elif (self.dir_model / "tokenizer.json").is_file():
+            self._set_vocab_gpt2()
         else:
             # Use the GPT-NeoX tokenizer when no tokenizer files are present
             self._set_vocab_builtin("gpt-neox", vocab_size)
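For reference, below is a minimal standalone sketch of the tokenizer-selection order after this change. The function name pick_tokenizer and its return values are hypothetical illustrations, not part of convert_hf_to_gguf.py; the point is that a SentencePiece tokenizer.model is now preferred over tokenizer.json whenever both are present.

    from pathlib import Path

    def pick_tokenizer(dir_model: Path) -> str:
        # Hypothetical helper mirroring the selection order after this commit.
        if (dir_model / "tokenizer.model").is_file():
            # A SentencePiece model is preferred whenever it is present.
            return "sentencepiece"
        elif (dir_model / "tokenizer.model.v3").is_file():
            # Mamba-Codestral ships a v3 file; the converter asks the user to rename it.
            raise NotImplementedError(
                f"Please rename {dir_model / 'tokenizer.model.v3'} to {dir_model / 'tokenizer.model'}"
            )
        elif (dir_model / "tokenizer.json").is_file():
            # tokenizer.json (GPT-2/BPE style) is now only checked as a fallback.
            return "gpt2"
        else:
            # No tokenizer files at all: fall back to the built-in GPT-NeoX vocab.
            return "gpt-neox"

With a Mamba-Codestral-7B-v0.1 checkout containing both tokenizer.json and tokenizer.model, this order selects the SentencePiece file, avoiding the tokenizer.json workarounds mentioned in the commit message.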
