
Model versus tokenizer mismatch #23

Open
petergreis opened this issue Apr 22, 2024 · 1 comment

@petergreis

Greetings

I am attempting to load a fine-tuned model into llama.cpp. Since the error originates from my SFT ChatMusician model, I am posting it here. I went back and reproduced this on a saved copy of the model.

Running this:
python3 convert.py /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer

Yields this:

ValueError: Vocab size mismatch (model has 32000, but /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer/tokenizer.model has 32001).

And in the model directory itself I see:

 % more added_tokens.json
{
  "<pad>": 32000
}

Which explains why the token count is off by one. Any idea how I can get the two to agree?
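A minimal sketch of the arithmetic behind the error, assuming convert.py compares the model's embedding-table size against the base `tokenizer.model` pieces plus the entries in `added_tokens.json` (the function name and inputs here are illustrative, not part of convert.py):

```python
import json

def vocab_mismatch(model_vocab_size, base_tokenizer_size, added_tokens):
    """Compare the model's embedding-table size with the tokenizer's
    effective vocabulary (base SentencePiece pieces + added tokens).

    Returns (size_difference, added tokens whose ids have no embedding row).
    """
    tokenizer_size = base_tokenizer_size + len(added_tokens)
    orphans = {tok: idx for tok, idx in added_tokens.items()
               if idx >= model_vocab_size}
    return tokenizer_size - model_vocab_size, orphans

# The numbers from the error above: the model has 32000 embedding rows,
# tokenizer.model has 32000 base pieces, and added_tokens.json adds <pad>.
diff, orphans = vocab_mismatch(
    model_vocab_size=32000,
    base_tokenizer_size=32000,
    added_tokens=json.loads('{"<pad>": 32000}'),  # contents of added_tokens.json
)
print(diff, orphans)  # 1 {'<pad>': 32000}
```

Two common ways to reconcile the two, depending on your setup: delete the `<pad>` entry from added_tokens.json if the token was never actually used during fine-tuning, or resize the model's embedding table to match the tokenizer before saving and converting (in transformers, `model.resize_token_embeddings(len(tokenizer))`).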

@petergreis petergreis changed the title Model versus tokeniser mismatch Model versus tokenizer mismatch Apr 22, 2024
@hf-lin
Owner

hf-lin commented Apr 28, 2024

I found the same issue: ggerganov/llama.cpp#4045 (comment)
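One workaround discussed in that llama.cpp thread is padding the model vocabulary at conversion time. This assumes a recent llama.cpp checkout whose convert.py accepts `--pad-vocab`; treat it as a suggestion from the linked discussion, not a confirmed fix for this model:

```shell
# Pad the model's embedding table up to the tokenizer's vocab size
# instead of failing on the off-by-one mismatch.
python3 convert.py /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer --pad-vocab
```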
