
Model versus tokenizer mismatch #23

Open
petergreis opened this issue Apr 22, 2024 · 1 comment

@petergreis

Greetings

I am attempting to load a fine-tuned model into llama.cpp. Since the error originates from my SFT ChatMusician model, I am posting it here. I went back and reproduced this on a saved copy of the model.

Running this:
python3 convert.py /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer

Yields this:

ValueError: Vocab size mismatch (model has 32000, but /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer/tokenizer.model has 32001).

And in the model directory itself I see:

 % more added_tokens.json
{
  "<pad>": 32000
}

Which explains why the token count is off by one. Any idea how I can get the two to agree?
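A minimal sketch of the arithmetic behind the error, assuming convert.py compares the model's embedding-table size against the base `tokenizer.model` pieces plus the entries in `added_tokens.json` (the function name and inputs here are illustrative, not part of convert.py):

```python
import json

def vocab_mismatch(model_vocab_size, base_tokenizer_size, added_tokens):
    """Compare the model's embedding-table size with the tokenizer's
    effective vocabulary (base SentencePiece pieces + added tokens).

    Returns (size_difference, added tokens whose ids have no embedding row).
    """
    tokenizer_size = base_tokenizer_size + len(added_tokens)
    orphans = {tok: idx for tok, idx in added_tokens.items()
               if idx >= model_vocab_size}
    return tokenizer_size - model_vocab_size, orphans

# The numbers from the error above: the model has 32000 embedding rows,
# tokenizer.model has 32000 base pieces, and added_tokens.json adds <pad>.
diff, orphans = vocab_mismatch(
    model_vocab_size=32000,
    base_tokenizer_size=32000,
    added_tokens=json.loads('{"<pad>": 32000}'),  # contents of added_tokens.json
)
print(diff, orphans)  # 1 {'<pad>': 32000}
```

Two common ways to reconcile the two, depending on your setup: delete the `<pad>` entry from added_tokens.json if the token was never actually used during fine-tuning, or resize the model's embedding table to match the tokenizer before saving and converting (in transformers, `model.resize_token_embeddings(len(tokenizer))`).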

@petergreis petergreis changed the title Model versus tokeniser mismatch Model versus tokenizer mismatch Apr 22, 2024
@hf-lin
Owner

hf-lin commented Apr 28, 2024

I found the same issue: ggerganov/llama.cpp#4045 (comment)
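One workaround discussed in that llama.cpp thread is padding the model vocabulary at conversion time. This assumes a recent llama.cpp checkout whose convert.py accepts `--pad-vocab`; treat it as a suggestion from the linked discussion, not a confirmed fix for this model:

```shell
# Pad the model's embedding table up to the tokenizer's vocab size
# instead of failing on the off-by-one mismatch.
python3 convert.py /Users/petergreis/Dropbox/Leeds/Project/chatmusician_model_tokenizer --pad-vocab
```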
