[Feature request] How to support special tokens in the tokenizer of llama_cpp? #1501
Comments
Can you tell me how you removed the extra tokens from that model?
Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1
Unless I've misunderstood, there are a lot more than that - I think even Vicuna 1.1 added a …
I did not remove any tokens; I appended extra ones. If you want to do the same, first modify the BPE model by changing the pieces field of the SentencePiece model proto class (this step needs care). Second, there are two weights whose shape depends on vocab_size, model.embed_tokens.weight and lm_head.weight; resize these two tensors and make sure the indices stay consistent with the BPE tokens.
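A minimal sketch of that procedure, assuming a Hugging Face Llama checkpoint and the Python sentencepiece package; the file paths and the '&lt;start&gt;'/'&lt;end&gt;' token strings are placeholders taken from this thread, not fixed names:

```python
# Sketch: append user-defined pieces to a SentencePiece model and resize
# the two vocab-sized weights. All paths below are placeholders.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import AutoModelForCausalLM

# 1) Extend the BPE model proto with USER_DEFINED pieces.
proto = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    proto.ParseFromString(f.read())

for tok in ["<start>", "<end>"]:
    piece = proto.pieces.add()
    piece.piece = tok
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED

with open("tokenizer_extended.model", "wb") as f:
    f.write(proto.SerializeToString())

# 2) Grow model.embed_tokens.weight and lm_head.weight to the new vocab
#    size; resize_token_embeddings handles both for Llama models. Note the
#    new rows are freshly initialized, so the added tokens are untrained.
model = AutoModelForCausalLM.from_pretrained("path/to/llama")
model.resize_token_embeddings(len(proto.pieces))
model.save_pretrained("path/to/llama-extended")
```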
Yes, the added special tokens will still be split into several parts. Also, the Ziya-LLaMA-13B-v1 model added its special tokens at the Hugging Face Transformers tokenizer level rather than at the BPE level. Therefore, inference with llama_cpp will not be consistent with the tokenization used during training, because of the add_dummy_prefix option in the original Llama BPE model.
The difference is:
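A hedged illustration of that difference, based on the add_dummy_prefix behavior explained at the end of this thread; the token strings are schematic, the model path is a placeholder, and the exact output depends on the transformers version:

```python
# Schematic comparison of training-time vs llama_cpp tokenization for a
# special token added at the Hugging Face tokenizer level.
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("path/to/llama")  # placeholder path
tok.add_special_tokens({"additional_special_tokens": ["<start>"]})

# HF tokenization first splits the text on "<start>", then runs BPE on each
# remaining part; add_dummy_prefix makes BPE prepend '▁' to every part:
print(tok.tokenize("hello<start>world"))
# e.g. ['▁hello', '<start>', '▁world']   <- extra '▁' before 'world'

# llama_cpp tokenizes the raw string in a single BPE pass, so the special
# token is split into pieces and no extra space is inserted:
# e.g. ['▁hello', '<', 'start', '>', 'world']
```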
As far as I know (not in llama.cpp), there's a key sequence that gets controlled, something like 'prefix' + '?'. I guess you can insert your token inside it, but I forgot the file's name...
Special tokens in textual form complicate things a lot, especially when they need to be escaped in strings where you don't intend them to be special tokens. I created #2306 to allow direct token input to the server. This way, only the frontend that generates the JSON needs to know the special tokens and handle them correctly.
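A minimal sketch of what that looks like from the client side, assuming a llama.cpp server whose /completion endpoint accepts a JSON array of token ids in the prompt field (the behavior #2306 proposes); all ids below are placeholders:

```python
# Sketch: send pre-tokenized input to a llama.cpp server. Token ids are
# placeholders; only this frontend needs to know the special-token ids.
import requests

BOS_ID = 1                        # placeholder Llama BOS id
START_ID, END_ID = 32000, 32001   # hypothetical ids of '<start>'/'<end>'
TEXT_IDS = [15043, 3186]          # placeholder ids for the user text

payload = {
    "prompt": [BOS_ID, START_ID] + TEXT_IDS + [END_ID],  # ids, not text
    "n_predict": 64,
}
resp = requests.post("http://localhost:8080/completion", json=payload)
print(resp.json()["content"])
```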
Is this resolved by #3538?
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I want to expand my vocab from 32000 to 32002 by adding two special tokens: '&lt;start&gt;' and '&lt;end&gt;'. However, after I hacked the BPE model (appended two user-defined tokens), resized the corresponding model weights, and converted it to a .ggml file, I found that the special tokens were still split into several token IDs. Adjusting the scores of the special tokens did not work. I do not want to insert the special tokens anywhere between 0 and 32000 (to keep the original order of tokens), so I have not tested inserting them at different positions.
In the original SentencePiece tokenizer it also does not work when the special token's type is not USER_DEFINED (see https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto). So I want to ask whether llama_cpp would consider supporting user-defined tokens. (Until then, one option is to use the Python sentencepiece package together with llama-cpp-python for inference.)
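A minimal sketch of that interim workaround, assuming the extended SentencePiece model from above and an already-converted model file; paths are placeholders:

```python
# Sketch: tokenize with the Python sentencepiece package (which honors
# USER_DEFINED pieces) and feed the ids to llama-cpp-python directly.
import sentencepiece as spm
from llama_cpp import Llama

sp = spm.SentencePieceProcessor(model_file="tokenizer_extended.model")
llm = Llama(model_path="path/to/model.ggml")  # placeholder path

prompt_ids = sp.encode("<start>Hello world<end>")  # special tokens stay single ids
out_ids = []
for tok_id in llm.generate(prompt_ids, temp=0.8):  # infinite generator
    if tok_id == sp.eos_id() or len(out_ids) >= 64:
        break
    out_ids.append(tok_id)

print(sp.decode(out_ids))
```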
Thanks for your reply.
(By the way, the Llama BPE tokenizer was trained with the add_dummy_prefix option enabled, so do not directly use the add_special_tokens function of the Hugging Face Transformers tokenizer in your training. It first splits the whole sentence on the special tokens and then passes the remaining parts to the BPE model, so the BPE model adds a space at the beginning of each part, which is inconsistent with the tokenization during llama_cpp inference.)