
[Feature request] How to support special tokens in tokenizer of llama_cpp? #1501

Closed
Snowdar opened this issue May 17, 2023 · 10 comments

Snowdar commented May 17, 2023

I want to expand my vocab from 32000 to 32002 by adding two special tokens: '<start>' and '<end>'. However, after I hacked the BPE model (appended two user-defined tokens), resized the corresponding weights of the model, and converted it to a .ggml file, I found that the special tokens were still split into several token IDs. Adjusting the scores of the special tokens did not work either. I do not want to insert the special tokens between 0 and 32000, so as to keep the original order of tokens, so I have not tested inserting the special tokens at other positions.

In the original SentencePiece tokenizer, this also does not work when the type of the special token is not user-defined (see https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto). So I want to ask whether llama_cpp would consider supporting user-defined tokens? (Until then, an option is to use the Python sentencepiece package together with llama-cpp-python for inference.)
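
For reference, extending the tokenizer roughly looks like this (a minimal sketch, not the exact code I used; the file names are placeholders and the protobuf module is assumed to be the one shipped with the sentencepiece pip package):

```python
# Append user-defined pieces to an existing SentencePiece BPE model.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:            # original Llama tokenizer (32000 pieces)
    m.ParseFromString(f.read())

for tok in ["<start>", "<end>"]:
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED  # user-defined, so it is matched as a whole piece
    m.pieces.append(piece)                           # appended at the end: ids 32000 and 32001

with open("tokenizer_extended.model", "wb") as f:
    f.write(m.SerializeToString())
```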

Thanks for your reply.

(By the way, the Llama tokenizer (BPE) was trained with the add_dummy_prefix option, so do not directly use the add_special_tokens function of the Hugging Face transformers tokenizer in your training. It first splits the whole sentence at the special tokens and then passes the remaining parts to the BPE model, so a space is added at the beginning of each part by the BPE model, which is inconsistent with the tokenization during llama_cpp inference.)
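
To illustrate the pitfall (a sketch with a placeholder model path; the printed output is only an example of the typical behavior):

```python
# Added special tokens split the text at the transformers level; each remaining
# chunk then goes through the SentencePiece model with add_dummy_prefix, so a
# leading "▁" appears where it would not in a single pass over the whole string.
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("path/to/llama-tokenizer")   # placeholder path
tok.add_special_tokens({"additional_special_tokens": ["<start>"]})

print(tok.tokenize("hello<start>world"))
# e.g. ['▁hello', '<start>', '▁world']  <- note the extra '▁' before 'world'
```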

Snowdar changed the title How to support special tokens in tokenizer of llama_cpp? [Feature request] How to support special tokens in tokenizer of llama_cpp? May 17, 2023

FNsi commented May 17, 2023

Can you tell me how you removed the extra tokens from that model?

@ShadowPower

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1
It adds two tokens, <human> and <bot>, to distinguish between text from user input and AI generation.
From what you describe, it sounds like llama.cpp can't support models like this one right now...

@JWNoctis

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1 It adds two tokens: <human> and <bot> to distinguish between text from user input and AI generation. According to you, maybe llama.cpp can't support models like this one right now...

Unless I've misunderstood, there are a lot more models like that - I think even Vicuna 1.1 added a </s>, or what they called (another?) EOS token, to delineate between generations. OpenAssistant added more than a few.


Snowdar commented May 18, 2023

Can you tell me how you removed the extra tokens from that model?

I did not remove any tokens; I appended extra ones. If you want to do the same: first, modify the BPE model by changing the pieces field of the SentencePiece model proto (this needs some care). Second, there are two weights whose shapes depend on vocab_size, model.embed_tokens.weight and lm_head.weight, so resize these two tensors and make sure the indices of the new rows are consistent with the appended BPE tokens.
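
The weight part looks roughly like this (a sketch, assuming an HF-format checkpoint loaded as a plain state dict; the file names and the mean initialization of the new rows are illustrative choices, not something from this thread):

```python
import torch

state = torch.load("pytorch_model.bin", map_location="cpu")   # placeholder checkpoint name

def grow(w: torch.Tensor, n: int) -> torch.Tensor:
    # Append n new rows, here initialized to the mean of the existing rows.
    new_rows = w.mean(dim=0, keepdim=True).repeat(n, 1)
    return torch.cat([w, new_rows], dim=0)

extra = 2  # '<start>' and '<end>'
state["model.embed_tokens.weight"] = grow(state["model.embed_tokens.weight"], extra)  # [32000, h] -> [32002, h]
state["lm_head.weight"]            = grow(state["lm_head.weight"], extra)             # [32000, h] -> [32002, h]

torch.save(state, "pytorch_model_extended.bin")
```

The vocab_size in config.json also has to be bumped to 32002 so the shapes match when the model is converted.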


Snowdar commented May 18, 2023

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1 It adds two tokens: <human> and <bot> to distinguish between text from user input and AI generation. According to you, maybe llama.cpp can't support models like this one right now...

Yes, the added special tokens will still be split into several parts. Also, the Ziya-LLaMA-13B-v1 model added the special tokens at the Hugging Face Transformers tokenizer level rather than at the BPE level. Therefore, when using llama_cpp for inference, the tokenization will not be consistent with training, because of the add_dummy_prefix option in the original Llama BPE model.


Snowdar commented May 18, 2023

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1 It adds two tokens: <human> and <bot> to distinguish between text from user input and AI generation. According to you, maybe llama.cpp can't support models like this one right now...

Unless I've misunderstood, there are a lot more than that - I think even Vicuna 1.1 added a </s> or what they called (another?) EOS token to delineate between generations. OpenAssistant added more than a few.

The difference is:
original special symbol => more than one token when tokenized, e.g. <start> => [1, 2, 3]
wanted special symbol => only one token, e.g. <start> => [1]
Moreover, </s> is just a special token at the level of the Hugging Face transformers tokenizer. You cannot tokenize it into the target token id 2 even by directly calling bpe.encode('</s>'). By the way, llama_cpp inference still works most of the time because our context does not generally include <s>, </s>, and <unk>, and detokenization is always okay. However, there is a problem when we use a user-defined special token like <start> in our context. Of course, it is not a problem when using the original special-symbol method, except that the special symbol is tokenized into more than one token.
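
A quick way to see this (a sketch, assuming the standard Llama tokenizer.model; the exact piece ids will vary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("</s>"))   # several ordinary piece ids, not [2]
print(sp.eos_id())         # 2, the control-token id, which is never produced from the text "</s>"
```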


FNsi commented May 18, 2023

From what I know (not in llama.cpp), there's a key sequence that gets handled specially, something like 'prefix' + '?'. I guess you could insert your token into it, but I forgot the file's name...


jxy commented Jul 21, 2023

Special tokens in textual forms complicate things a lot, especially when they need to be escaped in strings where you don't intend them to be special tokens.

I created #2306 to allow direct token input to the server. This way, only the frontend that generates the JSON needs to know the special tokens and handle them correctly.
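
As a rough idea of what that enables (a hedged sketch; it assumes the server's /completion endpoint accepts a prompt given as an array mixing strings and raw token ids, and 32000/32001 below are purely hypothetical special-token ids resolved by the frontend; see the PR and the server README for the actual contract):

```python
import requests

payload = {
    # hypothetical ids for <human> / <bot>, resolved by the frontend itself
    "prompt": [32000, " Hello, who are you?\n", 32001],
    "n_predict": 128,
}
r = requests.post("http://localhost:8080/completion", json=payload)
print(r.json()["content"])
```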


JohanAR commented Nov 3, 2023

Is this resolved by #3538 ?

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024