
[Feature request] How to support special tokens in tokenizer of llama_cpp? #1501

Closed
Snowdar opened this issue May 17, 2023 · 10 comments

Snowdar commented May 17, 2023

I want to expand my vocab from 32000 to 32002 by adding two special tokens: '<start>' and '<end>'. However, after I hacked the BPE model (appended two user-defined tokens), resized the corresponding weights of the model, and converted it to a .ggml file, I found that the special tokens were still split into several token IDs. Adjusting the scores of the special tokens did not work either. I do not want to insert the special tokens between 0 and 32000, so as to keep the original order of tokens, so I have not tested inserting the special tokens at other positions.

In the original SentencePiece tokenizer, this also does not work when the type of the special token is not user-defined (see https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto). So I want to ask whether llama_cpp would consider supporting user-defined tokens? (Until then, an option is to use the Python sentencepiece package together with llama-cpp-python for inference.)
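
For reference, extending the tokenizer roughly looks like this (a minimal sketch, not the exact code I used; the file names are placeholders and the protobuf module is assumed to be the one shipped with the sentencepiece pip package):

```python
# Append user-defined pieces to an existing SentencePiece BPE model.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:            # original Llama tokenizer (32000 pieces)
    m.ParseFromString(f.read())

for tok in ["<start>", "<end>"]:
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED  # user-defined, so it is matched as a whole piece
    m.pieces.append(piece)                           # appended at the end: ids 32000 and 32001

with open("tokenizer_extended.model", "wb") as f:
    f.write(m.SerializeToString())
```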

Thanks for your reply.

(By the way, the Llama tokenizer (BPE) was trained with the add_dummy_prefix option, so do not directly use the add_special_tokens function of the Hugging Face transformers tokenizer in your training. It first splits the whole sentence at the special tokens and then passes the remaining parts to the BPE model, so a space is added at the beginning of each part by the BPE model, which is inconsistent with the tokenization during llama_cpp inference.)
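
To illustrate the pitfall (a sketch with a placeholder model path; the printed output is only an example of the typical behavior):

```python
# Added special tokens split the text at the transformers level; each remaining
# chunk then goes through the SentencePiece model with add_dummy_prefix, so a
# leading "▁" appears where it would not in a single pass over the whole string.
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("path/to/llama-tokenizer")   # placeholder path
tok.add_special_tokens({"additional_special_tokens": ["<start>"]})

print(tok.tokenize("hello<start>world"))
# e.g. ['▁hello', '<start>', '▁world']  <- note the extra '▁' before 'world'
```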

Snowdar changed the title How to support special tokens in tokenizer of llama_cpp? [Feature request] How to support special tokens in tokenizer of llama_cpp? May 17, 2023

FNsi commented May 17, 2023

Can you tell me how you removed the extra tokens from that model?

@ShadowPower

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1
It adds two tokens, <human> and <bot>, to distinguish between text from user input and AI generation.
From what you describe, it sounds like llama.cpp can't support models like this one right now...

@JWNoctis

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1 It adds two tokens: <human> and <bot> to distinguish between text from user input and AI generation. According to you, maybe llama.cpp can't support models like this one right now...

Unless I've misunderstood, there are a lot more models like that - I think even Vicuna 1.1 added a </s>, or what they called (another?) EOS token, to delineate between generations. OpenAssistant added more than a few.


Snowdar commented May 18, 2023

Can you tell me how you removed the extra tokens from that model?

I did not remove any tokens; I appended extra ones. If you want to do the same: first, modify the BPE model by changing the pieces field of the SentencePiece model proto (this needs some care). Second, there are two weights whose shapes depend on vocab_size, model.embed_tokens.weight and lm_head.weight, so resize these two tensors and make sure the indices of the new rows are consistent with the appended BPE tokens.
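
The weight part looks roughly like this (a sketch, assuming an HF-format checkpoint loaded as a plain state dict; the file names and the mean initialization of the new rows are illustrative choices, not something from this thread):

```python
import torch

state = torch.load("pytorch_model.bin", map_location="cpu")   # placeholder checkpoint name

def grow(w: torch.Tensor, n: int) -> torch.Tensor:
    # Append n new rows, here initialized to the mean of the existing rows.
    new_rows = w.mean(dim=0, keepdim=True).repeat(n, 1)
    return torch.cat([w, new_rows], dim=0)

extra = 2  # '<start>' and '<end>'
state["model.embed_tokens.weight"] = grow(state["model.embed_tokens.weight"], extra)  # [32000, h] -> [32002, h]
state["lm_head.weight"]            = grow(state["lm_head.weight"], extra)             # [32000, h] -> [32002, h]

torch.save(state, "pytorch_model_extended.bin")
```

The vocab_size in config.json also has to be bumped to 32002 so the shapes match when the model is converted.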


Snowdar commented May 18, 2023

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1 It adds two tokens: <human> and <bot> to distinguish between text from user input and AI generation. According to you, maybe llama.cpp can't support models like this one right now...

Yes, the added special tokens will still be split into several parts. Also, the Ziya-LLaMA-13B-v1 model added the special tokens at the Hugging Face Transformers tokenizer level rather than at the BPE level. Therefore, when using llama_cpp for inference, the tokenization will not be consistent with training, because of the add_dummy_prefix option in the original Llama BPE model.


Snowdar commented May 18, 2023

Now there is a model that uses the add_special_tokens function: https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1 It adds two tokens: <human> and <bot> to distinguish between text from user input and AI generation. According to you, maybe llama.cpp can't support models like this one right now...

Unless I've misunderstood, there are a lot more than that - I think even Vicuna 1.1 added a </s> or what they called (another?) EOS token to delineate between generations. OpenAssistant added more than a few.

The difference is:
original special symbol => more than one token when tokenized, e.g. <start> => [1, 2, 3]
wanted special symbol => only one token, e.g. <start> => [1]
Moreover, </s> is just a special token at the level of the Hugging Face transformers tokenizer. You cannot tokenize it into the target token id 2 even by directly calling bpe.encode('</s>'). By the way, llama_cpp inference still works most of the time because our context does not generally include <s>, </s>, and <unk>, and detokenization is always okay. However, there is a problem when we use a user-defined special token like <start> in our context. Of course, it is not a problem when using the original special-symbol method, except that the special symbol is tokenized into more than one token.
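
A quick way to see this (a sketch, assuming the standard Llama tokenizer.model; the exact piece ids will vary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("</s>"))   # several ordinary piece ids, not [2]
print(sp.eos_id())         # 2, the control-token id, which is never produced from the text "</s>"
```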


FNsi commented May 18, 2023

From what I know (not in llama.cpp), there's a key sequence that gets handled specially, something like 'prefix' + '?'. I guess you could insert your token into it, but I forgot the file's name...


jxy commented Jul 21, 2023

Special tokens in textual forms complicate things a lot, especially when they need to be escaped in strings where you don't intend them to be special tokens.

I created #2306 to allow direct token input to the server. This way, only the frontend that generates the JSON needs to know the special tokens and handle them correctly.
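
As a rough idea of what that enables (a hedged sketch; it assumes the server's /completion endpoint accepts a prompt given as an array mixing strings and raw token ids, and 32000/32001 below are purely hypothetical special-token ids resolved by the frontend; see the PR and the server README for the actual contract):

```python
import requests

payload = {
    # hypothetical ids for <human> / <bot>, resolved by the frontend itself
    "prompt": [32000, " Hello, who are you?\n", 32001],
    "n_predict": 128,
}
r = requests.post("http://localhost:8080/completion", json=payload)
print(r.json()["content"])
```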


JohanAR commented Nov 3, 2023

Is this resolved by #3538 ?

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024