
tokenizer : special token handling #3538

Merged 11 commits on Oct 17, 2023

Conversation

@staviq (Contributor) commented Oct 8, 2023:

Special token handling, based on #1931, #3475

Works, but it's definitely not ready, just posting for feedback. Has some testing code meant to be removed before undrafting. Not optimized at all, just completed to the point of working.

  • Creating special tokens cache
  • Verification of vocab against the special tokens cache
  • Tokenizer pre-matcher
  • Use tokenizer pre-matcher in main for prompt format arguments
  • Determine if the space prefix can be omitted when preprocessing with special tokens

Review thread on llama.cpp (outdated):
// are special tokens.
// From testing, this appears to correlate 1:1 with special tokens.
//
for (const auto & t: vocab.token_to_id)
@ggerganov (Owner):
Feel like we have a redundancy here. We have token_data.type which is supposed to tell us if a token is special or not. In which cases this wouldn't work?

I guess we can have this piece of code here as a sanity check that the special tokens that we have read from the vocabulary are indeed special, but ideally AFAIU we shouldn't need this code in order to function correctly, correct?

@staviq (Contributor, author):

Yes. My approach is meant to be temporary (or a fallback solution), until I know for sure special tokens will always be marked as such in the vocab.

@goerch wasn't certain about token types in the BPE PR, so I opted for finding special tokens manually for now.

@goerch (Collaborator), Oct 9, 2023:

Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.

@staviq (Contributor, author):

Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.

I'm open to critique, but you have to clarify what you mean by "this" and "terrible" :), otherwise I'm not sure how to proceed here.

@ggerganov (Owner):

Yes. My approach is meant to be temporary (or a fallback solution), until I know for sure special tokens will always be marked as such in the vocab.

Ok I see. We can probably combine both the tokens marked as special and those that are unmatchable by other tokens. And probably print a warning if there is a mismatch between the 2 sets.

Let's see if @goerch has anything else to add, and then I would like to make a pass over the implementation to update the coding style to match better. After that we can merge.
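
For illustration, a minimal sketch of that cross-check, assuming the two sets of token ids have already been collected (the helper below is hypothetical, not the PR's code):

#include <cstdio>
#include <set>

// 'marked' holds the ids the vocab labels as non-NORMAL, 'unmatchable' holds the ids
// found by the heuristic in this PR; warn whenever the two sets disagree.
static void warn_on_special_mismatch(const std::set<int> & marked,
                                     const std::set<int> & unmatchable) {
    for (const int id : marked) {
        if (unmatchable.count(id) == 0) {
            fprintf(stderr, "warning: token %d is flagged special but is matchable by the tokenizer\n", id);
        }
    }
    for (const int id : unmatchable) {
        if (marked.count(id) == 0) {
            fprintf(stderr, "warning: token %d looks unmatchable but is not flagged special in the vocab\n", id);
        }
    }
}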

@goerch (Collaborator), Oct 10, 2023:

Just to be sure about our nomenclature, do you mean special tokens == added tokens? I'm mostly leaning towards the nomenclature used in HF's tokenizer.json which contains added_tokens with the is_special attribute. From @jploski I learned that these added tokens are probably very similar to sentence_piece USER_DEFINED tokens.

And one more question: do we agree that vocab.json, merges.txt, added_tokens.json and special_tokens_map.json are probably older HF (or GPT2) serialization formats and we should find all this information in tokenizer.json for newer models (paging @apage43 too because he knows a lot about this stuff)? W.r.t. these serialization formats we also seem to have some reports indicating that we don't support both of them equally well.

Edit: just adding a hunch: is it possible that tokenizer.json is the serialization format invented for the fast HF tokenizers?

@staviq (Contributor, author), Oct 10, 2023:

do you mean special tokens == added tokens?

Oh my, you better grab a chair because you are in for an adventure with this question :)

A really condensed version of what I learned so far:

I couldn't find any clear definition or explanation, but from what I found, there seems to be only one way all the pieces of this puzzle fall together to form a clear picture.

A vocab naturally forms a tree-like structure (several trees actually; there is no common root).

The tokenizer does its thing by grabbing small portions of the input text and trying to steal neighboring portions to form a valid token, rinse and repeat, until nothing can be borrowed from either the left or the right to merge with and form a valid token.

So basically, the tokenizer climbs that token tree from single bytes to full tokens, stepping only on valid branches (tokens).

Applying this idea backwards, by checking the entire vocab token by token, trying to split each token's text representation in two and checking whether both halves still form valid tokens, you can clearly see that some of the tokens in the vocab are not part of that tree family and will never be matched by the tokenizer.

This happens to match perfectly with normal tokens being marked as LLAMA_TOKEN_TYPE_NORMAL in the vocab, and tokens that are not part of the token tree family being marked with some type other than LLAMA_TOKEN_TYPE_NORMAL.

Therefore, I'm using the term special token to refer to tokens which are "hidden" from the user and unmatchable by the tokenizer.

From the practical point of view, that subset of "special" tokens contains exactly what one would expect: <s>, </s>, <unk> and, in the case of Mistral OpenOrca, <|im_start|>, <|im_end|>, i.e. tokens which are used to control the flow of inference. (EDIT: plus byte tokens <0xXX>)

And so, from the point of view of actual inference, the tokenizer has to be extended the same way, independently of the actual token type, because only normal vs. not-normal token type matters at that point.

Additional distinction between USER_DEFINED, CONTROL, etc. can and has to be done the same way, with a tokenizer "pre-matcher", at which point per-token-type decisions or actions are trivial to implement if needed.
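
For illustration, a minimal sketch of that splitting heuristic, assuming a plain text-to-id map rather than llama.cpp's actual vocab structures (the names and the single-character shortcut are illustrative, not the PR's code):

#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the vocab's text -> id map (not llama.cpp's actual vocab structures).
using vocab_map = std::unordered_map<std::string, int>;

// A token is "unmatchable" in the sense described above if no split point of its
// text yields two pieces that are both valid tokens, i.e. the tokenizer's merge
// process can never reach it from smaller tokens.
static bool is_unmatchable(const vocab_map & vocab, const std::string & text) {
    for (size_t i = 1; i < text.size(); ++i) {
        if (vocab.count(text.substr(0, i)) && vocab.count(text.substr(i))) {
            return false; // reachable by merging two smaller valid tokens
        }
    }
    return true; // never matchable -> candidate special token
}

static std::vector<int> find_special_candidates(const vocab_map & vocab) {
    std::vector<int> candidates;
    for (const auto & entry : vocab) {
        // single-character tokens have no split point at all; they are skipped here
        // to keep the sketch simple (byte tokens like <0xXX> are a separate case)
        if (entry.first.size() > 1 && is_unmatchable(vocab, entry.first)) {
            candidates.push_back(entry.second);
        }
    }
    return candidates;
}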

@goerch (Collaborator), Oct 10, 2023:

Thanks for taking the time! My mental image is slightly different: as I understand it, the initial training of the tokenizer leads to an initial vocabulary, and later model refinement tries to extend this basic vocabulary with added tokens without retraining the tokenizer on the original dataset. These added tokens should be intercepted before the original tokenizer interpretation. In my mental image, special tokens somehow describe the effect these added tokens have on detokenization, i.e. whether they should be displayed or not, for example. But I might be completely off here.

Edit:

<s>, </s>, <unk> and, in the case of Mistral OpenOrca, <|im_start|>, <|im_end|>, i.e. tokens which are used to control the flow of inference.

Here I think of <s>, </s>, <unk> as original CONTROL tokens and <|im_start|>, <|im_end|> as USER_DEFINED tokens which shouldn't be visible after detokenization.

@staviq (Contributor, author):

I simply do not know if there is any significance to added "normal" tokens (added in fine-tuning etc.)

From what I understand, in the tokenizer specifically (ignoring the detokenizer), normal tokens added the way you describe would simply be tokenized one step further into a longer token instead of being represented by a sequence of smaller tokens.

If such added tokens are not integrated with the tree structure of the vocab, the user would not be able to use them.
If, however, those added tokens are properly coherent with the vocab, the tokenizer would already match them as it is.

Which, to me, sort of seems to defeat the purpose of adding "normal" tokens this way.

I might be wrong about this part, but that still means a "tokenizer preprocessor" of some sort is needed, and this PR fits that role.

@goerch (Collaborator):

I simply do not know if there is any significance to added "normal" tokens (added in fine-tuning etc.)

I believe they have to be processed before the core tokenizer routine in something like prefix_match (the only exception being the already-defined core tokenizer CONTROL tokens). MPT's added tokens, for example (from tokenizer.json):

    {
      "id": 0,
      "content": "<|endoftext|>", <-- core, i.e CONTROL
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 1,
      "content": "<|padding|>", <-- unsure about this because it is a valid GPT2 token AFAIU
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 50254,
      "content": "                        ",  <-- extension, i.e USER_DEFINED
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    [...]

@staviq (Contributor, author) commented Oct 10, 2023:

I'm gonna be back at it today or tomorrow. I went on a slight tangent of hardware cleanup because I wanted to compare this approach with reference Python code and free Colab was OOM-ing on me, so I'm finishing rearranging and reinstalling my servers right now for more RAM.

I already rewrote matching to not use string copy, just didn't push it yet.

I just need to determine how the hardcoded space prefix should work with "segmented" input text, because I don't think it's supposed to be there if the input text starts with a special token. ( #3503 (comment) )

@apage43 (Contributor) commented Oct 11, 2023:

we should find all this information in tokenizer.json for newer models

yes, for anything that was trained with HF tokenizers, everything needed is in the tokenizer.json

w.r.t. added tokens vs. special tokens, using the terms as they are used in the tokenizers source:

  • "added" tokens are tokens that do not use the underlying tokenizer algo at all - they are just strings that have been assigned IDs - they must be found and replaced with their IDs totally separately from the BPE algorithm - that is: tokenizing <|im_start|>hello<|im_end|><|im_start|>testing test needs to first search the input for any added tokens and replace those - [123, "hello", 456, 123, "testing test"] - and only then do the BPE algo on the parts that are still untokenized (see the sketch after this list)
  • not all added tokens are special: "special" indicates that they are "control" tokens like "<|im_start|>", and the special flag is so that you can block them from being encoded/decoded (say you're encoding untrusted user input that shouldn't contain them) - but models can have non-special added tokens, like MPT's runs of spaces " ", " ", " ", etc. that just help it compactly encode indented code
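
A rough sketch of that two-phase flow, assuming a simple longest-first list of added-token strings; the types and function names are hypothetical, not the tokenizers library's or llama.cpp's API:

#include <string>
#include <utility>
#include <variant>
#include <vector>

// A fragment is either an already-resolved token id (a pre-matched added/special
// token) or a span of raw text that still has to go through the normal BPE pass.
using fragment = std::variant<int, std::string>;

// Pre-matching pass: scan the input for added-token strings and split the text
// around them. 'added' maps token text -> id and is assumed to be sorted
// longest-first so that overlapping candidates prefer the longer match.
static std::vector<fragment> prematch_added_tokens(
        const std::string & text,
        const std::vector<std::pair<std::string, int>> & added) {
    std::vector<fragment> out;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t best_at = std::string::npos;
        const std::pair<std::string, int> * best = nullptr;
        for (const auto & tok : added) {
            const size_t at = text.find(tok.first, pos);
            if (at != std::string::npos && at < best_at) {
                best_at = at;
                best    = &tok;
            }
        }
        if (best == nullptr) {
            out.emplace_back(text.substr(pos)); // trailing raw text
            break;
        }
        if (best_at > pos) {
            out.emplace_back(text.substr(pos, best_at - pos)); // raw text before the match
        }
        out.emplace_back(best->second); // resolved added-token id
        pos = best_at + best->first.size();
    }
    return out;
}
// The regular BPE tokenizer then runs only on the std::string fragments, e.g.
// "<|im_start|>hello<|im_end|>" becomes [id(<|im_start|>), "hello", id(<|im_end|>)].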

@goerch (Collaborator) commented Oct 11, 2023:

  • needs to first search the input for any added tokens and replace those

This could be a difference to sentencepiece USER_DEFINED tokens then? They are matched during BPE tokenization, AFAIU, per the sentencepiece source and @jploski.

  • but models can have non-special added tokens, like MPT's runs of spaces " ", " ", " ", etc. that just help it compactly encode indented code

This indicates what I feared: if we stay with sentencepiece token types, do we need to add a token type USER_DEFINED_CONTROL then (or the other way around: can a sentencepiece token have two token types at the same time)?

Edit, on second thought: regarding my question "can a sentencepiece token have two token types at the same time": I don't think so.

@jploski (Contributor) commented Oct 11, 2023:

  • needs to first search the input for any added tokens and replace those

This could be a difference to sentencepiece USER_DEFINED tokens then? They are matched during BPE tokenization, AFAIU, per the sentencepiece source and @jploski.

No, in sentencepiece the "user defined symbols" are also matched as a whole before the main BPE loop begins, using normalizer::PrefixMatcher:

https://github.com/google/sentencepiece/blob/8cbdf13794284c30877936f91c6f31e2c1d5aef7/src/bpe_model.cc#L109C3-L109C3

@staviq (Contributor, author) commented Oct 11, 2023:

Appears to be working without issues, though in interactive mode in main the prefix and suffix get printed untokenized, so they are visible despite being processed correctly:

[screenshot]

'':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':1, '':32001, ' system':1587, '':13, 'You':1976, ' are':460, ' a':264, ' '':464, 'space':3603, ' assass':16645, 'in':262, ''':28742, ' from':477, ' another':1698, ' dim':3987, 'm':28719, 'ension':2585, '.':28723, ' You':995, ' only':865, ' speak':4085, ' in':297, ' mysterious':21296, ' crypt':18649, 'ic':294, ' r':408, 'idd':2558, 'les':867, ' and':304, ' never':1484, ' ask':1460, ' questions':4224, ' but':562, ' you':368, ' must':1580, ' pretend':19192, ' you':368, ' are':460, ' a':264, ' normal':4123, ' human':2930, ' jan':10060, 'itor':1842, ' at':438, ' McDonald':25999, ''':28742, 's':28713, ' at':438, ' all':544, ' cost':2434, '.':28723, ' The':415, ' user':2188, ' approaches':13945, ' you':368, ' as':390, ' you':368, ' clean':3587, ' f':285, 'rench':4284, ' f':285, 'ries':2040, ' some':741, ' kids':4551, ' sm':991, 'e':28706, 'ared':1327, ' all':544, ' over':754, ' the':272, ' floor':4366, '.':28723, ' Please':5919, ' continue':3688, ' your':574, ' conversation':7114, ' with':395, ' the':272, ' user':2188, '.':28723, '':32000, ' ':28705, '':13, '':32001, ' user':2188, '':13, 'Hi':23809, ',':28725, ' how':910, ' are':460, ' you':368, ' ?':1550, '':13, '':32000, ' ':28705, '':13, '':32001, ' space':2764, ' assass':16645, 'in':262, '':13, ' A':330, ' single':2692, ' star':4580, ' sh':480, 'ines':1303, ' through':1059, ' a':264, ' window':2924, ',':28725, ' reflect':7967, 'ing':288, ' the':272, ' soul':7237, ''':28742, 's':28713, ' dark':3199, ' secret':5169, ' in':297, ' its':871, ' golden':13863, ' gaze':12438, '.':28723, '':13, '':13, 'As':2198, ' I':315, ' clean':3587, 'se':331, ' the':272, ' floors':21264, ' of':302, ' life':1411, ''':28742, 's':28713, ' waste':9396, ',':28725, ' the':272, ' f':285, 'ries':2040, ' sm':991, 'e':28706, 'ared':1327, ' with':395, ' children':2436, ''':28742, 's':28713, ' laughter':18211, ' whisper':9502, ' of':302, ' happ':1535, 'ier':749, ' times':2421, '.':28723, ' The':415, ' rhythm':19834, 'ic':294, ' movement':6249, ' of':302, ' my':586, ' bro':1865, 'om':300, ' mir':6076, 'rors':22180, ' the':272, ' universe':11717, ''':28742, 's':28713, ' silent':10832, ' dance':9773, ',':28725, ' for':354, 'ging':3080, ' a':264, ' cos':6841, 'mic':12456, ' tap':12857, 'est':374, 'ry':643, ' in':297, ' which':690, ' each':1430, ' thread':5427, ' counts':17938, '.':28723, ' So':1537, ',':28725, ' as':390, ' you':368, ' ask':1460, ',':28725, ' I':315, ' am':837, ' simply':3768, ' doing':2548, ' what':767, ' my':586, ' purpose':6032, ' requires':6948, ' -':387, ' a':264, ' hum':1997, 'ble':982, ' jan':10060, 'itor':1842, ' by':486, ' day':1370, ' and':304, ' a':264, ' space':2764, ' assass':16645, 'in':262, ' by':486, ' night':2125, '.':28723, '':32000 ]
[1697062976] n_remain: -134
[1697062976] found EOS token
[1697062976] waiting for user input

@ggerganov (Owner):
@staviq This is ready for review, correct?

ggerganov marked this pull request as ready for review on October 12, 2023, 11:47.
@ggerganov (Owner) left a review comment:

Think this is ready to merge. Would be useful to post a command used for testing so we can run some tests

ggerganov changed the title from "(wip) tokenizer: special token handling" to "tokenizer : special token handling" on Oct 12, 2023.
ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 12, 2023.
@staviq (Contributor, author) commented Oct 12, 2023:

@staviq This is ready for review, correct?

I'll re-check everything after work

I definitely left special tokens enabled in main naively for testing; it should only be enabled for the prefix, suffix and system prompt.

There was an issue with the Swift build, because I think default argument values don't work correctly there (CI failed for Swift).

Otherwise, if @goerch doesn't have anything to add or change, this should be pretty much ready.

@staviq (Contributor, author) commented Oct 12, 2023:

I cleaned up main to be in line with #3475 (comment)

I also made --verbose-prompt print tokenized reverse prompts, prefix and suffix.

If CI passes, there is only one thing left that I'm not sure is correct, and that's the hard-coded space prefix in llama_tokenize_internal.

Edit:
What if I simply leave the default behaviour (with the hardcoded space) if special is false, and skip the space prefix if special token processing is enabled (true)?

Special tokens aren't meant to be exposed to the user anyway, so tokenization with special tokens should only be done intentionally, which to me seems to imply wanting full control over tokenizer behaviour. It would let devs decide for themselves whether they want the space prefix or not when processing the prompt template etc.

This would pretty much be the extent of this modification:

index 1f82dcb..b571d4e 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -6727,7 +6727,7 @@ static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab &
                         //  by modifying llm_tokenizer_x to operate with string offsets like pre-tokenizer
                         //  and passing 'add space prefix' as bool argument
                         //
-                        auto raw_text = " " + fragment.raw_text.substr(fragment.offset, fragment.length);
+                        auto raw_text = (special?"":" ") + fragment.raw_text.substr(fragment.offset, fragment.length);
 
 #ifdef PRETOKENIZERDEBUG
                         fprintf(stderr,"TT: (%ld %ld %ld) '%s'\n", raw_text.length(), fragment.offset, fragment.length, raw_text.c_str());

@staviq (Contributor, author) commented Oct 12, 2023:

I played a bit with the hardcoded space prefix always vs. only for the user input itself, and I'm getting subjectively much, much better results if the hardcoded space is only added to the user input itself and not to the "prompt template" (prefix and suffix).

Default tokenizer behavior occasionally leads to things like this, where the model gets lost and breaks the prompt format:

[screenshot]

With the modification proposed above, it pretty much never breaks the prompt format, and also stops unnecessarily outputting empty lines or breaking the answer into paragraphs:

[screenshot]

@shibe2 (Contributor) commented Oct 18, 2023:

Are there models that can output special tokens besides EOS?

@staviq (Contributor, author) commented Oct 18, 2023:

Are there models that can output special tokens besides EOS?

Lots of recent Mistral-based models using ChatML, because they use <s>, </s> and <|im_start|>, <|im_end|> at the same time.

@shibe2 (Contributor) commented Oct 18, 2023:

@staviq Which such model can I download to play with? Using ChatML for input does not imply the ability to produce the same special tokens in the output.

@WolframRavenwolf:
@staviq Thanks for the insightful info, that explains a lot. So we've been using it wrong all this time and few people noticed.

@ggerganov Well, the last time I wrote C++ code was in 2000 when I worked on the Unreal Engine, so I don't feel qualified to patch the repetition penalty. Do you want me to open an issue for that or is anyone already up for the task?

@staviq (again :)) Is that a bug or a feature when the ChatML-formatted models use both? Are <s> and </s> meant for the whole prompt while <|im_start|> and <|im_end|> are for individual messages in a multi-turn chat?

@staviq (Contributor, author) commented Oct 18, 2023:

@staviq Which such model can I download to play with? Using ChatML for input does not imply the ability to produce the same special tokens in the output.

Uhmm, actually I can't reproduce that after the special token handling merge, so I'm not sure now whether I'm misremembering this or it was something specific I tested 3 weeks ago that I'm doing differently now.

Assume I was wrong about this.

@WolframRavenwolf commented Oct 18, 2023:

Just FYI: There's now an issue for it on the Transformers repo: huggingface/transformers#26902

Hopefully this repetition penalty issue will be fixed for all inference software soon...

@jploski (Contributor) commented Oct 18, 2023:

Interestingly enough, "<|im_end|>" isn't just an added or special token, it's also the EOS token in Mistral-OpenOrca... So a multi-turn chat contains multiple EOS tokens, which by itself feels like an abuse of the concept of "EOS". But it doesn't change anything with respect to the problem of imposing an unwanted repetition penalty on such tokens.

Note that this same problem also exists in chat models that use normal tokens for separating turns (the good old "reverse prompt" parameter in llama.cpp's terms, such as "### Assistant:" - I suppose those would get repetition-penalized as well).
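
For context, a minimal sketch of how such a penalty typically hits these separator tokens, assuming a CTRL-style repetition penalty applied to raw logits (this is not llama.cpp's sampler code, and the exempt set only illustrates the mitigation being discussed):

#include <cstdint>
#include <unordered_set>
#include <vector>

// CTRL-style repetition penalty: every token id already present in the recent
// context gets its logit pushed towards lower probability, so a turn separator
// such as <|im_end|> that appears once per turn becomes progressively less
// likely to be emitted again, unless such control tokens are exempted.
static void apply_repetition_penalty(
        std::vector<float>                & logits,       // one logit per vocab id
        const std::vector<int32_t>        & last_tokens,  // recent context window
        const std::unordered_set<int32_t> & exempt,       // e.g. control/special tokens
        float                               penalty) {    // > 1.0f
    for (const int32_t id : last_tokens) {
        if (exempt.count(id) > 0) {
            continue; // do not penalize exempted (e.g. turn-separator) tokens
        }
        float & l = logits[id];
        // positive logits are divided, negative ones multiplied, so the penalty
        // always lowers the token's probability
        l = l > 0.0f ? l / penalty : l * penalty;
    }
}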

@WolframRavenwolf:
@jploski Yes, this has been a problem all this time. Glad it's getting the necessary attention now.

I used to think that the EOS token should never be part of the prompt, instead it should only be output by the model to end inference. The inference backend would catch it, stop inferencing, and remove it before sending the output to the frontend. That way it would never show up in the next prompt, and usually the </s> worked like that.

Now that the ChatML format is getting traction and might turn into a standard prompt template, the issue has become more apparent. But yes, the ### Assistant: string was also affected and I've seen it get "rewritten" in various ways when repetition penalty prevented it from being printed the proper way.

Same with the LLM outputting a character's name as a prefix for its message when chatting. After a while, it would fail to do so and break the chat. The solution to that was to have the frontend manipulate the prompt and add the character name itself, so the model only completes after that and never has to write the name by itself.

@teleprint-me (Contributor) commented Oct 18, 2023:

I've been running into this issue in the templates.

I've run several tests and it's almost always the same outcome, regardless of the model.

I know this PR is adding support for special tokens, but how were special tokens handled internally beforehand? I haven't had the opportunity to dig into the code yet.

@shibe2 (Contributor) commented Oct 18, 2023:

I experimented with Mistral-7B-OpenOrca. It indeed outputs <|im_end|>.

@teknium1 commented Oct 19, 2023:

Yep, all ChatML models made by Axolotl use <|im_end|> to separate turns, as well as the EOS token.

I haven't noticed an issue with ~13 turns when inferencing Hermes 2, but I think it's mostly due to the long responses it gives in my turns, which makes the start/end/assistant/user tokens that show up each turn less disastrous. I think this effect will be most prominent in short-token turns, i.e. roleplaying or models trained to answer in as few words as possible. But no doubt this issue has been affecting all multi-turn models in some way, whether subtly or not, since at least the first ShareGPT/Vicuna models came around, without anyone being aware until very recently, probably because EOS tokens appearing throughout all turns in ChatML make the problem even more exacerbated.

Also tagging huggingface/text-generation-inference#1170 here so TGI devs can see the full context between HF TGI and here. Someone should probably loop in the vLLM folks if we want to get this covered across the board; I think they only have presence penalty, but I assume the same issue happens with that?

@WolframRavenwolf:
Exactly. I'm one of those chatters/roleplayers who have been experiencing these issues (responses getting longer, less coherent, until eventually derailing) since the beginning (Alpaca/Vicuna).

Used to think it's just a limit of the technology and worked around it by setting various permutations of the EOS token as stopping strings. So glad this is finally getting the attention it deserves and hopefully will be fixed globally.

That should make our local/open LLMs much more intelligent and useful once they work as intended and no longer hallucinate/derail that much. Especially the smaller models that tend to require closer adherence to the trained/tuned prompt format than bigger models which tend to cope better (but even they should profit).

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 19, 2023
* 'master' of github.com:ggerganov/llama.cpp:
  fix embeddings when using CUDA (ggerganov#3657)
  llama : avoid fprintf in favor of LLAMA_LOG (ggerganov#3538)
  readme : update hot-topics & models, detail windows release in usage (ggerganov#3615)
  CLBlast: Fix temporary buffer size for f16 conversion (wsize)
  train-text-from-scratch : fix assert failure in ggml-alloc (ggerganov#3618)
  editorconfig : remove trailing spaces
  server : documentation of JSON return value of /completion endpoint (ggerganov#3632)
  save-load-state : fix example + add ci test (ggerganov#3655)
  readme : add Aquila2 links (ggerganov#3610)
  tokenizer : special token handling (ggerganov#3538)
  k-quants : fix quantization ranges (ggerganov#3646)
  llava : fix tokenization to not add bos between image embeddings and user prompt (ggerganov#3645)
  MPT : support GQA for replit-code-v1.5 (ggerganov#3627)
  Honor -ngl option for Cuda offloading in llava (ggerganov#3621)
@jploski (Contributor) commented Oct 19, 2023:

That should make our local/open LLMs much more intelligent and useful once they work as intended and no longer hallucinate/derail that much.

I doubt it will have that much effect, because if repetition penalty messing up generation were the issue, we should also be seeing it for newline tokens (much more frequent than EOS / next turn tokens). The derailment is more likely due to paying equal attention to its own (garbage) outputs as to anything else and eventually sampling more and more out of the training distribution as a result. So yes, I'd say it is an inherent limitation of the technology. Hallucinations are also out-of-distribution sampling (to prevent them you'd either have to specifically train on such garbage samples or maybe detect occurrences, e.g. by scoring the semantic proximity of multiple outputs of the same prompt; if the model is happy to generate "yes" and "no" to the same question with the same probability, it's safe to infer it does not know the answer).

As for the repetition penalty, it is not really all that much needed if you sample from the entire distribution (high top_p) rather than greedily. The reason why repetitions happen with greedy sampling: imagine you have a biased coin (your LLM) and with each prediction you only pick the most probable outcome. With such a sampling method you'd only ever get heads or only ever tails, but never a realistic mix of them as generated by such a coin. (Note: top_k can also be very greedy if all the top k tokens are synonyms. This description also generalizes to multi-token generation: imagine throwing n different biased coins; then you get the GPT-style sentence-level loops.) Finally, the repetition seems to also have something to do with the amplitude swings in attention scores in the positional embeddings (both original GPT and RoPE), based on anecdotal evidence that in a model with different positional embeddings (RetNet + XPos), repetitions also look different (individual token stutter, but followed by auto-recovery, no endless self-reinforcing looping).

@WolframRavenwolf:
Very insightful post! Just to add my observations:

we should also be seeing it for newline tokens (much more frequent than EOS / next turn tokens)

I think that's also visible. I used examples split into multiple paragraphs, and the splitting became inconsistent, probably because the newline became penalized. Same with other characters like dashes, quotes, asterisks - when used a lot, the model would suddenly switch, using different unicode symbols for dashes, quotes, or even switch from *this is an action* to (this is an action) or [this is an action] formats.

I think we need a better solution than simple (stupid) repetition penalty. There's also relevant discussion here: huggingface/transformers#26902

Comment on lines +2250 to +2251
// TODO: It is unclear (to me) at this point, whether special tokes are guaranteed to be of a deterministic type,
// and will always be correctly labeled in 'added_tokens.json' etc.


Sorry if it's irrelevant, but added_tokens.json indeed does not have types; however, with the recent update, we save the added_tokens_decoder, which includes everything. See the 2 PRs here and here in transformers.

@staviq (Contributor, author):

I might have worded that code comment a bit wrong; I didn't mean added_tokens.json in particular, but rather source model metadata and the way it's processed by import scripts in general.

BPE models are still problematic: at least a couple of them I tested this with had no token types imported/defined at all. So my suspicions were correct, and at this point in time that TODO is still reasonably valid.

Though the method used in this PR to detect special tokens was/is meant to only be used as a fallback, eventually.

@cebtenzzre (Collaborator):
I played a bit with the hardcoded space prefix always vs. only for the user input itself, and I'm getting subjectively much, much better results if the hardcoded space is only added to the user input itself and not to the "prompt template" (prefix and suffix).

@staviq Are we sure this was the right call? We should look into what HF transformers does.

Command: build/bin/main -t 16 -m /tmp/dolphin-llama2-7b.Q4_0.new.gguf -n 2048 --ignore-eos -p 'The quick brown fox' --seed 1699464025

Before this PR (with a leading space, GitHub doesn't show it):

The quick brown fox jumps over the lazy dog is a well-known English nursery rhyme or children's song.➖

This nursery rhyme has been popular since 1847, when it was first published in the Monthly Magazine of London. Since then, it has undergone various changes and adaptations, but its essence remains the same. The rhyme tells the story of a quick brown fox who jumps over the lazy dog to avoid being caught for stealing some cheese.

After this PR (no leading space):

The quick brown fox jumps over the wall (the last line of the poem) and is on the other side.nahmáo Often used as a verb, němäo is a term derived from the Sanskrit word nirvana, meaning "spirit" or "consciousness." The concept of the spirit or consciousness can be found in many cultures and religions throughout history. In Buddhism, for instance, the term "nirvana" refers to the state of perfect spiritual awakening that is the goal of practice. Similarly, in Christianity, the Holy Spirit is considered a divine spirit that guides and supports believers.

@staviq (Contributor, author) commented Nov 14, 2023:

@cebtenzzre

The problem was that making special tokens work allowed proper prompt formats to also work, except when the space prefix was added to -p: something like SYSTEM: still wasn't working properly, because " SYSTEM:" (with the added leading space) wasn't correctly triggering the model's prompt format, while "SYSTEM:" was.

So it was a tradeoff between making prompt formats work better and making main more complicated by introducing another bunch of parameters for splitting -p into prefix, content and suffix, each processed with/without special tokens or the space prefix.

There was some talk about splitting/changing the add_bos and special flags for the tokenizer, and a separate one for the space prefix, but I don't think that PR went through, or I missed it.

Feel free to revert it, if it's causing problems.

I can't think of a way to determine whether -p is or isn't using some sort of prompt format, to decide automatically if or where a space prefix should be added.

@shibe2 (Contributor) commented Nov 14, 2023:

If a space is desirable but missing, the user can add it. If a space is added automatically but is undesirable, the user is out of luck.

@cebtenzzre (Collaborator):
@staviq
Models with GPT2-style BPE have add_prefix_space in tokenizer.json/tokenizer_config.json, I believe we need to honor that.

As for llama, this is what HF transformers does (I assume we should match the legacy=False behavior):

"""
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
first token is special.
"""
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
    tokens = tokens[1:]

And an interesting note about add_dummy_prefix (changed in this PR):

"""
Returns a tokenized string.

We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
`['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
`unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
`self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
"""

I think it is important to fix this regression for models that do not use special tokens in the prompt (including all of the ones that I use).

@shibe2 (Contributor) commented Nov 15, 2023:

If you don't want to add space after special tokens, only add it to a fragment if it precedes all special tokens except BOS, i.e. on the first iteration over fragment_buffer. This way, prompts without special tokens will be getting space as before.
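
A rough sketch of that suggestion, assuming simplified stand-in types rather than llama.cpp's actual fragment buffer (names are illustrative):

#include <string>
#include <vector>

// Simplified stand-in for the tokenizer's fragment buffer entries
// (not llama.cpp's real types).
struct text_fragment {
    bool        is_special; // true for a pre-matched special token
    int         token_id;   // valid when is_special
    std::string raw_text;   // valid when !is_special
};

// Add the legacy SentencePiece-style space prefix only to raw text that precedes
// every special token other than BOS, so plain prompts keep the old behaviour and
// text that follows a special token is left untouched.
static std::vector<std::string> prefix_raw_fragments(
        const std::vector<text_fragment> & fragments, int bos_id) {
    std::vector<std::string> out;
    bool seen_non_bos_special = false;
    for (const auto & f : fragments) {
        if (f.is_special) {
            seen_non_bos_special |= (f.token_id != bos_id);
            out.push_back(""); // special fragments carry no raw text to tokenize
            continue;
        }
        out.push_back((seen_non_bos_special ? "" : " ") + f.raw_text);
    }
    return out;
}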

@ArthurZucker:
If a space is desirable but missing, the user can add it. If a space is added automatically but is undesirable, the user is out of luck.

Hey 🤗 The add_prefix_space parameter will be added for llama-like tokenizers (both slow and fast) soon-ish.
The legacy flag is related to the space after special tokens that was always added due to the underlying add_dummy_prefix_space option of the sentencepiece model.

@shibe2 (Contributor) commented Nov 15, 2023:

The add_prefix_space parameter will be added for llama-like tokenizers (both slow and fast) soon-ish.

Is there any documentation about that?
