
Add token healing to main and server #7187

Open · mare5x wants to merge 10 commits into master from token-healing-main

Conversation

@mare5x commented May 9, 2024

Add different token healing strategies to main and server. Token healing works by chopping off some tokens from the tokenized prompt and then constraining the decoding to match the bytes of the removed tokens.
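To make the mechanism concrete, here is a minimal sketch of the idea in Python. The toy vocabulary, greedy tokenizer, and helper names below are invented for illustration only; this is not the actual llama.cpp implementation.

# Toy illustration of token healing: roll back trailing prompt tokens, then
# constrain decoding so the regenerated text starts with the removed bytes.

# Hypothetical toy vocabulary (id -> piece); a real model has tens of thousands of pieces.
vocab = {0: "print", 1: "('", 2: "Hello", 3: ",", 4: " wo", 5: " world", 6: "rld", 7: "!')"}

def tokenize(text):
    # Greedy longest-match tokenizer over the toy vocab.
    ids, i = [], 0
    while i < len(text):
        tid = max((t for t, p in vocab.items() if text.startswith(p, i)),
                  key=lambda t: len(vocab[t]))
        ids.append(tid)
        i += len(vocab[tid])
    return ids

def heal_prompt(prompt, n_rollback=1):
    # Remove the last n_rollback tokens; return the kept ids and the removed bytes.
    ids = tokenize(prompt)
    kept, removed = ids[:-n_rollback], ids[-n_rollback:]
    return kept, "".join(vocab[t] for t in removed)

def allowed_tokens(prefix):
    # Candidates compatible with the removed bytes: the piece and the prefix
    # must agree on their overlap, so decoding re-covers the removed text.
    return [t for t, p in vocab.items()
            if p.startswith(prefix) or prefix.startswith(p)]

kept, prefix = heal_prompt("print('Hello, wo", n_rollback=1)
print(kept, repr(prefix))      # prompt tokens without the dangling " wo"
print(allowed_tokens(prefix))  # sampling is first restricted to these candidates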

Example usage:

# with token healing
$ ./llama-cli -m ./models/phi-2/ggml-model-q4_0.gguf --temp 0 -p "print('Hello, wo" -th d
print('Hello, world!')

# without token healing
$ ./llama-cli -m ./models/phi-2/ggml-model-q4_0.gguf --temp 0 -p "print('Hello, wo" -th 0
print('Hello, woof!')

Server usage:

prompt='print(\"Hello, Wo'

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '
    {
      "prompt": '\""${prompt}"\"',
      "n_predict": 10,
      "temperature": 0
    }
' | jq '. | {prompt, content}'

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '
    {
      "prompt": '\""${prompt}"\"',
      "n_predict": 10,
      "temperature": 0,
      "token_healing": "d"
    }
' | jq '. | {prompt, content}'
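For reference, the same two requests can be sent from Python instead of curl. This is a sketch assuming the server is listening on localhost:8080 as above; it uses the third-party requests package and only the fields shown in this PR (prompt, n_predict, temperature, token_healing).

import requests

prompt = 'print("Hello, Wo'

# One request without token healing and one with the dynamic ("d") strategy.
for extra in ({}, {"token_healing": "d"}):
    payload = {"prompt": prompt, "n_predict": 10, "temperature": 0, **extra}
    resp = requests.post("http://localhost:8080/completion", json=payload)
    resp.raise_for_status()
    data = resp.json()
    print({"prompt": data.get("prompt"), "content": data.get("content")})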

Solves #5765.

@mare5x mentioned this pull request on May 9, 2024
@mofosyne added the enhancement (New feature or request) and Review Complexity : High (Generally requires in-depth knowledge of LLMs or GPUs) labels on May 10, 2024
@mofosyne marked this pull request as draft on May 15, 2024
@mare5x force-pushed the token-healing-main branch 2 times, most recently from 8c44086 to c9b2297, on May 22, 2024
@mare5x marked this pull request as ready for review on May 22, 2024
@ggerganov (Owner) left a comment

Looks OK, though I haven't run any tests. Would appreciate it if people play with this and provide some feedback before merging.

Overall, I'm a bit concerned that the logic for managing the sampling state and the KV cache in the examples starts to become very convoluted when we add these kinds of techniques (e.g. context shifting, self-extend, templates, speculation, ...). We have to start refactoring this and simplifying it.

It's not specific to this PR, just a general thought I've had lately. I will try to prioritise this a bit, though I don't yet have a very good idea of what specifically needs to be done, so any ideas and help from others on how to simplify the code are appreciated.

@mofosyne added the need feedback (Testing and feedback with results are needed) and help wanted (Extra attention is needed) labels on May 25, 2024
@mare5x force-pushed the token-healing-main branch from 8abdf56 to ffca1bb on July 1, 2024
@mare5x changed the title from "main : add token healing" to "Add token healing to main and server" on Jul 1, 2024
@github-actions bot added the server label on Jul 1, 2024
@mare5x requested a review from ggerganov on July 1, 2024
@mare5x force-pushed the token-healing-main branch from ffca1bb to 50af2fc on July 8, 2024
Review thread on examples/server/README.md (outdated, resolved)
@mare5x force-pushed the token-healing-main branch from 2ffe10a to b27f87d on August 10, 2024
@mare5x (Author) commented Aug 10, 2024

I am sharing my results of evaluating token healing with a simple evaluation script that uses either llama-cli or the server's /completion endpoint.

  • model: tiny_starcoder_py.Q8_0.gguf (164M params)
  • dataset: RepoEval
    • line: line-level RepoEval dataset (predict one full line of code), 1k context size
    • lineR: pass --randomize_target to the eval script (start completing from a random position in the target line)
  • generation args: -c 1536 --temp 0 -n 25 for cli and '{"temperature": 0, "n_predict": 25, "stop": ["\n"]}' for server (+ token healing option)
  • metrics: exact match (EM); edit similarity (ES); a sketch of how these are typically computed follows the results table
         token healing   line EM   line ES   lineR EM   lineR ES
server   none (0)           2.81     24.18      21.31      48.75
server   1                 25.06     51.25      41.19      63.88
server   d1                25.06     51.25      46.00      66.78
server   r3                24.56     50.98      48.00      69.01
server   d                 25.06     51.25      48.25      69.17
cli      none (0)           3.62     26.46      20.06      48.41
cli      d                 25.06     52.02      45.56      67.88
  • With line, all token healing options simply roll back the last \n, which is why the token healing results are similar.
  • d > r3 > d1 > 1
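For context, here is a minimal sketch of how the two metrics are commonly computed. The exact evaluation script is not included in this thread, so the definitions below are assumptions: EM as string equality after stripping whitespace, and ES as a Levenshtein-based similarity reported as a percentage, as is typical for RepoEval-style line-completion benchmarks.

def exact_match(pred: str, target: str) -> float:
    # EM: 1 if the stripped prediction equals the stripped target, else 0.
    return float(pred.strip() == target.strip())

def edit_similarity(pred: str, target: str) -> float:
    # ES: 1 - normalized Levenshtein distance, as a percentage.
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 100.0
    dist = list(range(n + 1))  # dynamic-programming edit distance, one row at a time
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = min(dist[j] + 1,                            # deletion
                      dist[j - 1] + 1,                        # insertion
                      prev + (pred[i - 1] != target[j - 1]))  # substitution
            prev, dist[j] = dist[j], cur
    return 100.0 * (1.0 - dist[n] / max(m, n))

print(exact_match("print('Hello, world!')", "print('Hello, world!')"))  # 1.0
print(round(edit_similarity("print('Hello, woof!')", "print('Hello, world!')"), 2))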

@ggerganov mentioned this pull request on Aug 15, 2024
Labels: enhancement · examples · help wanted · need feedback · Review Complexity : High · server
4 participants