
Add token healing to main and server #7187

Open · mare5x wants to merge 10 commits into master from token-healing-main

Conversation

@mare5x commented May 9, 2024

Add different token healing strategies to main and server. Token healing works by chopping off some tokens from the tokenized prompt and then constraining the decoding to match the bytes of the removed tokens.
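To make the mechanism concrete, here is a minimal sketch of the idea in Python. The toy vocabulary, greedy tokenizer, and helper names below are invented for illustration only; this is not the actual llama.cpp implementation.

# Toy illustration of token healing: roll back trailing prompt tokens, then
# constrain decoding so the regenerated text starts with the removed bytes.

# Hypothetical toy vocabulary (id -> piece); a real model has tens of thousands of pieces.
vocab = {0: "print", 1: "('", 2: "Hello", 3: ",", 4: " wo", 5: " world", 6: "rld", 7: "!')"}

def tokenize(text):
    # Greedy longest-match tokenizer over the toy vocab.
    ids, i = [], 0
    while i < len(text):
        tid = max((t for t, p in vocab.items() if text.startswith(p, i)),
                  key=lambda t: len(vocab[t]))
        ids.append(tid)
        i += len(vocab[tid])
    return ids

def heal_prompt(prompt, n_rollback=1):
    # Remove the last n_rollback tokens; return the kept ids and the removed bytes.
    ids = tokenize(prompt)
    kept, removed = ids[:-n_rollback], ids[-n_rollback:]
    return kept, "".join(vocab[t] for t in removed)

def allowed_tokens(prefix):
    # Candidates compatible with the removed bytes: the piece and the prefix
    # must agree on their overlap, so decoding re-covers the removed text.
    return [t for t, p in vocab.items()
            if p.startswith(prefix) or prefix.startswith(p)]

kept, prefix = heal_prompt("print('Hello, wo", n_rollback=1)
print(kept, repr(prefix))      # prompt tokens without the dangling " wo"
print(allowed_tokens(prefix))  # sampling is first restricted to these candidates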

Example usage:

# with token healing
$ ./llama-cli -m ./models/phi-2/ggml-model-q4_0.gguf --temp 0 -p "print('Hello, wo" -th d
print('Hello, world!')

# without token healing
$ ./llama-cli -m ./models/phi-2/ggml-model-q4_0.gguf --temp 0 -p "print('Hello, wo" -th 0
print('Hello, woof!')

Server usage:

prompt='print(\"Hello, Wo'

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '
    {
      "prompt": '\""${prompt}"\"',
      "n_predict": 10,
      "temperature": 0
    }
' | jq '. | {prompt, content}'

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '
    {
      "prompt": '\""${prompt}"\"',
      "n_predict": 10,
      "temperature": 0,
      "token_healing": "d"
    }
' | jq '. | {prompt, content}'
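For reference, the same two requests can be sent from Python instead of curl. This is a sketch assuming the server is listening on localhost:8080 as above; it uses the third-party requests package and only the fields shown in this PR (prompt, n_predict, temperature, token_healing).

import requests

prompt = 'print("Hello, Wo'

# One request without token healing and one with the dynamic ("d") strategy.
for extra in ({}, {"token_healing": "d"}):
    payload = {"prompt": prompt, "n_predict": 10, "temperature": 0, **extra}
    resp = requests.post("http://localhost:8080/completion", json=payload)
    resp.raise_for_status()
    data = resp.json()
    print({"prompt": data.get("prompt"), "content": data.get("content")})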

Solves #5765.

@mare5x mentioned this pull request on May 9, 2024
@mofosyne added the enhancement (New feature or request) and Review Complexity : High (Generally requires in-depth knowledge of LLMs or GPUs) labels on May 10, 2024
@mofosyne marked this pull request as draft on May 15, 2024
@mare5x force-pushed the token-healing-main branch 2 times, most recently from 8c44086 to c9b2297, on May 22, 2024
@mare5x marked this pull request as ready for review on May 22, 2024
@ggerganov (Owner) left a comment

Looks OK, though I haven't run any tests. Would appreciate it if people play with this and provide some feedback before merging.

Overall, I'm a bit concerned that the logic for managing the sampling state and the KV cache in the examples starts to become very convoluted when we add these kinds of techniques (e.g. context shifting, self-extend, templates, speculation, ...). We have to start refactoring this and simplifying it.

It's not specific to this PR, just a general thought I've had lately. I will try to prioritise this a bit, though I don't yet have a very good idea of what specifically needs to be done, so any ideas and help from others on how to simplify the code are appreciated.

@mofosyne added the need feedback (Testing and feedback with results are needed) and help wanted (Extra attention is needed) labels on May 25, 2024
@mare5x force-pushed the token-healing-main branch from 8abdf56 to ffca1bb on July 1, 2024
@mare5x changed the title from "main : add token healing" to "Add token healing to main and server" on Jul 1, 2024
@github-actions bot added the server label on Jul 1, 2024
@mare5x requested a review from ggerganov on July 1, 2024
@mare5x force-pushed the token-healing-main branch from ffca1bb to 50af2fc on July 8, 2024
Review thread on examples/server/README.md (outdated, resolved)
@mare5x force-pushed the token-healing-main branch from 2ffe10a to b27f87d on August 10, 2024
@mare5x (Author) commented Aug 10, 2024

I am sharing my results of evaluating token healing with a simple evaluation script that uses either llama-cli or the server's /completion endpoint.

  • model: tiny_starcoder_py.Q8_0.gguf (164M params)
  • dataset: RepoEval
    • line: line-level RepoEval dataset (predict one full line of code), 1k context size
    • lineR: pass --randomize_target to the eval script (start completing from a random position in the target line)
  • generation args: -c 1536 --temp 0 -n 25 for cli and '{"temperature": 0, "n_predict": 25, "stop": ["\n"]}' for server (+ token healing option)
  • metrics: exact match (EM); edit similarity (ES); a sketch of how these are typically computed follows the results table
         token healing   line EM   line ES   lineR EM   lineR ES
server   none (0)           2.81     24.18      21.31      48.75
server   1                 25.06     51.25      41.19      63.88
server   d1                25.06     51.25      46.00      66.78
server   r3                24.56     50.98      48.00      69.01
server   d                 25.06     51.25      48.25      69.17
cli      none (0)           3.62     26.46      20.06      48.41
cli      d                 25.06     52.02      45.56      67.88
  • With line, all token healing options simply roll back the last \n, which is why the token healing results are similar.
  • d > r3 > d1 > 1
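For context, here is a minimal sketch of how the two metrics are commonly computed. The exact evaluation script is not included in this thread, so the definitions below are assumptions: EM as string equality after stripping whitespace, and ES as a Levenshtein-based similarity reported as a percentage, as is typical for RepoEval-style line-completion benchmarks.

def exact_match(pred: str, target: str) -> float:
    # EM: 1 if the stripped prediction equals the stripped target, else 0.
    return float(pred.strip() == target.strip())

def edit_similarity(pred: str, target: str) -> float:
    # ES: 1 - normalized Levenshtein distance, as a percentage.
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 100.0
    dist = list(range(n + 1))  # dynamic-programming edit distance, one row at a time
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = min(dist[j] + 1,                            # deletion
                      dist[j - 1] + 1,                        # insertion
                      prev + (pred[i - 1] != target[j - 1]))  # substitution
            prev, dist[j] = dist[j], cur
    return 100.0 * (1.0 - dist[n] / max(m, n))

print(exact_match("print('Hello, world!')", "print('Hello, world!')"))  # 1.0
print(round(edit_similarity("print('Hello, woof!')", "print('Hello, world!')"), 2))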

@ggerganov mentioned this pull request on Aug 15, 2024
Labels: enhancement · examples · help wanted · need feedback · Review Complexity : High · server
4 participants