Stop the generation when <|eom_id|> token is encountered (needed for llama 3.1 tool call support) #8858
This PR adds support for the <|eom_id|> token introduced by the llama 3.1 models. It adds the EOM token to the list of tokens that stop generation. This is necessary for proper tool call support; see https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/ for more details.
Note that it doesn't add any tool call support to llama.cpp itself; it only stops generation after <|eom_id|>, allowing tool calls to be implemented in other software that uses llama.cpp for inference.
I don't feel confident enough to tinker with `LlamaModel::set_vocab()` in the `convert_hf_to_gguf.py` script to explicitly set the EOM token value during conversion, so it is currently found during vocabulary loading, like the EOT tokens.

I created a simple script that allows testing llama 3.1 tool calling with llama-server: https://github.com/fairydreaming/tlcl
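For illustration, here is a minimal sketch of the kind of round trip such a client script performs, assuming llama-server is running locally on port 8080, that the `/completion` endpoint is called with the `prompt`, `n_predict` and `content` fields, and that the prompt follows the llama 3.1 format from the model card linked above. None of this code is part of this PR; it only relies on generation now stopping at <|eom_id|>.

```python
#!/usr/bin/env python3
"""Illustrative llama 3.1 tool-call round trip against llama-server.

Assumptions: llama-server listens on localhost:8080, the /completion endpoint
accepts {"prompt", "n_predict"} and returns {"content"}, and the prompt layout
follows the llama 3.1 model card. Adjust to your setup as needed.
"""
import json
import requests

SERVER = "http://localhost:8080/completion"  # assumed llama-server address


def generate(prompt: str, n_predict: int = 512) -> str:
    # With this PR, the server also stops on <|eom_id|>, so the returned
    # content ends right after the model's tool-call request.
    resp = requests.post(SERVER, json={"prompt": prompt, "n_predict": n_predict})
    resp.raise_for_status()
    return resp.json()["content"]


# System prompt enabling the built-in tool-calling environment.
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Environment: ipython<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "What is 2 ** 32?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# First pass: the model answers with a <|python_tag|>... tool call and
# generation stops at <|eom_id|> instead of running past it.
tool_call = generate(prompt)
print("model requested:", tool_call)

# The caller executes the requested code/tool itself, feeds the result back
# in an "ipython" message, and continues generation to get the final answer,
# which ends with <|eot_id|>.
tool_output = "4294967296"  # placeholder for the real tool result
prompt += (
    tool_call + "<|eom_id|>"
    "<|start_header_id|>ipython<|end_header_id|>\n\n"
    + json.dumps({"output": tool_output}) + "<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
print("final answer:", generate(prompt))
```

Without the stop on <|eom_id|>, the first `generate()` call would keep producing tokens past the tool-call request, so the client could not cleanly intercept it and inject the tool result.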