Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Added an example that extends
simple
with different token healing strategies. Check the added README for examples.Token healing works by chopping off some tokens from the tokenized prompt and then constraining the decoding to match the bytes of the removed tokens.
Currently, I am just using for loops for prefix searching, but performance could potentially be improved with a prefix tree (+ caching mentioned in https://arxiv.org/abs/2403.08688).
To finish #5765, we still need to include token healing into
server
. I think the approach is to extendllama_sampling_sample
and modify the initial input tokens?