implement prompt template for chat completion #717
Comments
This is a hefty task, with architecture / design elements needed to do it in a clean way. I am too busy to take it on myself right now, but in a couple of weeks I can try if nobody else has done it. |
Hey @ehartford this is actually something I've had in the backlog and just started last night in #711. My plan is to have those format identifiers and also provide a generic class (?) that users can extend to provide a custom chat template. The challenge is that it's not just the prompt that has to be modified but also stop sequences, grammar (in the case of OpenAI-style function calling chats), and a few more things I probably haven't thought about, but I think this is doable. Thank you for the resources btw! |
Awesome, thanks! I think it can be done without requiring the user to write any code, using a clever template system, as implemented by ooba and fastchat. |
Like workaround. I use |
true, |
then, my solution is that I will make a proxy that receives calls to /chat/completions, and rewrites them into a call into llama-cpp-python's /completions endpoint, in order to inject the proper prompt format. |
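For anyone wanting to try the proxy idea above, here is a minimal sketch of the rewriting step only (no HTTP server). The role prefixes in `ROLE_PREFIX` and the function name `chat_to_completion_payload` are assumptions made for illustration; the right prompt format depends on the model being served.

```python
from typing import Any, Dict, List

# Hypothetical role prefixes; swap in whatever the served model was trained on.
ROLE_PREFIX = {
    "system": "### System:",
    "user": "### Human:",
    "assistant": "### Assistant:",
}


def chat_to_completion_payload(chat_request: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a /chat/completions request body into a /completions body."""
    lines: List[str] = []
    for message in chat_request["messages"]:
        prefix = ROLE_PREFIX[message["role"]]
        lines.append(f"{prefix} {message['content']}")
    # Leave the assistant turn open so the model writes the reply.
    lines.append(ROLE_PREFIX["assistant"])

    # Copy every other field (max_tokens, temperature, ...) unchanged.
    payload = {key: value for key, value in chat_request.items() if key != "messages"}
    payload["prompt"] = "\n".join(lines)
    # Stop before the model starts writing the next human turn,
    # unless the caller already supplied stop sequences.
    payload.setdefault("stop", [ROLE_PREFIX["user"]])
    return payload
```

A proxy would call this on the incoming JSON body and forward the result to the /completions endpoint.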
I just ran through some rough drafts with GPT.
Proposal for Advanced Customizable Prompts in Chat Completions
Problem Statement
The existing implementation for chat completions uses hard-coded prompts, constraining customization and flexibility. This limitation becomes evident when adapting the code for specific projects or applications that require unique prompt styles or formats.
PROMPT = chat_history + "### Assistant:"
PROMPT_STOP = ["### Assistant:", "### Human:"]
Proposed Solution
I propose two new optional parameters, prompt and prompt_stop:
def create_chat_completion(
    # ...existing parameters...
    prompt: Optional[str] = None,
    prompt_stop: Optional[List[str]] = None,
):
    # ...
    PROMPT = chat_history + (prompt if prompt else "### Assistant:")
    PROMPT_STOP = prompt_stop if prompt_stop else ["### Assistant:", "### Human:"]
    # ...
Benefits
Backward Compatibility
The proposal maintains backward compatibility since both new parameters are optional and will use existing hard-coded values as defaults.
Suggested Defaults
In the absence of custom prompts, the system could default to prompts styled after Llama-2's structure as a sane default:
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """You are a helpful assistant."""
Practical Examples
my_custom_prompt = ">>> Custom Assistant:"
my_custom_stop = [">>> Custom Assistant:", ">>> Custom User:"]
create_chat_completion(
    messages=...,
    prompt=my_custom_prompt,
    prompt_stop=my_custom_stop,
    # ...other parameters...
)
Related Works
This proposal aims to integrate well with ongoing work in the
I'm not sure if this fits well with what you guys had in mind. Let me know either way. I had the same idea though. |
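To make the prompt / prompt_stop idea concrete, here is a small standalone sketch. It is not the library's code: `build_chat_prompt` is a hypothetical helper, and the way the chat history is joined is an assumption made so the example runs on its own.

```python
from typing import Dict, List, Optional, Tuple

DEFAULT_PROMPT = "### Assistant:"
DEFAULT_STOP = ["### Assistant:", "### Human:"]


def build_chat_prompt(
    messages: List[Dict[str, str]],
    prompt: Optional[str] = None,
    prompt_stop: Optional[List[str]] = None,
) -> Tuple[str, List[str]]:
    """Return (full prompt, stop sequences), falling back to the defaults."""
    # Assumed history layout: one "### Role: content" line per message.
    chat_history = "".join(
        f"### {'Human' if m['role'] == 'user' else 'Assistant'}: {m['content']}\n"
        for m in messages
    )
    full_prompt = chat_history + (prompt if prompt else DEFAULT_PROMPT)
    stop = prompt_stop if prompt_stop else list(DEFAULT_STOP)
    return full_prompt, stop
```

Calling it without the two optional arguments reproduces the hard-coded behavior, which is the backward-compatibility point made in the proposal.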
I was looking at Open Interpreter and the source code was using litellm. So, I figured I'd take a peek at it and vllm. I checked out the docs for litellm templates and they have a fairly nice structure for prefixing and postfixing.
# Create your own custom prompt template works
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n",
            "post_message": "\n<</SYS>>\n [/INST]\n"
        },
        "user": {
            "pre_message": "[INST] ",
            "post_message": " [/INST]\n"
        },
        "assistant": {
            "post_message": "\n"
        }
    }
)
def test_huggingface_custom_model():
    model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
    response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
    print(response['choices'][0]['message']['content'])
    return response
test_huggingface_custom_model()
Found it pretty interesting because you can feed in the structure as a
I ran through it with GPT again and this is what it came up with as a proof-of-concept.
Revised Proposal for Role-Based Customizable Prompts in Chat Completions
Problem Statement
The current chat completions implementation relies on hard-coded prompts, limiting customization and flexibility. This is a bottleneck when adapting the code to specialized projects requiring unique role-based prompt styles or formats.
Proposed Solution
Replace existing
def create_chat_completion(
    # ...existing parameters...
    role_templates: Optional[Dict[str, Dict[str, str]]] = None
):
    # ...existing code...
Benefits
Backward Compatibility
This change maintains backward compatibility since the
Suggested Defaults
A reasonable default could mirror Llama-2's prompt structure:
DEFAULT_ROLE_TEMPLATES = {
    "system": {
        "pre_message": "[INST] <<SYS>>\n",
        "post_message": "\n<</SYS>>\n [/INST]\n"
    },
    "user": {
        "pre_message": "[INST]",
        "post_message": " [/INST]\n"
    },
    "assistant": {
        "post_message": "\n"
    }
}
Related Works
I know it won't be that simple after reviewing the code. Just wanted to share. Maybe it would inspire something. |
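As a rough illustration of how a role_templates mapping could be applied, here is a sketch that wraps each message in its role's pre_message / post_message. The function name `render_with_role_templates` is hypothetical, and the default values are copied from the litellm snippet quoted in the thread rather than from any merged code.

```python
from typing import Dict, List, Optional

RoleTemplates = Dict[str, Dict[str, str]]

# Values taken from the litellm example quoted above.
DEFAULT_ROLE_TEMPLATES: RoleTemplates = {
    "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n [/INST]\n"},
    "user": {"pre_message": "[INST] ", "post_message": " [/INST]\n"},
    "assistant": {"post_message": "\n"},
}


def render_with_role_templates(
    messages: List[Dict[str, str]],
    role_templates: Optional[RoleTemplates] = None,
) -> str:
    """Wrap every message in the prefix/suffix registered for its role."""
    templates = role_templates or DEFAULT_ROLE_TEMPLATES
    parts = []
    for message in messages:
        # Roles without an entry, or entries missing a key, add nothing.
        template = templates.get(message["role"], {})
        parts.append(
            template.get("pre_message", "")
            + message["content"]
            + template.get("post_message", "")
        )
    return "".join(parts)
```

As noted earlier in the thread, the prompt is only one piece; stop sequences and grammars would still need separate handling.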
What models actually use the current chat prompt template? It seems most models use Alpaca's format:
|
It depends on the dataset and how it's trained and/or finetuned. The format varies from model to model, but the two most popular pairings are usually "### Instruction:" with "### Assistant:", and "### Human:" with "Assistant:". Sometimes it's "### Human:" and "### Bot:". Open Assistant uses a mixture depending on version and dataset: "prompter:" or "human:", and "assistant:". Some models are more complex than others, e.g. it's system prompt, input, instruction, and then response. There's no fixed, or commonly accepted, format yet as far as I can tell. Most chat models follow system, user, assistant, or some variation. Whether tokens are used to denote which is which depends on the model. |
The closest thing to standard is ChatML. And it's not widely accepted. I've adopted it, and open assistant has adopted it. Vicuna and wizardLM haven't. Hopefully a consensus emerges in the next year. |
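For reference, a minimal sketch of what ChatML formatting looks like in code. The `<|im_start|>` / `<|im_end|>` markers are the ChatML delimiters; the function name and the `add_generation_prompt` flag are illustrative choices, not an existing API in this project.

```python
from typing import Dict, List

# Sequence that ends an assistant turn in ChatML.
CHATML_STOP = ["<|im_end|>"]


def format_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages as ChatML, optionally opening a new assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```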
I'm for ChatML. The high-level interface is intuitive and easy to reason about and follow. The low-level interface is similar to what Meta did with Llama-2's chat interface. The tokens could probably be simplified though. Maybe the use of something more like markup would be an improvement?
<system>System prompt goes here</system>
<user>User prompt goes here</user>
<assistant>
And just "teach" the model that The output could be parsed similarly to XML/HTML. I'm still learning, so just take what I'm saying with a friendly grain of salt. This is something I plan on experimenting with if I get the opportunity to do it in the future. I agree though, a consensus would be nice. |
The current scheme implemented in llama-cpp-python doesn't follow a convention I know of. Please see the links in my original issue for a comprehensive and detailed list of the currently popular prompt templates. 90%+ of use cases will be covered if the following formats are supported:
The best source of documentation on these prompt formats is probably the model cards in TheBloke's distributions which are very well researched. |
Hey @ehartford I just merged in the #711 PR which adds a mechanism to specify common chat formats through a Currently supports:
Let me know if that works for you! |
Nice! |
ChatML would be lovely; it's garnering more support. |
I've noted this as well and it's great to see support being added just now. But looking at the code, it seems as if there is room to be a bit more flexible and customizable. For example, LocalAI allows people to add YAML files with a configuration preset for each model. I really like their idea in general. Maybe it would be an option for the future to have something similar. Instead of having everything fixed in the code, allow people to add a YAML file option and pass the content of the file into a |
Or better yet, accept a general lambda as an argument and implement the YAML idea as a specific lambda that can take a YAML file and template the response. E.g.,
|
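One way to read the "general lambda" suggestion: the completion call accepts any callable that maps messages to a prompt and stop sequences, and a config-driven formatter is just one such callable. The sketch below is an assumption-heavy illustration; `ChatFormatter`, `formatter_from_config`, `run_chat_completion`, and the config keys (`generation_prefix`, `stop`, one template per role) are all hypothetical names, and the config is a plain dict so no YAML parser is needed.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
# A formatter turns messages into (prompt, stop sequences).
ChatFormatter = Callable[[List[Message]], Tuple[str, List[str]]]


def formatter_from_config(config: Dict[str, str]) -> ChatFormatter:
    """Build a formatter from a mapping, e.g. one loaded from a config file."""

    def formatter(messages: List[Message]) -> Tuple[str, List[str]]:
        # Each role has a template with a single {content} placeholder.
        body = "".join(
            config[m["role"]].format(content=m["content"]) for m in messages
        )
        return body + config["generation_prefix"], [config["stop"]]

    return formatter


def run_chat_completion(
    messages: List[Message],
    chat_formatter: ChatFormatter,
    complete: Callable[[str, List[str]], str],
) -> str:
    """Format the chat, then hand prompt and stops to any completion backend."""
    prompt, stop = chat_formatter(messages)
    return complete(prompt, stop)
```

With this shape, a hand-written lambda and a file-based preset are interchangeable from the caller's point of view.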
Anything besides yaml, please 🙏. Simple is always better. |
@ehartford I'll add that and a few others I missed (mistral as well). @r7l I'll consider this but likely as a utility for the server that converts a config file / template into a chat formatting function. |
Thank you for your work on chat templates and llama-cpp-python generally!! Curious if you could just universally piggyback on the HuggingFace template hub or let users specify a tokenizer_config.json to completely outsource it to this developing standard of rendering arbitrary Jinja provided with the model? I would be surprised if new model releases don't all start coming with their own tokenizer config definition. EDIT: Delivered the above in linked PR |
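A small sketch of the lookup half of that idea: pulling the template string out of a tokenizer_config.json, where Hugging Face stores it under the `chat_template` key. Rendering the Jinja itself is left out here, and `load_chat_template` is a hypothetical helper name, not part of the linked PR.

```python
import json
from typing import Optional


def load_chat_template(tokenizer_config_json: str) -> Optional[str]:
    """Return the chat template from tokenizer_config.json text, if present."""
    config = json.loads(tokenizer_config_json)
    template = config.get("chat_template")
    # Only a plain string is handled in this sketch; anything else
    # (missing key, other structures) is treated as "no template".
    return template if isinstance(template, str) else None
```

A caller could fall back to a named built-in format whenever this returns None.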
Generation does not stop when using Prompt template: ChatML. I think we need to add a stop token. It worked for me.
|
A default stop token would be huge for realizing a transparent model provider. |
There's a problem with using the stop tokens. I'm not sure what the difference is yet, but I noticed that using the special tokens in the user facing templates causes a lot of issues. I would advise not using special tokens at all with llama.cpp. In almost every test I conducted, the models started repeating themselves, derailing, and more. Using the base template seems to work beautifully though. Not a single issue once I do that. |
So you exclude them from the template, but you would still set it as a default stop sequence item, right? That saves you from having to specify it in the payload. It will be needed to achieve total model-backing transparency, fully decoupling the model from the chat consumer, other than maybe max tokens in the payload. |
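The "default stop sequence" idea could look roughly like the sketch below: each chat format carries its own stop sequences, and whatever the request payload supplies is merged on top. The `DEFAULT_STOPS` table and `resolve_stop` are illustrative assumptions, not values or functions taken from the library.

```python
from typing import Dict, List, Optional, Union

# Illustrative defaults only; real values depend on the model and format.
DEFAULT_STOPS: Dict[str, List[str]] = {
    "chatml": ["<|im_end|>"],
    "alpaca": ["### Instruction:"],
}


def resolve_stop(
    chat_format: str,
    request_stop: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Merge the format's default stops with any stops sent in the payload."""
    defaults = DEFAULT_STOPS.get(chat_format, [])
    if request_stop is None:
        extra: List[str] = []
    elif isinstance(request_stop, str):
        extra = [request_stop]
    else:
        extra = list(request_stop)

    merged: List[str] = []
    for sequence in defaults + extra:
        # Keep order, drop duplicates.
        if sequence not in merged:
            merged.append(sequence)
    return merged
```

This keeps the client payload free of model-specific details while still letting a caller add extra stops.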
I noted it in my PR on L14. IMPORTANT NOTES:
Example using the llama-2 model and its templating schema:
This initial example is a proper template format that the model understands. It results in proper output and does not confuse the model.
This example includes the use of special tokens, and the model may or may not use these tokens as a result. The model is not expecting them during inference, which causes unexpected behavior.
This example is improperly formatted and causes the model to become confused. The model begins to fixate on tokens, uses language repetition, and eventually derails. Note that the |
I propose we use a dedicated library for this: chatformat. Additionally, what I'm missing with the current implementation is the possibility to "preface" the model's output. The issue is that the current implementation seals off the last incomplete message:
The |
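To illustrate the "preface" request, here is a sketch in which a trailing assistant message is left unsealed so the model continues it instead of starting a fresh turn. ChatML is used only as an example format, and `format_chatml_with_preface` is a hypothetical name; this is not how chatformat or llama-cpp-python implement it.

```python
from typing import Dict, List


def format_chatml_with_preface(messages: List[Dict[str, str]]) -> str:
    """Render ChatML, leaving a final assistant message open for continuation."""
    parts = []
    for index, message in enumerate(messages):
        is_last = index == len(messages) - 1
        if is_last and message["role"] == "assistant":
            # Do not append the end marker: the model picks up mid-message.
            parts.append(f"<|im_start|>assistant\n{message['content']}")
        else:
            parts.append(
                f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
            )
    if not messages or messages[-1]["role"] != "assistant":
        # No preface given: open an empty assistant turn as usual.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```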
we should probably use jinja templates, so the user can specify them at runtime if needed. engines won't know the template to use. users can have models they just finished fine tuning with custom grammars, etc. this will get everyone what they want.
|
@earonesty Good point about having custom templates. But I think using a templating engine is overcomplicating the matter. These chat formats generally consist of "rounds" that are stacked together. A round is defined as
We can cover 99% of all possible formats by
So for example for Alpaca, the format can be defined as:
alpaca:
  with_system: |-
    {system}
    ### Instruction:
    {user}
    ### Response:
    {assistant}</s>
  without_system: |-
    ### Instruction:
    {user}
    ### Response:
    {assistant}</s>
  round_seperator: "\n\n"
If you know of a format that is not covered by this convention, please comment. |
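The round-based convention above can be sketched in a few lines. The dict below mirrors the Alpaca definition as it appears in this thread (including the `round_seperator` key spelling); `render_rounds` and the choice to cut the template at the `{assistant}` slot for an unfinished round are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

ALPACA: Dict[str, str] = {
    "with_system": "{system}\n### Instruction:\n{user}\n### Response:\n{assistant}</s>",
    "without_system": "### Instruction:\n{user}\n### Response:\n{assistant}</s>",
    "round_seperator": "\n\n",
}


def render_rounds(
    fmt: Dict[str, str],
    rounds: List[Tuple[str, Optional[str]]],
    system: Optional[str] = None,
) -> str:
    """Stack (user, assistant) rounds; an assistant of None leaves the round open."""
    rendered = []
    for index, (user, assistant) in enumerate(rounds):
        # Only the first round carries the system message.
        use_system = index == 0 and system is not None
        template = fmt["with_system"] if use_system else fmt["without_system"]
        if assistant is None:
            # Cut right before the assistant slot so the model fills it in.
            template = template.split("{assistant}")[0]
            assistant = ""
        rendered.append(template.format(system=system, user=user, assistant=assistant))
    return fmt["round_seperator"].join(rendered)
```

Message text is passed as format arguments, so braces inside user content are not interpreted as placeholders.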
Co-authored-by: Andrei <[email protected]>
Llama-1, Llama-2, RedPajama, Mistral, Refact, etc... |
jinja2 is incredibly easy and lightweight, and it's 100% compatible with what all current researchers are producing in their token configuration files
|
@teleprint-me The following prompts were generated using the proposed scheme Llama-2
Vicuna (and Mistral)
ChatML
Where's the problem? |
the problem is that jinja2 is what is sitting in hf config files. so it's future compatible with stuff you haven't heard of, and can be sucked into gguf file metadata, so that the user isn't on the hook to specify a template when working with gguf files. it has a forward compatibility path that matters. |
Having it load right from the meta data would be killer. |
Where in the specification is that? Also, ggerganov already stated he plans on using oblique templates and it will be a minimal, separate, implementation. |
gguf allows you to store any metadata you want. models at hf have jinja2 templates in their tokenizer configs, so really, the specification doesn't matter that much. can just add it to the convert script. |
So can I define a custom format like:
How? So far I have a dropdown box that selects the pre-defined formats. |
Is your feature request related to a problem? Please describe.
When generating chat completion, it is hard-coded to generate a non-standard prompt template that looks something like:
system message is currently ignored.
llama-cpp-python/llama_cpp/llama.py
Line 1578 in 255d653
This mostly works for most models. But it's not correct.
Describe the solution you'd like
Describe alternatives you've considered
modifying llama-cpp-python to hard code it to llama2-chat format, not a great solution.
Additional context