llamafile v0.8.15
The --chat
bot interface now supports syntax highlighting 42 separate programming languages: ada, asm, basic, c, c#, c++, cobol, css, d, forth, fortran, go, haskell, html, java, javascript, json, kotlin, ld, lisp, lua, m4, make, markdown, matlab, pascal, perl, php, python, r, ruby, rust, scala, shell, sql, swift, tcl, tex, txt, typescript, and zig.
That chatbot now supports more commands:
/undo
may be used to have the LLM forget the last thing you said. This is useful when you get a poor response and want to try asking your question a different way, without needing to start the conversation over from scratch./push
and/pop
works similarly, in the sense that it allows you to rewind a conversation to a previous state. In this case, it does so by creating save points within your context window. Additionally,/stack
may be used to view the current stack./clear
may be used to reset the context window to the system prompt, effectively starting your conversation over./manual
may be used to put the chat interface in "manual mode" which lets you (1) inject system prompts, and (2) speak as the LLM. This could be useful in cases where you want the LLM to believe it said something when it actually didn't./dump
may be used to print out the raw conversation history, including special tokens (that may be model specific). You can also say/dump filename.txt
to save the raw conversation to a file.
We identified an issue with Google's Gemma models, where the chatbot wasn't actually inserting the system prompt. That's now fixed. So you can now instruct Gemma to do roleplaying if you pass the flags llamafile -m gemma.gguf -p "you are role playing as foo" --chat
.
You can now type CTRL-J to create multi-line prompts in the terminal chatbot. It works similarly to shift-enter in the browser. It can be a quicker alternative to using the chatbot's triple quote syntax, i.e. """multi-line / message"""
.
Bugs in the new chatbot have been fixed. For example, we now do a better job making sure special tokens like BOS, EOS, and EOT get inserted when appropriate into the conversation history. This should improve fidelity when using the terminal chatbot interface.
The --threads
and --threads-batch
flags may now be used separately to tune how many threads are used for prediction and prefill.
The llamafile-bench command now supports benchmarking GPU support (see #581 from @cjpais)
Both servers now support configuring a URL prefix, thanks to (see #597 and #604 from @vlasky)
Support for the IQ quantization formats is being removed from our CUDA module to save on build times. If you want to use IQ quants with your NVIDIA hardware, you need to pass the --iq --recompile
flags to llamafile once, to build a ggml-cuda module for your system that includes them.
Finally, we have an alpha release of a new /v1/chat/completions
endpoint for the new llamafiler
server. We're planning to build a new web interface that's based on this soon, so you're encouraged to test this, since llamafiler will eventually replace the old server too. File an issue if there's any features you need.