Releases: ngxson/wllama
1.17.1
1.17.0
1.16.4
1.16.3
What's Changed
Thanks to a small refactoring on llama.cpp, the binary size is now reduced from 1.78MB to 1.52MB
Full Changelog: 1.16.2...1.16.3
1.16.2
1.16.1
1.16.0
SmolLM-360m is added as a model in the main example. Try it now --> https://huggingface.co/spaces/ngxson/wllama
Special thanks to the @huggingface team for providing such a powerful model in a very small size!
What's Changed
Full Changelog: 1.15.0...1.16.0
1.15.0
New features
downloadModel()
Download a model to the cache without loading it. The use case is to let an application provide a "model manager" screen (a short usage sketch follows this list) that allows the user to:
- Download a model via downloadModel()
- List all downloaded models using CacheManager.list()
- Delete a downloaded model using CacheManager.delete()
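A minimal sketch of how this model-manager flow might be wired together. Only downloadModel(), CacheManager.list() and CacheManager.delete() come from this release; the import path, constructor config paths, model URL and cache key are illustrative assumptions:

```ts
import { Wllama, CacheManager } from '@wllama/wllama';

// Paths to the wasm binaries as served by your app (assumed layout)
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/wllama/multi-thread/wllama.wasm',
};

// Hypothetical GGUF URL; replace with the model you actually want to manage
const MODEL_URL = 'https://example.com/models/smollm-360m-q4_k_m.gguf';

async function modelManagerDemo(): Promise<void> {
  const wllama = new Wllama(CONFIG_PATHS);

  // Download the model into the cache without loading it into memory
  await wllama.downloadModel(MODEL_URL);

  // List everything currently cached, e.g. to render the "model manager" screen
  const cached = await CacheManager.list();
  console.log('cached entries:', cached);

  // Remove a model when the user deletes it from the manager screen
  // (the key to pass is an assumption; use whatever CacheManager.list() returns)
  await CacheManager.delete(MODEL_URL);
}

modelManagerDemo();
```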
KV cache reuse in createCompletion
When calling createCompletion, you can pass useCache: true as an option. It will reuse the KV cache from the last createCompletion call. This is equivalent to the cache_prompt option on the llama.cpp server.
wllama.createCompletion(input, {
useCache: true,
...
});
For example:
- On the first call, you have 2 messages: user: hello, assistant: hi
- On the second call, you add one message: user: hello, assistant: hi, user: who are you?
Then, only the added message user: who are you? will need to be evaluated.
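As a rough sketch, the two calls described above could look like this; the prompt formatting and the nPredict option are illustrative assumptions, and only useCache comes from this release:

```ts
// First call: the whole prompt is evaluated and its KV cache is kept
const firstPrompt = 'user: hello\nassistant: hi\n'; // prompt formatting is illustrative
await wllama.createCompletion(firstPrompt, {
  useCache: true,
  nPredict: 64, // assumed option for the max number of generated tokens
});

// Second call: same prefix plus one new message, so only the added
// "user: who are you?" part needs to be evaluated
const secondPrompt = firstPrompt + 'user: who are you?\nassistant: ';
const answer = await wllama.createCompletion(secondPrompt, {
  useCache: true,
  nPredict: 64,
});
console.log(answer);
```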
What's Changed
- Add downloadModel function by @ngxson in #95
- fix log print and downloadModel by @ngxson in #100
- Add main example (chat UI) by @ngxson in #99
- Improve main UI example by @ngxson in #102
- implement KV cache reuse by @ngxson in #103
Full Changelog: 1.14.2...1.15.0
1.14.2
Update to latest upstream llama.cpp source code:
- Fix support for Llama 3.1, Phi-3 and SmolLM
Full Changelog: 1.14.0...1.14.2