Added:
- `llama-cpp-cffi` server, compatible with `llama.cpp` CLI options instead of OpenAI's
Changed:
- `llama.cpp` revision `a813badbbdf0d38705f249df7a0c99af5cdee678`
- Refactored `Options` class into two separate classes: `ModelOptions`, `CompletionsOptions` (see the sketch after this entry)
Removed:
- Removed ambiguous `Options` class
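A minimal sketch of the idea behind the split, using stand-in dataclasses; the field names below are illustrative assumptions, not llama-cpp-cffi's actual definitions:

```python
# Illustrative sketch only: splitting one ambiguous options object into
# model-level and per-call options. Field names are assumptions for
# illustration, not llama-cpp-cffi's actual API.
from dataclasses import dataclass

@dataclass
class ModelOptions:
    creator_hf_repo: str      # which model/tokenizer to load (load-time concern)
    gpu_layers: int = 0       # assumed load-time setting

@dataclass
class CompletionsOptions:
    prompt: str = ''          # per-request input
    top_p: float = 0.95       # per-request sampling setting

# Load-time settings no longer mix with per-request sampling settings.
model_opts = ModelOptions(creator_hf_repo='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
completions_opts = CompletionsOptions(prompt='Hello!', top_p=0.9)
print(model_opts, completions_opts)
```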
Changed:
- In `format_messages`, optional `options` argument
- `llama.cpp` revision `081b29bd2a3d91e7772e3910ce223dd63b8d7d26`
Changed:
- `llama.cpp` revision `5437d4aaf5132c879acda0bb67f2f8f71da4c9fe`
Fixed:
- Replaced `tokenizer.decode(new_token_id)` with custom `_common_token_to_piece(context, new_token_id, True)`
Fixed:
- `sampler_init`, because `llama_sampler_init_penalties` in `llama.cpp` changed its behaviour
Changed:
- `llama.cpp` revision `4f51968aca049080dc77e26603aa0681ea77fe45`
- Build process now has global variable `LLAMA_CPP_GIT_REF`
Fixed:
- Issue with Phi 3.5 based models, `tokenizer.decode([new_token_id], clean_up_tokenization_spaces=False)`
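For context, a minimal sketch of the decode call with cleanup disabled; the model name is only an example and the surrounding streaming loop is omitted:

```python
# Sketch: decoding a single token id with clean_up_tokenization_spaces=False,
# which keeps the raw piece text intact for tokenizers (e.g. Phi 3.5 based
# models) that would otherwise strip or merge spaces during cleanup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3.5-mini-instruct')  # example model
new_token_id = tokenizer.encode(' world', add_special_tokens=False)[0]
piece = tokenizer.decode([new_token_id], clean_up_tokenization_spaces=False)
print(repr(piece))
```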
Added:
- `Model.free`
Changed:
- Fixed revision of `llama.cpp` for all wheels
- `llama.cpp` revision `c27ac678dd393af0da9b8acf10266e760c8a0912`
- Disabled `llama_kv_cache_seq_cp` in `_decode_tokens`
Fixed:
- Batch "decode" process. NOTE: Encode part is missing for encoder-decoder models.
- Thread-safe calls to the most important functions of llama, llava, clip, ggml API.
Removed:
- `mllama_completions`: low-level function for Mllama-based VLMs
Added:
- New high-level Python API (see the usage sketch after this list)
- Low-level C API calls from `llama.h`, `llava.h`, `clip.h`, `ggml.h`
- `completions`: high-level function for LLMs / VLMs
- `text_completions`: low-level function for LLMs
- `clip_completions`: low-level function for CLIP-based VLMs
- WIP: `mllama_completions`: low-level function for Mllama-based VLMs
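A purely hypothetical usage sketch of the new high-level API; the function name `completions` comes from this list, but the import path, argument names, and streaming-style iteration are assumptions rather than documented behaviour:

```python
# Hypothetical sketch only: the import path, keyword argument, and streamed
# iteration below are assumptions; only the name `completions` is from the changelog.
from llama import completions  # import path assumed

for chunk in completions(prompt='Describe CFFI in one sentence.'):  # signature assumed
    print(chunk, end='', flush=True)
```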
Changed:
- All examples
Removed:
- `llama_generate` function
- `llama_cpp_cli`
- `llava_cpp_cli`
- `minicpmv_cpp_cli`
Added:
- Support and examples for `llava` and `minicpmv` models.
Added:
- `llava` high-level API calls
- `minicpmv` high-level API support
Changed:
- Updated `llama.cpp`.
Changed:
- Updated `llama.cpp`.
Changed:
- Updated `llama.cpp`.
Changed:
- Updated `llama.cpp` with RWKV 6 support.
Changed:
- Updated `llama.cpp` with RWKV support.
Added:
- `LLAMA_CPP_BACKEND`, which can be `cuda`, `vulkan` or `cpu`.
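A small sketch of selecting the backend; the variable name comes from this entry, while reading it before importing the library is an assumption about when it is consulted:

```python
# Sketch: choosing the compute backend via the LLAMA_CPP_BACKEND environment
# variable before the library is imported (timing is an assumption).
import os

os.environ.setdefault('LLAMA_CPP_BACKEND', 'vulkan')  # 'cuda', 'vulkan' or 'cpu'
print('requested backend:', os.environ['LLAMA_CPP_BACKEND'])
```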
Changed:
- Updated `llama.cpp`.
- Updated requirements.
- CUDA backend imports only the CUDA 12.6 library.
Fixed:
- `Options.top_p` check using `isinstance`.
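A stand-in illustration of why a type check can be preferable to a truthiness check for `top_p`; this is not the library's `Options` code:

```python
# Stand-in illustration: top_p == 0.0 is falsy but still a meaningful value,
# so checking the type with isinstance avoids silently dropping it.
top_p = 0.0

if isinstance(top_p, float):          # keeps 0.0
    print(f'--top-p {top_p}')

if top_p:                             # a truthiness check would skip 0.0
    print('truthiness check would emit the option here')
```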
Changed:
- Updated `llama.cpp`.
Added:
- `SmolLM-1.7B-Instruct-v0.2` examples.
Changed:
- Updated `llama.cpp`.
Fixed:
- Vulkan detection.
Fixed:
- CUDA and Vulkan detection.
Added:
- Build `vulkan_1_x` for general GPU.
- Build cuda 12.4.1 as default.
Changed:
- Renamed examples for TinyLlama (chat, tool calling) and OpenAI.
- Updated demo models definitions.
- Updated examples (chat, tool calling).
- `get_special_tokens` now supports parameter `force_standard_special_tokens: bool=False`, which bypasses the tokenizer's special tokens with standard/common ones (see the sketch after this list).
- Build cuda 12.5.1 as additional build target but packaged on PyPI.
- Build cuda 12.6 as additional build target but packaged on PyPI.
- Build `openblas` as additional build target but packaged on PyPI.
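A stand-in sketch of the idea behind `force_standard_special_tokens`; this is not the library's implementation, and the token values shown are only examples:

```python
# Stand-in sketch: when forced, ignore what the tokenizer declares and fall
# back to standard/common special tokens. Not llama-cpp-cffi's implementation.
STANDARD_SPECIAL_TOKENS = {'bos': '<s>', 'eos': '</s>', 'unk': '<unk>'}

def get_special_tokens_sketch(tokenizer_special_tokens: dict,
                              force_standard_special_tokens: bool = False) -> dict:
    if force_standard_special_tokens:
        return dict(STANDARD_SPECIAL_TOKENS)
    return dict(tokenizer_special_tokens)

print(get_special_tokens_sketch({'bos': '<|im_start|>', 'eos': '<|im_end|>'},
                                force_standard_special_tokens=True))
```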
Fixed:
- Handle `Options.no_display_prompt` on Python side.
Changed:
- OpenAI: allow import of `routes` and `v1_chat_completions` handler.
- `examples/demo_0.py`, tool calling example.
Added:
- In `openai`, support for `prompt` and `extra_body` (see the client sketch after this list). Reference: https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/src/openai/resources/completions.py#L41
- Pass `llama-cli` options to `openai`.
- `util` module with `is_cuda_available` function.
- `openai` supports both `prompt` and `messages`. Reference: https://github.com/openai/openai-python/blob/195c05a64d39c87b2dfdf1eca2d339597f1fce03/src/openai/resources/completions.py#L45
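A client-side sketch of using `prompt` together with `extra_body` against an OpenAI-compatible server; the base URL, model id, and the keys inside `extra_body` are assumptions for illustration:

```python
# Sketch: openai-python client call with prompt and extra_body. Only the use of
# `prompt` and `extra_body` comes from the changelog; base_url, model id and
# the extra_body keys are assumptions.
from openai import OpenAI

client = OpenAI(base_url='http://127.0.0.1:11434/v1', api_key='not-needed')  # local server assumed

response = client.completions.create(
    model='TinyLlama/TinyLlama-1.1B-Chat-v1.0',   # example model id
    prompt='Write one sentence about CFFI.',
    extra_body={'temp': 0.7},                      # backend-specific options (keys assumed)
)

print(response.choices[0].text)
```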
Added:
- Support for default CPU tinyBLAS (llamafile, sgemm) builds.
- Support for CPU OpenBLAS (GGML_OPENBLAS) builds.
Changed:
- Build scripts now have a separate step/function `cuda_12_5_1_setup`, which sets up the CUDA 12.5.1 env at build time.
Fixed:
- Stop thread in `llama_generate` on `GeneratorExit`.
Removed:
- `callback` parameter in `llama_generate` and dependent functions.
Added:
- `Model.tokenizer_hf_repo` as optional, for cases when `Model.creator_hf_repo` cannot be used to tokenize / format prompt/messages (see the sketch below).
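A hedged sketch of the fallback; only `creator_hf_repo` and `tokenizer_hf_repo` are named in this entry, while the import path and other constructor details are assumptions:

```python
# Hypothetical sketch: use tokenizer_hf_repo when creator_hf_repo cannot be
# used to tokenize / format prompts. Import path and constructor usage assumed.
from llama import Model  # import path assumed

model = Model(
    creator_hf_repo='creator/original-model',            # normally also used for tokenization
    tokenizer_hf_repo='other-org/compatible-tokenizer',  # optional override (this entry)
)
```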
Added:
- Support for `stop` tokens/words.
Changed:
- `llama/llama_cli.py`: unified CPU and CUDA 12.5 modules into a single module.
Removed:
- Removed separate examples for CPU and CUDA 12.5 modules.
Changed:
- Updated `huggingface-hub`.
Fixed:
- `llama.__init__` now correctly imports submodules and handles CPU and CUDA backends.
- OpenAI: `ctx_size: int = config.max_position_embeddings if max_tokens is None else max_tokens`.
Fixed:
- Build for linux, upx uses best compression option, 7z uses more aggressive compression.
- Do not use UPX for shared/dynamic library compression.
Added:
- README: supported GPU Compute Capability for CUDA.
Fixed:
- Cleaned up `build.py`.
- Type annotations in OpenAI related code.
Added:
- Support for PyPy 3.10 versions.
Changed:
- Disabled GitHub Actions.
- Uses `upx -9` to compress shared/dynamic libraries.
- Repacks `whl` with better compression rate.
- Auto-detect CUDA support.
Removed:
- ctypes version and demos.
Added:
- Preparation for an OpenAI © compatible Chat Completions API server.
Fixed:
- Argument `options` is `deepcopy`-ed when passed to `llama_generate(options)`, so it can be reused.
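A stand-in illustration of the effect of this fix: because the options argument is deep-copied on entry, internal mutations no longer leak back to the caller, so the same object can be reused across calls. This is not `llama_generate` itself:

```python
# Stand-in illustration of the deepcopy fix; not llama-cpp-cffi's llama_generate.
import copy

def generate_sketch(options: dict) -> str:
    options = copy.deepcopy(options)      # what the fix does on entry
    options['prompt'] += ' [formatted]'   # internal mutation stays local
    return options['prompt']

options = {'prompt': 'Hello'}
print(generate_sketch(options))  # 'Hello [formatted]'
print(generate_sketch(options))  # same result; caller's options unchanged
```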
Changed:
- Build for `manylinux_2_28` and `musllinux_1_2`.
- Build for CUDA Compute Capability >= 6.1.
Changed:
- Updated: `huggingface-hub = "^0.24.0"`, `setuptools = "^71.0.3"`
Added:
- Park first version.