cortex.llamacpp is a high-efficiency C++ inference engine for edge computing.
It is a dynamic library that can be loaded by any server at runtime.
.
├── base -> Engine interface
├── examples -> Server example to integrate engine
├── llama.cpp -> Upstream llama.cpp (submodule)
├── src -> Engine implementation
├── third-party -> Dependencies of the cortex.llamacpp project
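Because the engine ships as a plain shared library, a host server can load it at runtime with the platform's dynamic loader. The sketch below shows one way to do that with dlopen on Linux or macOS; the get_engine factory symbol and the EngineI type are placeholder assumptions used only for illustration, and the actual engine interface lives under base.

```cpp
// Minimal sketch: dynamically load the engine library at runtime (Linux/macOS).
// The exported symbol name ("get_engine") and the EngineI type are assumptions
// for illustration; consult base/ for the real engine interface.
#include <dlfcn.h>
#include <cstdio>

struct EngineI;  // opaque handle to the engine interface

int main() {
  // On macOS the library is libengine.dylib instead of libengine.so.
  void* handle = dlopen("./engines/cortex.llamacpp/libengine.so", RTLD_LAZY);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  // Hypothetical factory symbol exported by the engine library.
  using get_engine_fn = EngineI* (*)();
  auto get_engine =
      reinterpret_cast<get_engine_fn>(dlsym(handle, "get_engine"));
  if (!get_engine) {
    std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
    dlclose(handle);
    return 1;
  }

  EngineI* engine = get_engine();  // from here the host forwards requests to the engine
  (void)engine;

  dlclose(handle);
  return 0;
}
```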
This guide provides step-by-step instructions for building cortex.llamacpp from source on Linux, macOS, and Windows systems.
First, you need to clone the cortex.llamacpp repository:
git clone --recurse-submodules https://github.com/janhq/cortex.llamacpp.git
If you don't have git, you can download the source code as a file archive from the cortex.llamacpp GitHub repository.
- On Windows, install Chocolatey, then use it to install make:
choco install make -y
- On Linux and Windows:
make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON"
- On macOS with Apple Silicon:
make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_METAL_EMBED_LIBRARY=ON"
- On macOS with Intel processors:
make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_METAL=OFF"
- On Linux with CUDA:
make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_CUDA=ON"
Step 1: Download a model
mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
Step 2: Start the server
- On macOS:
mkdir -p engines/cortex.llamacpp
cp ../../build/libengine.dylib engines/cortex.llamacpp/
./server
- On Linux:
mkdir -p engines/cortex.llamacpp
cp ../../build/libengine.so engines/cortex.llamacpp/
./server
- On Windows:
cd Release
mkdir engines\cortex.llamacpp
copy ..\..\..\build\Release\engine.dll engines\cortex.llamacpp\
server.exe
Step 3: Load the model
curl http://localhost:3928/loadmodel \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/model/llama-2-7b-model.gguf",
"model_alias": "llama-2-7b-model",
"ctx_len": 512,
"ngl": 100,
"model_type": "llm"
}'
Step 4: Make an inference
curl http://localhost:3928/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "Who won the world series in 2020?"
}
],
"model": "llama-2-7b-model"
}'
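The same request can also be sent from C++ for programmatic integration. Below is a minimal sketch using libcurl, which is an external dependency assumed here and not part of cortex.llamacpp; the endpoint and payload mirror the curl example above, and the response body is printed to stdout.

```cpp
// Sketch: send the chat completion request from C++ with libcurl.
// libcurl is an assumed external dependency; install it separately.
#include <curl/curl.h>
#include <cstdio>

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (!curl) return 1;

  // Same JSON body as the curl example above.
  const char* body = R"({
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "model": "llama-2-7b-model"
  })";

  struct curl_slist* headers = nullptr;
  headers = curl_slist_append(headers, "Content-Type: application/json");

  curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:3928/v1/chat/completions");
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
  curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

  // With no write callback set, libcurl writes the response to stdout.
  CURLcode res = curl_easy_perform(curl);
  if (res != CURLE_OK)
    std::fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

  curl_slist_free_all(headers);
  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return 0;
}
```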
Table of parameters
Parameter | Type | Description
---|---|---
llama_model_path | String | The file path to the LLaMA model.
ngl | Integer | The number of GPU layers to use.
ctx_len | Integer | The context length for model operations.
embedding | Boolean | Whether to use embedding in the model.
n_parallel | Integer | The number of parallel operations.
cont_batching | Boolean | Whether to use continuous batching.
user_prompt | String | The prompt to use for the user.
ai_prompt | String | The prompt to use for the AI assistant.
system_prompt | String | The prompt to use for system rules.
pre_prompt | String | The prompt to use for internal configuration.
cpu_threads | Integer | The number of threads to use for inference (CPU mode only).
n_batch | Integer | The batch size for the prompt evaluation step.
caching_enabled | Boolean | Whether to enable prompt caching.
grp_attn_n | Integer | Group attention factor in self-extend.
grp_attn_w | Integer | Group attention width in self-extend.
mlock | Boolean | Prevents the system from swapping the model to disk (macOS).
grammar_file | String | Path to a GBNF grammar file used to constrain sampling.
model_type | String | Model type to use: llm or embedding; default is llm.
model_alias | String | Used as model_id if specified in a request; mandatory in loadmodel.
model | String | Used as model_id if specified in a request; mandatory in chat/embedding requests.
flash_attn | Boolean | Whether to enable Flash Attention; default is true.
cache_type | String | KV cache type: f16, q8_0, or q4_0; default is f16.
use_mmap | Boolean | Whether to enable mmap; default is true.
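To illustrate how these parameters combine, here is a sketch of a fuller loadmodel body written as a C++ raw string so it can be posted with the libcurl sketch above (or with curl, as in Step 3). All field names come from the table; the values are illustrative only and should be tuned for your hardware and model.

```cpp
// Illustrative loadmodel payload exercising several optional parameters.
// Values are examples only, not recommended settings.
const char* load_body = R"({
  "llama_model_path": "/model/llama-2-7b-model.gguf",
  "model_alias": "llama-2-7b-model",
  "model_type": "llm",
  "ctx_len": 4096,
  "ngl": 100,
  "n_parallel": 2,
  "cont_batching": true,
  "caching_enabled": true,
  "flash_attn": true,
  "cache_type": "f16",
  "use_mmap": true,
  "cpu_threads": 8
})";
```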