diff --git a/README.md b/README.md
index ff9cc5cac..3e06cacce 100644
--- a/README.md
+++ b/README.md
@@ -125,14 +125,17 @@ Table of parameters
 | `n_batch` | Integer | The batch size for prompt eval step |
 | `caching_enabled` | Boolean | To enable prompt caching or not |
 | `clean_cache_threshold` | Integer | Number of chats that will trigger clean cache action|
+| `grp_attn_n` | Integer | Group attention factor in self-extend |
+| `grp_attn_w` | Integer | Group attention width in self-extend |
 
 ***OPTIONAL***: You can run Nitro on a different port like 5000 instead of 3928 by running it manually in terminal
 
 ```zsh
-./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port])
+./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port] [uploads_folder_path])
 ```
 - thread_num : the number of threads that the nitro webserver needs to have
 - host : host value, normally 127.0.0.1 or 0.0.0.0
 - port : the port that nitro is deployed on
+- uploads_folder_path : custom path for file uploads in Drogon
 
 Nitro server is compatible with the OpenAI format, so you can expect the same output as the OpenAI ChatGPT API.
diff --git a/docs/docs/examples/chatboxgpt.md b/docs/docs/examples/chatboxgpt.md
index 20abfbf0c..5ec0c5fd3 100644
--- a/docs/docs/examples/chatboxgpt.md
+++ b/docs/docs/examples/chatboxgpt.md
@@ -1,5 +1,5 @@
 ---
-title: Nitro on browser
+title: Nitro with ChatGPTBox
 description: Nitro integration guide for use in a web browser.
 keywords: [Nitro, Google Chrome, browser, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
 ---
diff --git a/docs/docs/features/grammar.md b/docs/docs/features/grammar.md
new file mode 100644
index 000000000..8041a5aa7
--- /dev/null
+++ b/docs/docs/features/grammar.md
@@ -0,0 +1,40 @@
+---
+title: GBNF Grammar
+description: Constrain Nitro model output with GBNF grammars
+keywords: [Nitro, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
+---
+
+## GBNF Grammar
+
+GBNF (GGML BNF) makes it easy to set rules for how a model talks or writes. Think of it like teaching the model to always speak correctly, whether it's in emoji or proper JSON format.
+
+Backus-Naur Form (BNF) is a way to describe the rules of computer languages, files, and how they talk to each other. GBNF builds on BNF, adding modern features similar to those found in regular expressions.
+
+In GBNF, we create rules (production rules) to guide how a model forms its responses. These rules use a mix of fixed characters (like letters or emojis) and flexible parts that can change. Each rule follows the format: `nonterminal ::= sequence...`.
+
+To get a clearer picture, check out [this guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
+
+## Use GBNF Grammar in Nitro
+
+To make your Nitro model follow specific speaking or writing rules, use this command:
+
+```bash title="Nitro Inference With Grammar" {10}
+curl http://localhost:3928/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      {
+        "role": "user",
+        "content": "Who won the world series in 2020?"
+      }
+    ],
+    "grammar_file": "/path/to/grammarfile"
+  }'
+```
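+
+For reference, a grammar file is just plain text containing GBNF production rules. Here is a minimal illustrative sketch (the file name and rule below are examples, not shipped with Nitro) that restricts the model to answering only "yes" or "no"; pass its path in the `grammar_file` field as shown above:
+
+```text title="example.gbnf"
+# Illustrative rule set: the model may only reply "yes" or "no"
+root ::= ("yes" | "no")
+```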
diff --git a/docs/docs/features/load-unload.md b/docs/docs/features/load-unload.md
index 91c7daf37..22b20340f 100644
--- a/docs/docs/features/load-unload.md
+++ b/docs/docs/features/load-unload.md
@@ -77,4 +77,6 @@ In case you got error while loading models. Please check for the correct model p
 | `ai_prompt` | String | The prompt to use for the AI assistant. |
 | `system_prompt` | String | The prompt for system rules. |
 | `pre_prompt` | String | The prompt to use for internal configuration. |
-|`clean_cache_threshold`| Integer| Number of chats that will trigger clean cache action.|
\ No newline at end of file
+|`clean_cache_threshold`| Integer| Number of chats that will trigger clean cache action.|
+| `grp_attn_n` | Integer | Group attention factor in self-extend |
+| `grp_attn_w` | Integer | Group attention width in self-extend |
\ No newline at end of file
diff --git a/docs/docs/features/multi-thread.md b/docs/docs/features/multi-thread.md
index 2fe9b23d9..cf65bfa95 100644
--- a/docs/docs/features/multi-thread.md
+++ b/docs/docs/features/multi-thread.md
@@ -22,12 +22,13 @@ For more information on threading, visit [Drogon's Documentation](https://github
 To increase the number of threads used by Nitro, use the following command syntax:
 
 ```bash title="Nitro deploy server format"
-nitro [thread_num] [host] [port]
+nitro [thread_num] [host] [port] [uploads_folder_path]
 ```
 
 - **thread_num:** Specifies the number of threads for the Nitro server.
 - **host:** The host address, normally `127.0.0.1` (localhost) or `0.0.0.0` (all interfaces).
 - **port:** The port number where Nitro is to be deployed.
+- **uploads_folder_path:** Sets a custom path for file uploads in Drogon. If omitted, the current folder is used as the default location.
 
 To launch Nitro with 4 threads, enter this command in the terminal:
 ```bash title="Example"
diff --git a/docs/docs/features/self-extend.md b/docs/docs/features/self-extend.md
new file mode 100644
index 000000000..8856c1c27
--- /dev/null
+++ b/docs/docs/features/self-extend.md
@@ -0,0 +1,29 @@
+---
+title: Self-Extend
+description: Self-Extend LLM Context Window Without Tuning
+keywords: [long context, longlm, Nitro, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
+---
+
+## Enhancing LLMs with Self-Extend
+Self-Extend offers an innovative approach to increasing the context window of Large Language Models (LLMs) without the usual need for re-tuning. This method adapts the attention mechanism during the inference phase, eliminating the need for additional training or fine-tuning.
+
+For in-depth technical insights, refer to the research [paper](https://arxiv.org/pdf/2401.01325.pdf).
+
+## Activating Self-Extend for LLMs
+
+To activate the Self-Extend feature while loading your model, use the following command:
+
+```bash title="Enable Self-Extend" {6,7}
+curl http://localhost:3928/inferences/llamacpp/loadmodel \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "llama_model_path": "/path/to/your_model.gguf",
+    "ctx_len": 8192,
+    "grp_attn_n": 4,
+    "grp_attn_w": 2048
+  }'
+```
+
+**Note:**
+- For optimal performance, `grp_attn_w` should be as large as possible, but smaller than the training context length.
+- Setting `grp_attn_n` between 2 and 4 is recommended for peak efficiency. Higher values may result in increased incoherence in output.
\ No newline at end of file
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 0ff99d1cc..e5931ee3d 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -52,7 +52,9 @@ const sidebars = {
         "features/load-unload",
         "features/warmup",
         "features/prompt",
-        "features/log"
+        "features/log",
+        "features/self-extend",
+        "features/grammar",
       ],
     },
     {