Merge pull request #356 from janhq/chore/documentation-0.2.10
Documentation 0.2.10
tikikun authored Jan 17, 2024
2 parents 33c9540 + 0318d77 commit f4ac173
Showing 7 changed files with 75 additions and 5 deletions.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -125,14 +125,17 @@ Table of parameters
| `n_batch` | Integer | The batch size for the prompt evaluation step |
| `caching_enabled` | Boolean | Whether to enable prompt caching |
| `clean_cache_threshold` | Integer | Number of chats that triggers the clean-cache action |
| `grp_attn_n` | Integer | Group attention factor in self-extend |
| `grp_attn_w` | Integer | Group attention width in self-extend |

***OPTIONAL***: You can run Nitro on a port other than the default 3928 (for example, 5000) by starting it manually in the terminal:
```zsh
./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port])
./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port] [uploads_folder_path])
```
- thread_num: the number of threads for the Nitro web server
- host: the host address, normally 127.0.0.1 (localhost) or 0.0.0.0 (all interfaces)
- port: the port Nitro is deployed on
- uploads_folder_path: a custom path for file uploads in Drogon

Nitro server is compatible with the OpenAI format, so you can expect the same output as the OpenAI ChatGPT API.
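Because the endpoint follows the OpenAI format, a standard chat completion payload works as-is. Below is a minimal sketch in Python; the host and port are Nitro's defaults, and the message content is made up for illustration:

```python
import json

# Build an OpenAI-style chat completion payload for Nitro.
# http://localhost:3928 assumes Nitro's default host and port.
NITRO_URL = "http://localhost:3928/v1/chat/completions"

payload = {
    "messages": [
        {"role": "user", "content": "Hello, who won the world series in 2020?"}
    ]
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running Nitro server):
#   import urllib.request
#   req = urllib.request.Request(
#       NITRO_URL, data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

The same payload can be sent with any OpenAI-compatible client by pointing its base URL at the Nitro server.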

2 changes: 1 addition & 1 deletion docs/docs/examples/chatboxgpt.md
@@ -1,5 +1,5 @@
---
title: Nitro on browser
title: Nitro with ChatGPTBox
description: Nitro integration guide for use in a web browser.
keywords: [Nitro, Google Chrome, browser, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
---
33 changes: 33 additions & 0 deletions docs/docs/features/grammar.md
@@ -0,0 +1,33 @@
---
title: GBNF Grammar
description: Constrain Nitro model output with GBNF grammar
keywords: [Nitro, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
---

## GBNF Grammar

GBNF (GGML BNF) makes it easy to set rules for how a model talks or writes. Think of it like teaching the model to always speak correctly, whether it's in emoji or proper JSON format.

Backus-Naur Form (BNF) is a notation for describing the rules of computer languages, file formats, and protocols. GBNF builds on BNF, adding modern features similar to those found in regular expressions.

In GBNF, we create rules (production rules) to guide how a model forms its responses. These rules use a mix of fixed characters (like letters or emojis) and flexible parts that can change. Each rule follows a format: `nonterminal ::= sequence...`.
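As a small illustration (a hypothetical grammar, not taken from the Nitro docs), the following two rules force the model to answer only with the literal string `yes` or `no`:

```bnf title="Example GBNF grammar"
root   ::= answer
answer ::= "yes" | "no"
```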

To get a clearer picture, check out [this guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).

## Use GBNF Grammar in Nitro

To make your Nitro model follow specific speaking or writing rules, use this command:

```bash title="Nitro Inference With Grammar" {10}
curl http://localhost:3928/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "Who won the world series in 2020?"
}
],
"grammar_file": "/path/to/grammarfile"
}'
```
4 changes: 3 additions & 1 deletion docs/docs/features/load-unload.md
@@ -77,4 +77,6 @@ In case you got error while loading models. Please check for the correct model p
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
|`clean_cache_threshold`| Integer| Number of chats that will trigger clean cache action.|
|`clean_cache_threshold`| Integer| Number of chats that will trigger clean cache action.|
| `grp_attn_n` | Integer | Group attention factor in self-extend |
| `grp_attn_w` | Integer | Group attention width in self-extend |
3 changes: 2 additions & 1 deletion docs/docs/features/multi-thread.md
@@ -22,12 +22,13 @@ For more information on threading, visit [Drogon's Documentation](https://github
To increase the number of threads used by Nitro, use the following command syntax:

```bash title="Nitro deploy server format"
nitro [thread_num] [host] [port]
nitro [thread_num] [host] [port] [uploads_folder_path]
```

- **thread_num:** Specifies the number of threads for the Nitro server.
- **host:** The host address normally `127.0.0.1` (localhost) or `0.0.0.0` (all interfaces).
- **port:** The port number where Nitro is to be deployed.
- **uploads_folder_path:** Sets a custom path for file uploads in Drogon. If omitted, the current folder is used as the default location.

To launch Nitro with 4 threads, enter this command in the terminal:
```bash title="Example"
nitro 4 127.0.0.1 3928
```
29 changes: 29 additions & 0 deletions docs/docs/features/self-extend.md
@@ -0,0 +1,29 @@
---
title: Self-Extend
description: Self-Extend LLM Context Window Without Tuning
keywords: [long context, longlm, Nitro, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
---

## Enhancing LLMs with Self-Extend
Self-Extend offers an innovative approach to increase the context window of Large Language Models (LLMs) without the usual need for re-tuning. This method adapts the attention mechanism during the inference phase and eliminates the necessity for additional training or fine-tuning.

For in-depth technical insights, refer to their research [paper](https://arxiv.org/pdf/2401.01325.pdf).

## Activating Self-Extend for LLMs

To activate the Self-Extend feature while loading your model, use the following command:

```bash title="Enable Self-Extend" {6,7}
curl http://localhost:3928/inferences/llamacpp/loadmodel \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 8192,
"grp_attn_n": 4,
"grp_attn_w": 2048
}'
```

**Note:**
- For optimal performance, `grp_attn_w` should be as large as possible, but smaller than the training context length.
- Setting `grp_attn_n` between 2 and 4 is recommended for peak efficiency. Higher values may make the output increasingly incoherent.
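To build intuition for these two parameters, here is a simplified Python sketch of the grouped-position idea from the Self-Extend paper. This is an illustration only, not Nitro's actual implementation, and the function name is made up:

```python
def effective_position(i: int, grp_attn_n: int, grp_attn_w: int) -> int:
    """Map a token position to its effective attention position.

    Positions inside the local window (grp_attn_w) are kept as-is;
    beyond it, every grp_attn_n original positions share one effective
    position via floor division. This lets a model trained on a shorter
    context address a longer one without re-tuning (simplified sketch).
    """
    if i < grp_attn_w:
        return i
    return grp_attn_w + (i - grp_attn_w) // grp_attn_n

# Inside the window, positions map to themselves.
print(effective_position(100, 4, 2048))    # 100
# Far outside it, 4 original positions collapse onto 1 effective one,
# stretching the usable context roughly 4x.
print(effective_position(10000, 4, 2048))  # 2048 + (10000 - 2048) // 4
```

This is why `grp_attn_w` should stay below the training context length (the ungrouped window must fit in what the model saw during training) and why very large `grp_attn_n` hurts coherence: too many positions share one slot.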
4 changes: 3 additions & 1 deletion docs/sidebars.js
@@ -52,7 +52,9 @@ const sidebars = {
"features/load-unload",
"features/warmup",
"features/prompt",
"features/log"
"features/log",
"features/self-extend",
"features/grammar",
],
},
{
