Merge pull request #356 from janhq/chore/documentation-0.2.10
Documentation 0.2.10
tikikun authored Jan 17, 2024
2 parents 33c9540 + 0318d77 commit f4ac173
Showing 7 changed files with 75 additions and 5 deletions.
5 changes: 4 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -125,14 +125,17 @@ Table of parameters
| `n_batch` | Integer | The batch size for the prompt evaluation step |
| `caching_enabled` | Boolean | Whether to enable prompt caching |
| `clean_cache_threshold` | Integer | Number of chats that triggers the clean-cache action |
| `grp_attn_n` | Integer | Group attention factor in self-extend |
| `grp_attn_w` | Integer | Group attention width in self-extend |

***OPTIONAL***: You can run Nitro on a port other than the default 3928 (for example, 5000) by starting it manually in the terminal:
```zsh
./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port])
./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port] [uploads_folder_path])
```
- thread_num: the number of threads for the Nitro web server
- host: the host address, normally 127.0.0.1 (localhost) or 0.0.0.0 (all interfaces)
- port: the port Nitro is deployed on
- uploads_folder_path: a custom path for file uploads in Drogon

Nitro server is compatible with the OpenAI format, so you can expect the same output as the OpenAI ChatGPT API.
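Because the endpoint follows the OpenAI format, a standard chat completion payload works as-is. Below is a minimal sketch in Python; the host and port are Nitro's defaults, and the message content is made up for illustration:

```python
import json

# Build an OpenAI-style chat completion payload for Nitro.
# http://localhost:3928 assumes Nitro's default host and port.
NITRO_URL = "http://localhost:3928/v1/chat/completions"

payload = {
    "messages": [
        {"role": "user", "content": "Hello, who won the world series in 2020?"}
    ]
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running Nitro server):
#   import urllib.request
#   req = urllib.request.Request(
#       NITRO_URL, data=body.encode("utf-8"),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```

The same payload can be sent with any OpenAI-compatible client by pointing its base URL at the Nitro server.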

2 changes: 1 addition & 1 deletion docs/docs/examples/chatboxgpt.md
@@ -1,5 +1,5 @@
---
title: Nitro on browser
title: Nitro with ChatGPTBox
description: Nitro integration guide for use in a web browser.
keywords: [Nitro, Google Chrome, browser, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
---
33 changes: 33 additions & 0 deletions docs/docs/features/grammar.md
@@ -0,0 +1,33 @@
---
title: GBNF Grammar
description: Constrain Nitro model output with GBNF grammar
keywords: [Nitro, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
---

## GBNF Grammar

GBNF (GGML BNF) makes it easy to set rules for how a model talks or writes. Think of it like teaching the model to always speak correctly, whether it's in emoji or proper JSON format.

Backus-Naur Form (BNF) is a notation for describing the rules of computer languages, file formats, and protocols. GBNF builds on BNF, adding modern features similar to those found in regular expressions.

In GBNF, we create rules (production rules) to guide how a model forms its responses. These rules use a mix of fixed characters (like letters or emojis) and flexible parts that can change. Each rule follows a format: `nonterminal ::= sequence...`.
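As a small illustration (a hypothetical grammar, not taken from the Nitro docs), the following two rules force the model to answer only with the literal string `yes` or `no`:

```bnf title="Example GBNF grammar"
root   ::= answer
answer ::= "yes" | "no"
```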

To get a clearer picture, check out [this guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).

## Use GBNF Grammar in Nitro

To make your Nitro model follow specific speaking or writing rules, use this command:

```bash title="Nitro Inference With Grammar" {10}
curl http://localhost:3928/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "Who won the world series in 2020?"
}
],
"grammar_file": "/path/to/grammarfile"
}'
```
4 changes: 3 additions & 1 deletion docs/docs/features/load-unload.md
@@ -77,4 +77,6 @@ In case you got error while loading models. Please check for the correct model p
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
|`clean_cache_threshold`| Integer| Number of chats that will trigger clean cache action.|
|`clean_cache_threshold`| Integer| Number of chats that will trigger clean cache action.|
| `grp_attn_n` | Integer | Group attention factor in self-extend |
| `grp_attn_w` | Integer | Group attention width in self-extend |
3 changes: 2 additions & 1 deletion docs/docs/features/multi-thread.md
@@ -22,12 +22,13 @@ For more information on threading, visit [Drogon's Documentation](https://github
To increase the number of threads used by Nitro, use the following command syntax:

```bash title="Nitro deploy server format"
nitro [thread_num] [host] [port]
nitro [thread_num] [host] [port] [uploads_folder_path]
```

- **thread_num:** Specifies the number of threads for the Nitro server.
- **host:** The host address normally `127.0.0.1` (localhost) or `0.0.0.0` (all interfaces).
- **port:** The port number where Nitro is to be deployed.
- **uploads_folder_path:** Sets a custom path for file uploads in Drogon. If omitted, the current folder is used as the default location.

To launch Nitro with 4 threads, enter this command in the terminal:
```bash title="Example"
nitro 4 127.0.0.1 3928
```
29 changes: 29 additions & 0 deletions docs/docs/features/self-extend.md
@@ -0,0 +1,29 @@
---
title: Self-Extend
description: Self-Extend LLM Context Window Without Tuning
keywords: [long context, longlm, Nitro, Jan, fast inference, inference server, local AI, large language model, OpenAI compatible, open source, llama]
---

## Enhancing LLMs with Self-Extend
Self-Extend offers an innovative approach to increase the context window of Large Language Models (LLMs) without the usual need for re-tuning. This method adapts the attention mechanism during the inference phase and eliminates the necessity for additional training or fine-tuning.

For in-depth technical insights, refer to their research [paper](https://arxiv.org/pdf/2401.01325.pdf).

## Activating Self-Extend for LLMs

To activate the Self-Extend feature while loading your model, use the following command:

```bash title="Enable Self-Extend" {6,7}
curl http://localhost:3928/inferences/llamacpp/loadmodel \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 8192,
"grp_attn_n": 4,
"grp_attn_w": 2048
}'
```

**Note:**
- For optimal performance, `grp_attn_w` should be as large as possible, but smaller than the training context length.
- Setting `grp_attn_n` between 2 and 4 is recommended for peak efficiency. Higher values may make the output increasingly incoherent.
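To build intuition for these two parameters, here is a simplified Python sketch of the grouped-position idea from the Self-Extend paper. This is an illustration only, not Nitro's actual implementation, and the function name is made up:

```python
def effective_position(i: int, grp_attn_n: int, grp_attn_w: int) -> int:
    """Map a token position to its effective attention position.

    Positions inside the local window (grp_attn_w) are kept as-is;
    beyond it, every grp_attn_n original positions share one effective
    position via floor division. This lets a model trained on a shorter
    context address a longer one without re-tuning (simplified sketch).
    """
    if i < grp_attn_w:
        return i
    return grp_attn_w + (i - grp_attn_w) // grp_attn_n

# Inside the window, positions map to themselves.
print(effective_position(100, 4, 2048))    # 100
# Far outside it, 4 original positions collapse onto 1 effective one,
# stretching the usable context roughly 4x.
print(effective_position(10000, 4, 2048))  # 2048 + (10000 - 2048) // 4
```

This is why `grp_attn_w` should stay below the training context length (the ungrouped window must fit in what the model saw during training) and why very large `grp_attn_n` hurts coherence: too many positions share one slot.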
4 changes: 3 additions & 1 deletion docs/sidebars.js
@@ -52,7 +52,9 @@ const sidebars = {
"features/load-unload",
"features/warmup",
"features/prompt",
"features/log"
"features/log",
"features/self-extend",
"features/grammar",
],
},
{
