
server: tests:
  * start the server at each scenario
  * split the features as each requires different server config
phymbert committed Feb 21, 2024
1 parent 68b8d4e commit 6406208
Showing 6 changed files with 197 additions and 173 deletions.
9 changes: 6 additions & 3 deletions examples/server/tests/README.md
@@ -7,10 +7,13 @@ Server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_

### Run tests
1. Build the server
2. download a GGUF model: `./scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the test: `./tests.sh stories260K.gguf -ngl 23`
2. download required models:
1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the test: `./tests.sh`

To change the server path, use the `LLAMA_SERVER_BIN_PATH` environment variable.

### Skipped scenario

Scenario must be annotated with `@llama.cpp` to be included in the scope.
Feature or Scenario must be annotated with `@llama.cpp` to be included in the scope.
`@bug` annotation aims to link a scenario with a GitHub issue.
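
For context, a minimal sketch of how a test harness might honor `LLAMA_SERVER_BIN_PATH` when launching the server; the helper name and fallback path below are hypothetical, not taken from this commit:

```python
import os
import subprocess

def start_server(extra_args=None):
    # LLAMA_SERVER_BIN_PATH overrides the server binary location; the
    # fallback path is a placeholder, not the project's actual default.
    server_bin = os.environ.get("LLAMA_SERVER_BIN_PATH", "../../../build/bin/server")
    # Launch the server as a child process so the test can stop it afterwards.
    return subprocess.Popen([server_bin] + (extra_args or []))
```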
4 changes: 4 additions & 0 deletions examples/server/tests/features/environment.py
@@ -0,0 +1,4 @@

def after_scenario(context, scenario):
print("stopping server...")
context.server_process.kill()
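
The hook above assumes each scenario has stored its server handle on the behave context. A minimal sketch of the matching start-up side follows; the commit's real step definitions live in a file not shown in this excerpt, and `context.server_args` is an assumed attribute assembled by earlier steps:

```python
import subprocess
import time

import requests
from behave import step


@step("the server is starting")
def step_start_server(context):
    # One server per scenario; after_scenario() above kills it again.
    context.server_process = subprocess.Popen(context.server_args)


@step("the server is healthy")
def step_server_healthy(context):
    # Poll the /health endpoint until the server is ready to serve requests.
    for _ in range(60):
        try:
            if requests.get("http://localhost:8080/health").status_code == 200:
                return
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(0.5)
    raise AssertionError("server did not become healthy in time")
```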
49 changes: 49 additions & 0 deletions examples/server/tests/features/security.feature
@@ -0,0 +1,49 @@
@llama.cpp
Feature: Security

Background: Server startup with an api key defined
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a server api key llama.cpp
Then the server is starting

Scenario Outline: Completion with some user api key
Given a prompt test
And a user api key <api_key>
And 4 max tokens to predict
And a completion request with <api_error> api error

Examples: Prompts
| api_key | api_error |
| llama.cpp | no |
| llama.cpp | no |
| hackeme | raised |
| | raised |

Scenario Outline: OAI Compatibility
Given a system prompt test
And a user prompt test
And a model test
And 2 max tokens to predict
And streaming is disabled
And a user api key <api_key>
Given an OAI compatible chat completions request with <api_error> api error

Examples: Prompts
| api_key | api_error |
| llama.cpp | no |
| llama.cpp | no |
| hackme | raised |


Scenario Outline: CORS Options
When an OPTIONS request is sent from <origin>
Then CORS header <cors_header> is set to <cors_header_value>

Examples: Headers
| origin | cors_header | cors_header_value |
| localhost | Access-Control-Allow-Origin | localhost |
| web.mydomain.fr | Access-Control-Allow-Origin | web.mydomain.fr |
| origin | Access-Control-Allow-Credentials | true |
| web.mydomain.fr | Access-Control-Allow-Methods | POST |
| web.mydomain.fr | Access-Control-Allow-Headers | * |
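
The api-key scenarios above hinge on a step that issues a completion request with the scenario's user api key and records whether the server rejected it. A minimal sketch of that check, assuming a Bearer Authorization header and a 401 status on rejection (both assumptions, not confirmed by this diff):

```python
import requests

BASE_URL = "http://localhost:8080"  # matches "a server listening on localhost:8080"

def request_completion(prompt, n_predict, user_api_key=None):
    headers = {}
    if user_api_key:
        # Assumption: the key is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {user_api_key}"
    return requests.post(f"{BASE_URL}/completion",
                         json={"prompt": prompt, "n_predict": n_predict},
                         headers=headers)

def check_api_error(response, api_error):
    if api_error == "raised":
        # Assumption: an invalid or missing key is rejected with 401.
        assert response.status_code == 401
    else:  # "no"
        assert response.status_code == 200
```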
128 changes: 20 additions & 108 deletions examples/server/tests/features/server.feature
@@ -1,127 +1,53 @@
@llama.cpp
Feature: llama.cpp server

Background: Server startup
Given a server listening on localhost:8080 with 2 slots, 42 as seed and llama.cpp as api key
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model alias tinyllama-2
And 42 as server seed
And 32 KV cache size
And 1 slots
And 32 server max tokens to predict
Then the server is starting
Then the server is healthy

@llama.cpp
Scenario: Health
When the server is healthy
Then the server is ready
And all slots are idle

@llama.cpp
Scenario Outline: Completion
Given a prompt <prompt>
And a user api key <api_key>
And <n_predict> max tokens to predict
And a completion request
Then <n_predict> tokens are predicted
And a completion request with no api error
Then <n_predicted> tokens are predicted with content: <content>

Examples: Prompts
| prompt | n_predict | api_key |
| I believe the meaning of life is | 128 | llama.cpp |
| Write a joke about AI | 512 | llama.cpp |
| say goodbye | 0 | |
| prompt | n_predict | content | n_predicted |
| I believe the meaning of life is | 8 | <space>going to read. | 8 |
| Write a joke about AI | 64 | tion came to the park. And all his friends were very scared and did not | 32 |

@llama.cpp
Scenario Outline: OAI Compatibility
Given a system prompt <system_prompt>
Given a model <model>
And a system prompt <system_prompt>
And a user prompt <user_prompt>
And a model <model>
And <max_tokens> max tokens to predict
And streaming is <enable_streaming>
And a user api key <api_key>
Given an OAI compatible chat completions request with an api error <api_error>
Then <max_tokens> tokens are predicted
Given an OAI compatible chat completions request with no api error
Then <n_predicted> tokens are predicted with content: <content>

Examples: Prompts
| model | system_prompt | user_prompt | max_tokens | enable_streaming | api_key | api_error |
| llama-2 | You are ChatGPT. | Say hello. | 64 | false | llama.cpp | none |
| codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 512 | true | llama.cpp | none |
| John-Doe | You are an hacker. | Write segfault code in rust. | 0 | true | hackme | raised |
| model | system_prompt | user_prompt | max_tokens | content | n_predicted | enable_streaming |
| llama-2 | Book | What is the best book | 8 | "Mom, what' | 8 | disabled |
| codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 64 | "Hey," said the bird.<LF>The bird was very happy and thanked the bird for hel | 32 | enabled |

@llama.cpp
Scenario: Multi users
Given a prompt:
"""
Write a very long story about AI.
"""
And a prompt:
"""
Write another very long music lyrics.
"""
And 32 max tokens to predict
And a user api key llama.cpp
Given concurrent completion requests
Then the server is busy
And all slots are busy
Then the server is idle
And all slots are idle
Then all prompts are predicted

@llama.cpp
Scenario: Multi users OAI Compatibility
Given a system prompt "You are an AI assistant."
And a model tinyllama-2
Given a prompt:
"""
Write a very long story about AI.
"""
And a prompt:
"""
Write another very long music lyrics.
"""
And 32 max tokens to predict
And streaming is enabled
And a user api key llama.cpp
Given concurrent OAI completions requests
Then the server is busy
And all slots are busy
Then the server is idle
And all slots are idle
Then all prompts are predicted

# FIXME: #3969 infinite loop on the CI, not locally, if n_prompt * n_predict > kv_size
@llama.cpp
Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size
Given a prompt:
"""
Write a very long story about AI.
"""
And a prompt:
"""
Write another very long music lyrics.
"""
And a prompt:
"""
Write a very long poem.
"""
And a prompt:
"""
Write a very long joke.
"""
And 512 max tokens to predict
And a user api key llama.cpp
Given concurrent completion requests
Then the server is busy
And all slots are busy
Then the server is idle
And all slots are idle
Then all prompts are predicted


@llama.cpp
Scenario: Embedding
When embeddings are computed for:
"""
What is the capital of Bulgaria ?
"""
Then embeddings are generated


@llama.cpp
Scenario: OAI Embeddings compatibility
Given a model tinyllama-2
When an OAI compatible embeddings computation request for:
@@ -131,23 +57,9 @@ Feature: llama.cpp server
Then embeddings are generated


@llama.cpp
Scenario: Tokenize / Detokenize
When tokenizing:
"""
What is the capital of France ?
"""
Then tokens can be detokenize

@llama.cpp
Scenario Outline: CORS Options
When an OPTIONS request is sent from <origin>
Then CORS header <cors_header> is set to <cors_header_value>

Examples: Headers
| origin | cors_header | cors_header_value |
| localhost | Access-Control-Allow-Origin | localhost |
| web.mydomain.fr | Access-Control-Allow-Origin | web.mydomain.fr |
| origin | Access-Control-Allow-Credentials | true |
| web.mydomain.fr | Access-Control-Allow-Methods | POST |
| web.mydomain.fr | Access-Control-Allow-Headers | * |
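
The multi-user scenarios rely on a "concurrent completion requests" step that fires one request per queued prompt and later verifies that all prompts are predicted. A minimal thread-based sketch of that idea; the endpoint payload and result handling are assumptions, not the commit's actual implementation:

```python
import threading

import requests

def run_concurrent_completions(prompts, n_predict, base_url="http://localhost:8080"):
    # One request per queued prompt, all in flight at the same time.
    results = [None] * len(prompts)

    def worker(i, prompt):
        r = requests.post(f"{base_url}/completion",
                          json={"prompt": prompt, "n_predict": n_predict})
        results[i] = r.json()

    threads = [threading.Thread(target=worker, args=(i, p)) for i, p in enumerate(prompts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # "all prompts are predicted" can then check each result for generated content.
    return results
```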
