server : refactor #5882

ggerganov · 2024-03-05T09:25:07Z

Moved the code around to logically similar things closer together and did some renaming. The cache_tokens management should be improved - it's now updated only when params.cache_tokens == true. It was also misused to count the number of tokens that have been processed - use n_past instead.

The context shift and self-extend logic is still very ugly - need to simplify this, but will be done in another PR because changes are getting too much in this one. Thinking about merging it within llama_sampling_context, but the system_prompt is making things difficult. Probably a first step would be to simplify the system_prompt management and then try to refactor the rest

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future
(future PR) Refactor context shift + self-extend into common and reuse across examples
Improve main loop batching - do not process more than n_batch tokens per iter. Need better slot managing logic. This will fix batched embeddings and long-blocking requests
Fix the code style
Disable prompt caching when self-extend is enabled

phymbert · 2024-03-05T14:23:53Z

nice @ggerganov, it is worth the effort. Please tell me which additional tests might help you to secure it.

ggerganov · 2024-03-05T16:31:01Z

Here is a test case that fails on master and I am looking to fix in this PR:

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8

diff --git a/examples/server-embd.py b/examples/server-embd.py
index c5c4ea87..118e0427 100644
--- a/examples/server-embd.py
+++ b/examples/server-embd.py
@@ -13,7 +13,7 @@ async def main():
     model_url = "http://127.0.0.1:6900"
     responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
         url= f"{model_url}/embedding",
-        json= {"content": str(i)*1024}
+        json= {"content": str(0)*1024}
     ) for i in range(n)])
 
     for response in responses:

python3 ./examples/server-embd.py
[-0.691986083984375, -0.6994509100914001, -0.34556347131729126, -0.16304560005664825, -0.6425938606262207, 0.6009216904640198, 0.8178988099098206, -0.2026521861553192]
[-0.033909425139427185, -0.11941321194171906, -0.7864583730697632, -0.24582073092460632, -0.17017485201358795, 0.0760011374950409, 0.3244125545024872, -0.38041409850120544]
[-0.0339287631213665, -0.11942094564437866, -0.7864440083503723, -0.24585771560668945, -0.17014935612678528, 0.07596983015537262, 0.32442623376846313, -0.38041090965270996]
[-0.03392859548330307, -0.11942102015018463, -0.7864437699317932, -0.24585801362991333, -0.17014899849891663, 0.07596953213214874, 0.324426531791687, -0.380410760641098]
[-0.46300897002220154, -0.14666011929512024, -0.646159291267395, 0.10496828705072403, -0.528754711151123, 0.09720712900161743, 0.7127938866615295, 0.019480813294649124]
[-0.033925510942935944, -0.1194162666797638, -0.7864490151405334, -0.24582067131996155, -0.17019721865653992, 0.07600662112236023, 0.32440948486328125, -0.38042333722114563]
[-0.03392315283417702, -0.1194247156381607, -0.786466121673584, -0.2458271086215973, -0.1701907217502594, 0.07600513100624084, 0.3244108259677887, -0.38043472170829773]
[-0.35006847977638245, -0.09708067774772644, -0.7097396850585938, 0.1826569139957428, -0.36089810729026794, 0.19932880997657776, 0.8620818853378296, 0.2980068325996399]
Similarity between 0 and 1: 0.41
Similarity between 0 and 2: 0.41
Similarity between 0 and 3: 0.41
Similarity between 0 and 4: 0.76
Similarity between 0 and 5: 0.41
Similarity between 0 and 6: 0.41
Similarity between 0 and 7: 0.69
Similarity between 1 and 2: 1.00
Similarity between 1 and 3: 1.00
Similarity between 1 and 4: 0.47
Similarity between 1 and 5: 1.00
Similarity between 1 and 6: 1.00
Similarity between 1 and 7: 0.45
Similarity between 2 and 3: 1.00
Similarity between 2 and 4: 0.47
Similarity between 2 and 5: 1.00
Similarity between 2 and 6: 1.00
Similarity between 2 and 7: 0.45
Similarity between 3 and 4: 0.47
Similarity between 3 and 5: 1.00
Similarity between 3 and 6: 1.00
Similarity between 3 and 7: 0.45
Similarity between 4 and 5: 0.47
Similarity between 4 and 6: 0.47
Similarity between 4 and 7: 0.95
Similarity between 5 and 6: 1.00
Similarity between 5 and 7: 0.45
Similarity between 6 and 7: 0.45

It should return similarity of 1.00 for all pairs. But currently it fails because the logic in update_slots() truncates and splits the input prompts. This is incorrect - they have to be processed in whole when using embedding models such as BERT. If the prompt does not fit in the batch size (n_batch) we should return some error

chigkim · 2024-03-06T11:59:17Z

@ggerganov Could you please don't "remove multimodal capabilities" from server? Many blind users are relying on it for image description.
Thanks so much!!!

phymbert · 2024-03-06T16:36:10Z

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8

@ggerganov Could you please upload on ggml-org/models a bert-bge-small ?

The one I found cannot tokenize anymore CompendiumLabs/embedding-models-english

tokenize error to_lower(104)

__pthread_kill_implementation 0x00007fffeee99a1b
__pthread_kill_internal 0x00007fffeee99a1b
__GI___pthread_kill 0x00007fffeee99a1b
__GI_raise 0x00007fffeee428e6
__GI_abort 0x00007fffeee268b7
<unknown> 0x00007fffef2a4f06
<unknown> 0x00007fffef2b6e6c
std::terminate() 0x00007fffef2b6ed7
__cxa_throw 0x00007fffef2b7138
std::__throw_runtime_error(char const*) 0x00007fffef2a8338
std::locale::facet::_S_create_c_locale(__locale_struct*&, char const*, __locale_struct*) 0x00007fffef2db108
std::locale::_Impl::_Impl(char const*, unsigned long) 0x00007fffef2cc9a5
std::locale::locale(char const*) 0x00007fffef2cd585
llm_tokenizer_wpm::to_lower llama.cpp:9355
llm_tokenizer_wpm::preprocess llama.cpp:9319
llm_tokenizer_wpm::tokenize llama.cpp:9246
llama_tokenize_internal llama.cpp:9593
llama_tokenize llama.cpp:13309
llama_tokenize common.cpp:1407
llama_tokenize common.cpp:1396
llama_server_context::tokenize server.cpp:515
llama_server_context::update_slots server.cpp:1719
std::__invoke_impl<…> invoke.h:74
std::__invoke<…> invoke.h:96
std::_Bind<bool (llama_server_context::*(llama_server_context*))()>::__call<bool, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) functional:506
std::_Bind<bool (llama_server_context::*(llama_server_context*))()>::operator()<, bool>() functional:591
std::__invoke_impl<…>(std::__invoke_other, std::_Bind<…> &) invoke.h:61
std::__invoke_r<…>(std::_Bind<…> &) invoke.h:150
std::_Function_handler::_M_invoke(const std::_Any_data &) std_function.h:290
std::function::operator()() const std_function.h:591
llama_server_queue::start_loop utils.hpp:315
main server.cpp:3552
__libc_start_call_main 0x00007fffeee28150
__libc_start_main_impl 0x00007fffeee28209
_start 0x00005555555678a5

ggerganov · 2024-03-06T16:55:11Z

Here is one: https://huggingface.co/ggml-org/models/tree/main/bert-bge-small

ggerganov · 2024-03-06T19:41:33Z

@phymbert The prompt str(0)*1024 tokenizes to 513 tokens so I bumped the batch size in the embeddings tests to 1024, otherwise the prompts are not processed because they don't fit entirely in the batch. The tests pass on my local machine - let's see if it works in the CI

phymbert · 2024-03-06T19:43:13Z

@phymbert The prompt str(0)*1024 tokenizes to 513 tokens so I bumped the batch size in the embeddings tests to 1024, otherwise the prompts are not processed because they don't fit entirely in the batch. The tests pass on my local machine - let's see if it works in the CI

@ggerganov Thanks, yes but if you increase the KV Cache Size to 2048, it does not pass anymore.

ggerganov · 2024-03-06T19:52:13Z

Can you find out why? It works for me

phymbert · 2024-03-07T09:41:58Z

examples/server/server.cpp

+        slot.sparams.mirostat_tau      = json_value(data, "mirostat_tau",      default_sparams.mirostat_tau);
+        slot.sparams.mirostat_eta      = json_value(data, "mirostat_eta",      default_sparams.mirostat_eta);
+        slot.sparams.penalize_nl       = json_value(data, "penalize_nl",       default_sparams.penalize_nl);
+        slot.params.n_keep             = json_value(data, "n_keep",            slot.params.n_keep);


@ggerganov For OAI Completions, "n_keep" is not set in json data, so it's always 0 and it triggers context shifthing with n_keep=1:

llama.cpp/examples/server/server.cpp

Line 1609 in 87a4a10

const int n_keep = slot.params.n_keep + add_bos_token;

llama.cpp/examples/server/server.cpp

Line 1749 in 87a4a10

if (slot.params.n_keep < 0) {

Or something I dont understand ?

Hm, yes - likely this has to default to default_params.n_keep

ggerganov · 2024-03-07T09:43:44Z

It's likely that I broke something, but merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment))

sorasoras · 2024-03-08T16:06:38Z

It's likely that I broke something, but merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment))

It looks like server.exe wouldn't open after this pr with rocm on windows
There is no response

cmake .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON  -DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/5.7/bin/clang.exe" -DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/5.7/bin/clang++.exe" -DAMDGPU_TARGETS="gfx1100;gfx1030;gfx1031"

 cmake --build . --config Release

@ggerganov

* server : refactoring (wip) * server : remove llava/clip objects from build * server : fix empty prompt handling + all slots idle logic * server : normalize id vars * server : code style * server : simplify model chat template validation * server : code style * server : minor * llama : llama_chat_apply_template support null buf * server : do not process embedding requests when disabled * server : reorganize structs and enums + naming fixes * server : merge oai.hpp in utils.hpp * server : refactor system prompt update at start * server : disable cached prompts with self-extend * server : do not process more than n_batch tokens per iter * server: tests: embeddings use a real embeddings model (ggerganov#5908) * server, tests : bump batch to fit 1 embedding prompt * server: tests: embeddings fix build type Debug is randomly failing (ggerganov#5911) * server: tests: embeddings, use different KV Cache size * server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings * server: tests: embeddings, no need to wait for server idle as it can timout * server: refactor: clean up http code (ggerganov#5912) * server : avoid n_available var ggml-ci * server: refactor: better http codes * server : simplify json parsing + add comment about t_last * server : rename server structs * server : allow to override FQDN in tests ggml-ci * server : add comments --------- Co-authored-by: Pierrick Hymbert <[email protected]>

whoreson · 2024-03-12T12:51:58Z

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

...aaand my library of archived llama.cpp versions just gained another entry

* server : refactoring (wip) * server : remove llava/clip objects from build * server : fix empty prompt handling + all slots idle logic * server : normalize id vars * server : code style * server : simplify model chat template validation * server : code style * server : minor * llama : llama_chat_apply_template support null buf * server : do not process embedding requests when disabled * server : reorganize structs and enums + naming fixes * server : merge oai.hpp in utils.hpp * server : refactor system prompt update at start * server : disable cached prompts with self-extend * server : do not process more than n_batch tokens per iter * server: tests: embeddings use a real embeddings model (ggerganov#5908) * server, tests : bump batch to fit 1 embedding prompt * server: tests: embeddings fix build type Debug is randomly failing (ggerganov#5911) * server: tests: embeddings, use different KV Cache size * server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings * server: tests: embeddings, no need to wait for server idle as it can timout * server: refactor: clean up http code (ggerganov#5912) * server : avoid n_available var ggml-ci * server: refactor: better http codes * server : simplify json parsing + add comment about t_last * server : rename server structs * server : allow to override FQDN in tests ggml-ci * server : add comments --------- Co-authored-by: Pierrick Hymbert <[email protected]>

Kreijstal · 2024-03-18T10:24:18Z

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

Hmm wouldn't it be wiser to implement it properly before removing it.

Dampfinchen · 2024-03-22T08:07:00Z

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

...aaand my library of archived llama.cpp versions just gained another entry

I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?

phymbert · 2024-03-22T08:41:43Z

I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?

Sorry for that, but the previous implementation was making impossible to refactor the code properly and it was causing performance issue, so this temporary removal is for the good of the server adoption. The removal has been tracked in:

Feel free to contribute.

cartertemm · 2024-03-25T00:15:27Z

For @chigkim, and anyone else with a use case requiring multimodal capabilities e.g. image descriptions for blind users: not exactly a workaround, but the last release supporting this seems to be b2356 so you can link to that in your app's documentation which is what I'm having to do in the meantime.

* server : refactoring (wip) * server : remove llava/clip objects from build * server : fix empty prompt handling + all slots idle logic * server : normalize id vars * server : code style * server : simplify model chat template validation * server : code style * server : minor * llama : llama_chat_apply_template support null buf * server : do not process embedding requests when disabled * server : reorganize structs and enums + naming fixes * server : merge oai.hpp in utils.hpp * server : refactor system prompt update at start * server : disable cached prompts with self-extend * server : do not process more than n_batch tokens per iter * server: tests: embeddings use a real embeddings model (ggerganov#5908) * server, tests : bump batch to fit 1 embedding prompt * server: tests: embeddings fix build type Debug is randomly failing (ggerganov#5911) * server: tests: embeddings, use different KV Cache size * server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings * server: tests: embeddings, no need to wait for server idle as it can timout * server: refactor: clean up http code (ggerganov#5912) * server : avoid n_available var ggml-ci * server: refactor: better http codes * server : simplify json parsing + add comment about t_last * server : rename server structs * server : allow to override FQDN in tests ggml-ci * server : add comments --------- Co-authored-by: Pierrick Hymbert <[email protected]>

GrigoryEvko · 2024-04-10T23:38:23Z

Really waiting for multimodal capabilities back!

And update example docs please, it still mentions --mmproj

Also removed LLM logic from alerts_controller.go. Note that llama.cpp's server does not currently support multimodal requests: ggerganov/llama.cpp#5882

ggerganov added 6 commits March 5, 2024 11:16

server : refactoring (wip)

f4e6e7e

server : remove llava/clip objects from build

ef7eb33

server : fix empty prompt handling + all slots idle logic

134f5fe

server : normalize id vars

ad1d746

server : code style

fef64c5

server : simplify model chat template validation

b1b3ba8

ggerganov added 3 commits March 5, 2024 16:42

server : code style

f4800d5

server : minor

7635b13

llama : llama_chat_apply_template support null buf

f84809b

ngxson mentioned this pull request Mar 5, 2024

server: maintain chat completion id for streaming responses #5880

Closed

ggerganov added 4 commits March 5, 2024 18:58

server : do not process embedding requests when disabled

22ae1a6

server : reorganize structs and enums + naming fixes

cb3ce0b

server : merge oai.hpp in utils.hpp

4a2d5f6

server : refactor system prompt update at start

61b6370

This was referenced Mar 6, 2024

server: multimodal - fix misreported prompt and num prompt tokens #5896

Closed

llama cpp server not doing parallel inference for llava when using flags -np and -cb #5592

Closed

ngxson mentioned this pull request Mar 6, 2024

Inconsistent Bert Embedding output from embedding.cpp vs llama.cpp server #5801

Closed

server : disable cached prompts with self-extend

aef02b1

cjpais mentioned this pull request Mar 6, 2024

Question about llama.cpp and llava-cli when used with llava 1.6 for vision: #5852

Closed

ggerganov and others added 3 commits March 6, 2024 21:17

server : do not process more than n_batch tokens per iter

bfb121f

server: tests: embeddings use a real embeddings model (#5908)

79ef3c0

server, tests : bump batch to fit 1 embedding prompt

36e12f8

ggerganov deleted the gg/refactor-server branch March 7, 2024 09:41

phymbert reviewed Mar 7, 2024

View reviewed changes

ngxson mentioned this pull request Mar 8, 2024

server: error handling #5776

Closed

ngxson mentioned this pull request Mar 9, 2024

Server: format error to json #5961

Merged

sorasoras mentioned this pull request Mar 9, 2024

Server stop working after recent refactor on Windows #5963

Closed

phymbert mentioned this pull request Mar 12, 2024

server : improvements and maintenance #4216

Open

10 tasks

chigkim mentioned this pull request Mar 12, 2024

LLaVA 1.6 Models Unable to Process Specific Image Size and Resolution Locally ollama/ollama#2429

Open

rezacopol mentioned this pull request Mar 19, 2024

Bring back multimodal support for server #6168

Closed

Dampfinchen mentioned this pull request Mar 22, 2024

Unable to assign mmproj value when running docker #6226

Closed

mscheong01 mentioned this pull request Mar 26, 2024

Add multimodal example #6313

Closed

Jeximo mentioned this pull request May 7, 2024

Server: Multimodal Model Input Parameter No longer Exists #7112

Closed

goodnight654 mentioned this pull request May 13, 2024

llama.cpp delete the Multimodal support lxe/llavavision#3

Open

nischalj10 mentioned this pull request Jun 14, 2024

power with llama cpp under the hood instead of ollama nischalj10/local-recall#6

Open

ngxson mentioned this pull request Jun 19, 2024

server: Bring back multimodal support #8010

Open

nischalj10 mentioned this pull request Jul 23, 2024

add support for llama.cpp local server nischalj10/headless-ollama#1

Open

abenmrad mentioned this pull request Sep 16, 2024

Support Mistral's new visual model: Pixtral-12b-240910 ollama/ollama#6748

Open

HanClinto mentioned this pull request Sep 26, 2024

Llama-3.2 11B Vision Support #9643

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server : refactor #5882

server : refactor #5882

ggerganov commented Mar 5, 2024 •

edited

Loading

phymbert commented Mar 5, 2024 •

edited

Loading

ggerganov commented Mar 5, 2024

chigkim commented Mar 6, 2024

phymbert commented Mar 6, 2024

ggerganov commented Mar 6, 2024

ggerganov commented Mar 6, 2024

phymbert commented Mar 6, 2024 •

edited

Loading

ggerganov commented Mar 6, 2024

phymbert Mar 7, 2024 •

edited

Loading

ggerganov Mar 7, 2024

ggerganov commented Mar 7, 2024

sorasoras commented Mar 8, 2024 •

edited

Loading

whoreson commented Mar 12, 2024

Kreijstal commented Mar 18, 2024

Dampfinchen commented Mar 22, 2024

phymbert commented Mar 22, 2024

cartertemm commented Mar 25, 2024 •

edited

Loading

GrigoryEvko commented Apr 10, 2024

server : refactor #5882

server : refactor #5882

Conversation

ggerganov commented Mar 5, 2024 • edited Loading

phymbert commented Mar 5, 2024 • edited Loading

ggerganov commented Mar 5, 2024

chigkim commented Mar 6, 2024

phymbert commented Mar 6, 2024

ggerganov commented Mar 6, 2024

ggerganov commented Mar 6, 2024

phymbert commented Mar 6, 2024 • edited Loading

ggerganov commented Mar 6, 2024

phymbert Mar 7, 2024 • edited Loading

Choose a reason for hiding this comment

ggerganov Mar 7, 2024

Choose a reason for hiding this comment

ggerganov commented Mar 7, 2024

sorasoras commented Mar 8, 2024 • edited Loading

whoreson commented Mar 12, 2024

Kreijstal commented Mar 18, 2024

Dampfinchen commented Mar 22, 2024

phymbert commented Mar 22, 2024

cartertemm commented Mar 25, 2024 • edited Loading

GrigoryEvko commented Apr 10, 2024

ggerganov commented Mar 5, 2024 •

edited

Loading

phymbert commented Mar 5, 2024 •

edited

Loading

phymbert commented Mar 6, 2024 •

edited

Loading

phymbert Mar 7, 2024 •

edited

Loading

sorasoras commented Mar 8, 2024 •

edited

Loading

cartertemm commented Mar 25, 2024 •

edited

Loading