server : refactor #5882
Conversation
Nice @ggerganov, it is worth the effort. Please tell me which additional tests might help you secure it.
Here is a test case that fails on:

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8

diff --git a/examples/server-embd.py b/examples/server-embd.py
index c5c4ea87..118e0427 100644
--- a/examples/server-embd.py
+++ b/examples/server-embd.py
@@ -13,7 +13,7 @@ async def main():
    model_url = "http://127.0.0.1:6900"
    responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
        url= f"{model_url}/embedding",
-        json= {"content": str(i)*1024}
+        json= {"content": str(0)*1024}
    ) for i in range(n)])
    for response in responses:

Output of python3 ./examples/server-embd.py:
[-0.691986083984375, -0.6994509100914001, -0.34556347131729126, -0.16304560005664825, -0.6425938606262207, 0.6009216904640198, 0.8178988099098206, -0.2026521861553192]
[-0.033909425139427185, -0.11941321194171906, -0.7864583730697632, -0.24582073092460632, -0.17017485201358795, 0.0760011374950409, 0.3244125545024872, -0.38041409850120544]
[-0.0339287631213665, -0.11942094564437866, -0.7864440083503723, -0.24585771560668945, -0.17014935612678528, 0.07596983015537262, 0.32442623376846313, -0.38041090965270996]
[-0.03392859548330307, -0.11942102015018463, -0.7864437699317932, -0.24585801362991333, -0.17014899849891663, 0.07596953213214874, 0.324426531791687, -0.380410760641098]
[-0.46300897002220154, -0.14666011929512024, -0.646159291267395, 0.10496828705072403, -0.528754711151123, 0.09720712900161743, 0.7127938866615295, 0.019480813294649124]
[-0.033925510942935944, -0.1194162666797638, -0.7864490151405334, -0.24582067131996155, -0.17019721865653992, 0.07600662112236023, 0.32440948486328125, -0.38042333722114563]
[-0.03392315283417702, -0.1194247156381607, -0.786466121673584, -0.2458271086215973, -0.1701907217502594, 0.07600513100624084, 0.3244108259677887, -0.38043472170829773]
[-0.35006847977638245, -0.09708067774772644, -0.7097396850585938, 0.1826569139957428, -0.36089810729026794, 0.19932880997657776, 0.8620818853378296, 0.2980068325996399]
Similarity between 0 and 1: 0.41
Similarity between 0 and 2: 0.41
Similarity between 0 and 3: 0.41
Similarity between 0 and 4: 0.76
Similarity between 0 and 5: 0.41
Similarity between 0 and 6: 0.41
Similarity between 0 and 7: 0.69
Similarity between 1 and 2: 1.00
Similarity between 1 and 3: 1.00
Similarity between 1 and 4: 0.47
Similarity between 1 and 5: 1.00
Similarity between 1 and 6: 1.00
Similarity between 1 and 7: 0.45
Similarity between 2 and 3: 1.00
Similarity between 2 and 4: 0.47
Similarity between 2 and 5: 1.00
Similarity between 2 and 6: 1.00
Similarity between 2 and 7: 0.45
Similarity between 3 and 4: 0.47
Similarity between 3 and 5: 1.00
Similarity between 3 and 6: 1.00
Similarity between 3 and 7: 0.45
Similarity between 4 and 5: 0.47
Similarity between 4 and 6: 0.47
Similarity between 4 and 7: 0.95
Similarity between 5 and 6: 1.00
Similarity between 5 and 7: 0.45
Similarity between 6 and 7: 0.45

It should return a similarity of 1.00 between all pairs, since every prompt is now identical.
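For reference, here is a minimal sequential version of the same check (a sketch, not part of the repository). It assumes the server from the command above is running on port 6900 and that the /embedding endpoint returns a JSON object with an "embedding" array, as the script above expects. Because the requests are sent one at a time it will not reproduce the concurrent-slot bug, but it documents the expected result: every pair should score roughly 1.00.

```python
import math

import requests

MODEL_URL = "http://127.0.0.1:6900"  # server started with --embedding --port 6900
N = 8                                # same number of requests as the failing test


def cosine(a: list[float], b: list[float]) -> float:
    # plain cosine similarity, no external dependencies
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


embeddings = []
for _ in range(N):
    # identical prompt for every request, like the patched script above
    resp = requests.post(f"{MODEL_URL}/embedding", json={"content": str(0) * 1024})
    resp.raise_for_status()
    embeddings.append(resp.json()["embedding"])  # assumed response field

for i in range(N):
    for j in range(i + 1, N):
        sim = cosine(embeddings[i], embeddings[j])
        print(f"Similarity between {i} and {j}: {sim:.2f}")
        # identical prompts should give (nearly) identical embeddings
        assert sim > 0.99
```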
@ggerganov Could you please not "remove multimodal capabilities" from the server? Many blind users rely on it for image description.
@ggerganov Could you please upload the embedding model to ggml-org/models? The one I found (CompendiumLabs/embedding-models-english) cannot tokenize anymore: tokenize error to_lower(104)
@phymbert The prompt
@ggerganov Thanks, yes, but if you increase the KV cache size to 2048, it does not pass anymore.
Can you find out why? It works for me.
slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
@ggerganov For OAI Completions, "n_keep" is not set in the json data, so it's always 0 and it triggers context shifting with n_keep=1:

llama.cpp/examples/server/server.cpp, line 1609 in 87a4a10:
const int n_keep = slot.params.n_keep + add_bos_token;

llama.cpp/examples/server/server.cpp, line 1749 in 87a4a10:
if (slot.params.n_keep < 0) {

Or is there something I don't understand?
Hm, yes - likely this has to default to default_params.n_keep
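To make the fix concrete, below is a hypothetical Python sketch of the lookup order being discussed; the real code is C++ in examples/server/server.cpp, and the concrete values here (such as the -1 default) are invented for illustration. With the current fallback, an OAI completion request that omits "n_keep" keeps the slot's reset value of 0; falling back to the server-wide default instead preserves the configured behaviour.

```python
# Hypothetical illustration of the n_keep fallback discussed above; not the
# actual C++ implementation, and the concrete values are made up.

def json_value(data: dict, key: str, default):
    # mirrors the role of the server's json_value helper: request value or fallback
    return data.get(key, default)


default_params = {"n_keep": -1}    # assumed server-wide default (e.g. "keep all")
slot_params = {"n_keep": 0}        # a freshly reset slot

oai_request = {"prompt": "hello"}  # OAI completions do not send "n_keep"

# current behaviour: fall back to the slot's own (reset) value -> stays 0,
# so the context shift later runs with n_keep = 0 + add_bos_token = 1
slot_params["n_keep"] = json_value(oai_request, "n_keep", slot_params["n_keep"])
print(slot_params["n_keep"])  # 0

# suggested behaviour: fall back to default_params.n_keep instead
slot_params["n_keep"] = json_value(oai_request, "n_keep", default_params["n_keep"])
print(slot_params["n_keep"])  # -1
```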
It's likely that I broke something, but merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment))
It looks like server.exe won't open after this PR with ROCm on Windows.
* server : refactoring (wip)
* server : remove llava/clip objects from build
* server : fix empty prompt handling + all slots idle logic
* server : normalize id vars
* server : code style
* server : simplify model chat template validation
* server : code style
* server : minor
* llama : llama_chat_apply_template support null buf
* server : do not process embedding requests when disabled
* server : reorganize structs and enums + naming fixes
* server : merge oai.hpp in utils.hpp
* server : refactor system prompt update at start
* server : disable cached prompts with self-extend
* server : do not process more than n_batch tokens per iter
* server: tests: embeddings use a real embeddings model (ggerganov#5908)
* server, tests : bump batch to fit 1 embedding prompt
* server: tests: embeddings fix build type Debug is randomly failing (ggerganov#5911)
* server: tests: embeddings, use different KV Cache size
* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings
* server: tests: embeddings, no need to wait for server idle as it can timout
* server: refactor: clean up http code (ggerganov#5912)
* server : avoid n_available var ggml-ci
* server: refactor: better http codes
* server : simplify json parsing + add comment about t_last
* server : rename server structs
* server : allow to override FQDN in tests ggml-ci
* server : add comments

---------
Co-authored-by: Pierrick Hymbert <[email protected]>
...aaand my library of archived llama.cpp versions just gained another entry
Hmm, wouldn't it be wiser to implement it properly before removing it?
I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?
Sorry for that, but the previous implementation was making it impossible to refactor the code properly and it was causing performance issues, so this temporary removal is for the good of server adoption. The removal has been tracked in:
Feel free to contribute.
For @chigkim, and anyone else with a use case requiring multimodal capabilities (e.g. image descriptions for blind users): not exactly a workaround, but the last release supporting this seems to be b2356, so you can link to that in your app's documentation, which is what I'm having to do in the meantime.
Really waiting for multimodal capabilities to come back! And please update the example docs; they still mention --mmproj
Also removed LLM logic from alerts_controller.go. Note that llama.cpp's server does not currently support multimodal requests: ggerganov/llama.cpp#5882
ref #4216
Moved the code around so that logically similar things are closer together and did some renaming.

The cache_tokens management should be improved - it's now updated only when params.cache_tokens == true. It was also misused to count the number of tokens that have been processed - use n_past instead.

The context shift and self-extend logic is still very ugly - need to simplify this, but it will be done in another PR because the changes are getting too big in this one. Thinking about merging it within llama_sampling_context, but the system_prompt is making things difficult. Probably a first step would be to simplify the system_prompt management and then try to refactor the rest.

Left for follow-up PRs:
* Move ... to common and reuse across examples
* Do not process more than n_batch tokens per iter (sketched below). Need better slot managing logic. This will fix batched embeddings and long-blocking requests
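To make the n_batch point concrete, here is an illustrative, self-contained Python sketch of the scheduling idea (not the server's actual C++ implementation, and the slot contents are dummy data): each iteration submits at most n_batch tokens across the slots and carries the remainder over to later iterations, so a long prompt is spread over several passes instead of being decoded in one oversized call.

```python
# Illustrative sketch of "do not process more than n_batch tokens per iter";
# not the server's actual C++ implementation. Slot contents are dummy token ids.

N_BATCH = 512

# pending prompt tokens per slot (slot 0 has a long prompt, e.g. a big embedding request)
slots = {
    0: list(range(1300)),
    1: list(range(200)),
    2: list(range(900)),
}

iteration = 0
while any(slots.values()):
    batch = []
    for slot_id, pending in slots.items():
        room = N_BATCH - len(batch)   # never exceed n_batch tokens per iteration
        if room == 0:
            break
        take, slots[slot_id] = pending[:room], pending[room:]
        batch.extend(take)

    iteration += 1
    remaining = {slot_id: len(tokens) for slot_id, tokens in slots.items()}
    print(f"iter {iteration}: decoded {len(batch)} tokens, remaining {remaining}")
```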