Request failed during generation: Server error: Value out of range: -29146814772 #2994

Open

AlperYildirim1 opened this issue Feb 5, 2025 · 0 comments
System Info

text-generation-launcher 3.1.1-dev0
Single RTX 4070 S GPU
NVIDIA-SMI 572.16 Driver Version: 572.16 CUDA Version: 12.8
Models Used : meta-llama/Llama-3.1-8B-Instruct, Yujivus/DeepSeek-R1-Distill-Llama-8B-AWQ, Yujivus/Phi-4-Health-CoT-1.1-AWQ

Docker Command:
docker run --name tgi-server --gpus all -p 80:81 --network tgi -v volume:/data --env HUGGING_FACE_HUB_TOKEN=... ghcr.io/huggingface/text-generation-inference:latest --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ --quantize awq
(The awq quantization flag was used only for the Yujivus/DeepSeek-R1-Distill-Llama-8B-AWQ and Yujivus/Phi-4-Health-CoT-1.1-AWQ models. For meta-llama/Llama-3.1-8B-Instruct, I tried eetq, bitsandbytes-nf4, and bitsandbytes-fp4. None of them worked.)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I am running a simple function for inference:

import asyncio
from huggingface_hub import AsyncInferenceClient

async def get_response(query):
    """Send a single request to the TGI server."""
    try:
        async with AsyncInferenceClient(base_url="http://tgi-server:80") as client:
            output = await client.chat.completions.create(
                model="tgi",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful assistant\n\n",
                    },
                    {
                        "role": "user",
                        "content": query,
                    },
                ],
                stream=False,
                max_tokens=3000,
            )
            return output.choices[0].message.content
    except Exception as e:
        print(f"Error: {e}")
        return None
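For context, this coroutine is called once per request. A minimal driver sketch of how it might be invoked repeatedly (the queries here are hypothetical; my real test set differs) looks like this:

# Hypothetical driver, just to show how get_response is called repeatedly.
async def main(queries):
    results = [await get_response(q) for q in queries]
    failed = sum(r is None for r in results)
    print(f"{failed}/{len(queries)} requests failed")

asyncio.run(main(["What are the symptoms of iron deficiency?"] * 20))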

Error : 2025-02-05 16:11:02 2025-02-05T13:11:02.646154Z ERROR text_generation_launcher: Method Decode encountered an error.
2025-02-05 16:11:02 Traceback (most recent call last):
2025-02-05 16:11:02 File "/opt/conda/bin/text-generation-server", line 10, in <module>
2025-02-05 16:11:02 sys.exit(app())
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
2025-02-05 16:11:02 return get_command(self)(*args, **kwargs)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
2025-02-05 16:11:02 return self.main(*args, **kwargs)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
2025-02-05 16:11:02 return _main(
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
2025-02-05 16:11:02 rv = self.invoke(ctx)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
2025-02-05 16:11:02 return _process_result(sub_ctx.command.invoke(sub_ctx))
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
2025-02-05 16:11:02 return ctx.invoke(self.callback, **ctx.params)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
2025-02-05 16:11:02 return __callback(*args, **kwargs)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
2025-02-05 16:11:02 return callback(**use_params)
2025-02-05 16:11:02 File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
2025-02-05 16:11:02 server.serve(
2025-02-05 16:11:02 File "/usr/src/server/text_generation_server/server.py", line 315, in serve
2025-02-05 16:11:02 asyncio.run(
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
2025-02-05 16:11:02 return runner.run(main)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
2025-02-05 16:11:02 return self._loop.run_until_complete(task)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
2025-02-05 16:11:02 self.run_forever()
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
2025-02-05 16:11:02 self._run_once()
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
2025-02-05 16:11:02 handle._run()
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
2025-02-05 16:11:02 self._context.run(self._callback, *self._args)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2025-02-05 16:11:02 return await self.intercept(
2025-02-05 16:11:02 > File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
2025-02-05 16:11:02 return await response
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
2025-02-05 16:11:02 raise error
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
2025-02-05 16:11:02 return await behavior(request_or_iterator, context)
2025-02-05 16:11:02 File "/usr/src/server/text_generation_server/server.py", line 221, in Decode
2025-02-05 16:11:02 return generate_pb2.DecodeResponse(
2025-02-05 16:11:02 ValueError: Value out of range: -29146814772
2025-02-05 16:11:02 2025-02-05T13:11:02.646395Z ERROR batch{batch_size=1}:decode:decode{size=1}:decode{size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Value out of range: -29146814772
2025-02-05 16:11:02 2025-02-05T13:11:02.647196Z ERROR chat_completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: Value out of range: -29146814772

Only 40-50% of the requests to TGI return a successful response (for Yujivus/Phi-4-Health-CoT-1.1-AWQ); the rest fail. I tried several models and hit the same error for all of them except the smaller Llama 3.2 1B and 3B models. The highest success rate I observed was with the EETQ quantization method on the Llama 3.1 8B model, but even then the error occurred often.
The Phi-4 model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset. Since it is a CoT dataset, the model produces very long responses, and the longer the response, the more errors I get: every generated token seems to carry some chance of triggering the value-out-of-range error. Quantization was my first suspect, so I tested the Llama 3.2 1B model both with and without quantization and never hit the error there. It may still be that for larger models, which do much more computation during the decode phase, quantization pushes some value outside the expected range (see the sketch below).
I tried smaller max_tokens as well, but the per-token probability of failure did not change; the error occurs roughly once per 2,000 generated tokens.
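For what it's worth, the message looks like protobuf's integer range check firing while generate_pb2.DecodeResponse is being built in server.py. Below is only a minimal sketch of that mechanism outside of TGI, using Timestamp.nanos (an int32 field) as a stand-in for whichever DecodeResponse field receives the bad value; it is not the actual TGI code path:

# Illustration only: assigning an integer that does not fit a protobuf field
# raises a ValueError much like the one in the traceback above. The exact
# wording can vary with the installed protobuf implementation.
from google.protobuf import timestamp_pb2

msg = timestamp_pb2.Timestamp()
try:
    msg.nanos = -29146814772  # nanos is declared int32; this value cannot fit
except ValueError as e:
    print(e)  # e.g. "Value out of range: -29146814772"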
If there is a way to fix this, it would be good to know. Thanks.

Expected behavior

2025-02-05 16:49:39 2025-02-05T13:49:39.259073Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 1045
2025-02-05 16:49:54 2025-02-05T13:49:54.901547Z INFO chat_completions{total_time="15.645054793s" validation_time="2.504013ms" queue_time="550.086µs" inference_time="15.642000896s" time_per_token="21.876924ms" seed="Some(15551805739611344766)"}: text_generation_router::server: router/src/server.rs:625: Success
