System Info
text-generation-launcher 3.1.1-dev0
Single RTX 4070 S GPU
NVIDIA-SMI 572.16 Driver Version: 572.16 CUDA Version: 12.8
Models used: meta-llama/Llama-3.1-8B-Instruct, Yujivus/DeepSeek-R1-Distill-Llama-8B-AWQ, Yujivus/Phi-4-Health-CoT-1.1-AWQ
Docker Command:
docker run --name tgi-server --gpus all -p 80:81 --network tgi -v volume:/data --env HUGGING_FACE_HUB_TOKEN=... ghcr.io/huggingface/text-generation-inference:latest --model-id Yujivus/Phi-4-Health-CoT-1.1-AWQ --quantize awq
(--quantize awq is used only for the Yujivus/DeepSeek-R1-Distill-Llama-8B-AWQ and Yujivus/Phi-4-Health-CoT-1.1-AWQ models. For meta-llama/Llama-3.1-8B-Instruct, I tried eetq, bitsandbytes-nf4, and bitsandbytes-fp4; none of them resolved the error.)
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
I am running a simple function for inference:
from huggingface_hub import AsyncInferenceClient  # client for TGI's OpenAI-compatible chat API

async def get_response(query):
    """Send a single chat-completion request to the TGI server."""
    try:
        async with AsyncInferenceClient(base_url="http://tgi-server:80") as client:
            output = await client.chat.completions.create(
                model="tgi",
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful assistant\n\n",
                    },
                    {
                        "role": "user",
                        "content": query,
                    },
                ],
                stream=False,
                max_tokens=3000,
            )
            return output.choices[0].message.content
    except Exception as e:
        # Failed requests surface here, e.g. "Request failed during generation:
        # Server error: Value out of range: ..."
        print(f"Error: {e}")
        return None
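For context, here is a minimal sketch of how this function gets exercised; the concurrent driver and the example queries are my own assumptions, not part of the original setup:

import asyncio

async def main():
    # Hypothetical queries; in practice these come from my evaluation set.
    queries = ["What causes hypertension?", "Explain type 2 diabetes."]
    responses = await asyncio.gather(*(get_response(q) for q in queries))
    ok = sum(r is not None for r in responses)
    print(f"{ok}/{len(queries)} requests succeeded")  # only ~40-50% succeed for the Phi-4 AWQ model

asyncio.run(main())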
Error:
2025-02-05 16:11:02 2025-02-05T13:11:02.646154Z ERROR text_generation_launcher: Method Decode encountered an error.
2025-02-05 16:11:02 Traceback (most recent call last):
2025-02-05 16:11:02 File "/opt/conda/bin/text-generation-server", line 10, in <module>
2025-02-05 16:11:02 sys.exit(app())
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
2025-02-05 16:11:02 return get_command(self)(*args, **kwargs)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
2025-02-05 16:11:02 return self.main(*args, **kwargs)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
2025-02-05 16:11:02 return _main(
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
2025-02-05 16:11:02 rv = self.invoke(ctx)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
2025-02-05 16:11:02 return _process_result(sub_ctx.command.invoke(sub_ctx))
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
2025-02-05 16:11:02 return ctx.invoke(self.callback, **ctx.params)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
2025-02-05 16:11:02 return __callback(*args, **kwargs)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
2025-02-05 16:11:02 return callback(**use_params)
2025-02-05 16:11:02 File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
2025-02-05 16:11:02 server.serve(
2025-02-05 16:11:02 File "/usr/src/server/text_generation_server/server.py", line 315, in serve
2025-02-05 16:11:02 asyncio.run(
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
2025-02-05 16:11:02 return runner.run(main)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
2025-02-05 16:11:02 return self._loop.run_until_complete(task)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
2025-02-05 16:11:02 self.run_forever()
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
2025-02-05 16:11:02 self._run_once()
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
2025-02-05 16:11:02 handle._run()
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
2025-02-05 16:11:02 self._context.run(self._callback, *self._args)
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
2025-02-05 16:11:02 return await self.intercept(
2025-02-05 16:11:02 > File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
2025-02-05 16:11:02 return await response
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
2025-02-05 16:11:02 raise error
2025-02-05 16:11:02 File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
2025-02-05 16:11:02 return await behavior(request_or_iterator, context)
2025-02-05 16:11:02 File "/usr/src/server/text_generation_server/server.py", line 221, in Decode
2025-02-05 16:11:02 return generate_pb2.DecodeResponse(
2025-02-05 16:11:02 ValueError: Value out of range: -29146814772
2025-02-05 16:11:02 2025-02-05T13:11:02.646395Z ERROR batch{batch_size=1}:decode:decode{size=1}:decode{size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Value out of range: -29146814772
2025-02-05 16:11:02 2025-02-05T13:11:02.647196Z ERROR chat_completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: Value out of range: -29146814772
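For what it's worth, the failing value is far outside any 32-bit integer range, which looks like a corrupted token id (or a similar field) being serialized into generate_pb2.DecodeResponse; the exact field type is my assumption, not something I verified in the proto. A minimal check:

# Minimal sketch, assuming the offending DecodeResponse field is a 32-bit
# protobuf integer (e.g. a token id). Protobuf raises
# "ValueError: Value out of range" for values that do not fit the field type.
bad_value = -29146814772                   # value from the traceback above
print(-(2**31) <= bad_value <= 2**31 - 1)  # False: does not fit a signed 32-bit int
print(0 <= bad_value <= 2**32 - 1)         # False: nor an unsigned one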
Only 40-50% of the requests to TGI return a successful response (for Yujivus/Phi-4-Health-CoT-1.1-AWQ); the rest fail. I tried several models and encountered the same error for all of them except the smaller Llama 3.2 1B and 3B models. The highest success rate I observed was with the EETQ quantization method on the Llama 3.1 8B model, but even then I hit this error often.
The Phi-4 model is fine-tuned on the FreedomIntelligence/medical-o1-reasoning-SFT dataset. Since this is a CoT dataset, the model produces very long responses, and the longer a response runs, the more errors I get, presumably because every decoded token carries some chance of triggering the value-out-of-range error. Quantization may be the cause, but to test this I tried the Llama 3.2 1B model both with and without quantization and never hit the error. Still, it may be that for larger models, which do much more computation during the decode phase, quantization pushes some values outside the expected range.
I also tried smaller max_tokens values, but the per-token probability of the error did not change; it occurs for roughly 1 in 2000 tokens.
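As a sanity check on that rate: a simple independence model, where each decoded token fails with probability p ≈ 1/2000, is consistent with the 40-50% success rate I see for long CoT responses. This is a back-of-the-envelope sketch, not a measurement:

# Back-of-the-envelope sketch: if each decoded token independently fails with
# probability p, a response of n tokens completes with probability (1 - p)**n.
p = 1 / 2000
for n in (500, 1500, 3000):           # 3000 = max_tokens used above
    print(n, round((1 - p) ** n, 2))  # ~0.78, ~0.47, ~0.22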
If there is a way to fix this, it would be good to know. Thanks.
Expected behavior
2025-02-05 16:49:39 2025-02-05T13:49:39.259073Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 1045
2025-02-05 16:49:54 2025-02-05T13:49:54.901547Z INFO chat_completions{total_time="15.645054793s" validation_time="2.504013ms" queue_time="550.086µs" inference_time="15.642000896s" time_per_token="21.876924ms" seed="Some(15551805739611344766)"}: text_generation_router::server: router/src/server.rs:625: Success
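For scale, the timings in this success log imply a response of roughly 700 generated tokens (a quick derivation from the logged fields, nothing more):

# Rough consistency check using the timings from the success log above.
inference_time_s = 15.642000896
time_per_token_s = 0.021876924
print(round(inference_time_s / time_per_token_s))  # ~715 generated tokens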