The vLLM backend supports sending additional outputs from vLLM on top of the usual `text_output` when requested.

All additional outputs are disabled by default and need to be enabled on a per-request basis. If enabled, the corresponding output tensor will be set for all responses from the request. The following additional outputs are supported.

Finish reason: the reason why the sequence finished. See the vLLM documentation for more details. To enable it, set the `return_finish_reason` input tensor to `True`; the reason will be sent as a string on the `finish_reason` output tensor, as shown in the example at the end of this section. Supported since r24.12.

Cumulative log probability: the cumulative log probability of the generated output text. See the vLLM documentation for more details. To enable it, set the `return_cumulative_logprob` input tensor to `True`; the floating point value will be sent on the `cumulative_logprob` output tensor. Supported since r24.12.
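
As a rough, self-contained sketch (following the same client pattern as the finish reason example at the end of this section, with the same placeholder model name, prompt, and server address), a request for the cumulative log probability could look like this:

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

# Ask for the cumulative log probability on every response.
inputs.append(grpcclient.InferInput("return_cumulative_logprob", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    if error is not None:
        print(error)
        return
    # The value arrives on the "cumulative_logprob" output tensor.
    print(result.as_numpy(name="cumulative_logprob"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs)
    client.stop_stream()
```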

Number of output tokens: the number of token IDs of the generated output text sent on this response, i.e. the difference between the total number of token IDs generated as of this response and the total as of the previous response; for the first response, the previous total is taken to be zero. See the vLLM documentation for more details on the token IDs of the generated output text. To enable it, set the `return_num_output_tokens` input tensor to `True`; the unsigned integer value will be sent on the `num_output_tokens` output tensor. Supported since r24.12.
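
Because the value is a per-response count, a client reading streamed responses can sum it to recover the total number of generated tokens. The sketch below is an illustration under two assumptions not guaranteed by the description above: that the backend's optional `stream` boolean input is available on the deployed model, and that `num_output_tokens` arrives with shape `[1]`.

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

# Assumption: stream partial responses so each response carries only the
# tokens generated since the previous one.
inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

inputs.append(grpcclient.InferInput("return_num_output_tokens", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

total_tokens = 0

def callback(result, error):
    global total_tokens
    if error is not None:
        print(error)
        return
    # "num_output_tokens" holds the count for this response only;
    # summing the counts gives the total length of the generated text.
    counts = result.as_numpy(name="num_output_tokens")
    total_tokens += int(counts[0])

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs)
    client.stop_stream()

print(f"total generated tokens: {total_tokens}")
```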

For example, the following client requests the finish reason in addition to the generated text:

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

# Ask for the finish reason on every response.
inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    if error is not None:
        print(error)
        return
    print(result.as_numpy(name="finish_reason"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs)
    client.stop_stream()
```
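
The same pattern applies to the other additional outputs described above: add the corresponding `return_*` boolean input to the request and read the matching output tensor (for example `cumulative_logprob` or `num_output_tokens`) in the callback.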

- Enabling additional outputs may impact performance; only request additional outputs when necessary.