TGIS metrics #18

Merged
joerunde merged 5 commits into main from tgis-metrics on Apr 18, 2024
Conversation

@joerunde (Collaborator) commented Apr 16, 2024

This PR implements a subset of the metrics from the TGIS image. I tried to make sure that everything from our current ops dashboard is supported. These are:

  • tgi_tokenize_request_tokens
  • tgi_tokenize_request_input_count
  • tgi_request_input_count
  • tgi_request_failure
  • tgi_request_queue_duration
  • tgi_queue_size
  • tgi_batch_current_size
  • tgi_batch_inference_duration
  • tgi_request_input_length
  • tgi_request_generated_tokens

I co-located all of the TGIS metrics code in tgis_utils/metrics.py to make it easy to find and change. The metrics are reported either:

  • Directly by our code in grpc_server.py, for data that's easily available at the gRPC server level and not currently covered by the vLLM StatLogger, or
  • By a TGISStatLogger that wraps the vLLM engine's StatLogger and is injected into the engine (see the sketch below)
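As a rough illustration of that second path, here is a minimal sketch of the wrapper pattern, assuming a StatLogger-style object with a log(stats) method. This is not the PR's actual code: the wrapped logger's interface and the stats attribute names are assumptions, and only two of the TGI metrics are shown.

# Minimal sketch of the TGISStatLogger wrapper pattern (not the PR's exact code).
# The wrapped logger's interface and the stats attribute names are assumptions.
from prometheus_client import Gauge

class TGISStatLogger:
    """Wraps the engine's StatLogger and mirrors its stats into TGI-named metrics."""

    def __init__(self, vllm_stat_logger, max_sequence_len):
        self._vllm_stat_logger = vllm_stat_logger
        self._max_sequence_len = max_sequence_len  # e.g. for sizing length buckets
        self._queue_size = Gauge("tgi_queue_size", "Current number of queued requests")
        self._batch_size = Gauge("tgi_batch_current_size", "Current batch size")

    def log(self, stats) -> None:
        # Keep the regular vLLM metrics flowing...
        self._vllm_stat_logger.log(stats)
        # ...and also update the TGI-named equivalents.
        self._queue_size.set(stats.num_waiting)  # attribute name assumed
        self._batch_size.set(stats.num_running)  # attribute name assumed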

The token length metrics depend on the open PR here: vllm-project/vllm#2764 (the rest of the metrics work without those changes).
They could have been implemented directly in the gRPC server, but I wanted to keep this metrics reporting aligned with those upcoming changes.
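For the per-request histograms specifically, here is a hedged sketch of how they could be observed once per-request token counts are available from the engine; the observe_finished_request helper and its arguments are invented for illustration, and the buckets simply mirror the 32..2048 layout visible in the /metrics output below.

# Hedged sketch only; observe_finished_request and its arguments are illustrative.
from prometheus_client import Histogram

TOKEN_BUCKETS = [32.0 * i for i in range(1, 65)]  # 32 .. 2048, matching the output below

request_input_length = Histogram(
    "tgi_request_input_length", "Request input length in tokens",
    buckets=TOKEN_BUCKETS)
request_generated_tokens = Histogram(
    "tgi_request_generated_tokens", "Number of tokens generated for request",
    buckets=TOKEN_BUCKETS)

def observe_finished_request(num_prompt_tokens: int, num_generated_tokens: int) -> None:
    request_input_length.observe(num_prompt_tokens)
    request_generated_tokens.observe(num_generated_tokens)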

@joerunde (Collaborator, Author) commented:

The metrics look pretty good; the only things that I can't trigger at the moment are:

  • vLLM never seems to report any requests in the queue, even when I try limiting the max batched tokens
  • I still need to see if I can squash 2764 into a test build so that I can get the request/response length metrics tested.

Output from /metrics:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7522.0
python_gc_objects_collected_total{generation="1"} 4938.0
python_gc_objects_collected_total{generation="2"} 653.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 3860.0
python_gc_collections_total{generation="1"} 349.0
python_gc_collections_total{generation="2"} 52.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="11",patchlevel="8",version="3.11.8"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.5157687296e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.25778176e+09
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71338491607e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 113.97999999999999
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 75.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP tgi_tokenize_request_tokens Histogram of tokenized tokens per tokenize request
# TYPE tgi_tokenize_request_tokens histogram
tgi_tokenize_request_tokens_bucket{le="64.0"} 6.0
tgi_tokenize_request_tokens_bucket{le="128.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="256.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="512.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="1024.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="2048.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="4096.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="8192.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="16384.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="32768.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="65536.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="131072.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="262144.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="524288.0"} 12.0
tgi_tokenize_request_tokens_bucket{le="+Inf"} 12.0
tgi_tokenize_request_tokens_count 12.0
tgi_tokenize_request_tokens_sum 786.0
# HELP tgi_tokenize_request_input_count_total Count of tokenize request inputs (batch of n counts as n)
# TYPE tgi_tokenize_request_input_count_total counter
tgi_tokenize_request_input_count_total 12.0
# HELP tgi_request_input_count_total Count of generate request inputs (batch of n counts as n)
# TYPE tgi_request_input_count_total counter
tgi_request_input_count_total 215.0
# HELP tgi_request_failure_total Count of failed requests, segmented by error type
# TYPE tgi_request_failure_total counter
# HELP tgi_request_queue_duration Request time spent in queue (in seconds)
# TYPE tgi_request_queue_duration histogram
tgi_request_queue_duration_bucket{le="0.001"} 2271.0
tgi_request_queue_duration_bucket{le="0.002"} 9375.0
tgi_request_queue_duration_bucket{le="0.005"} 44413.0
tgi_request_queue_duration_bucket{le="0.01"} 99595.0
tgi_request_queue_duration_bucket{le="0.02"} 178054.0
tgi_request_queue_duration_bucket{le="0.05"} 186893.0
tgi_request_queue_duration_bucket{le="0.1"} 189965.0
tgi_request_queue_duration_bucket{le="0.2"} 189965.0
tgi_request_queue_duration_bucket{le="0.5"} 190989.0
tgi_request_queue_duration_bucket{le="1.0"} 190989.0
tgi_request_queue_duration_bucket{le="2.0"} 190989.0
tgi_request_queue_duration_bucket{le="5.0"} 190989.0
tgi_request_queue_duration_bucket{le="10.0"} 190989.0
tgi_request_queue_duration_bucket{le="20.0"} 190989.0
tgi_request_queue_duration_bucket{le="50.0"} 190989.0
tgi_request_queue_duration_bucket{le="+Inf"} 190989.0
tgi_request_queue_duration_count 190989.0
tgi_request_queue_duration_sum 2315.321170568466
# HELP vllm:cache_config_info information of cache_config
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",enable_prefix_caching="False",gpu_memory_utilization="0.9",num_cpu_blocks="2730",num_gpu_blocks="9670",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="bigscience/bloom-560m"} 0.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="bigscience/bloom-560m"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="bigscience/bloom-560m"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="bigscience/bloom-560m"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="bigscience/bloom-560m"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="bigscience/bloom-560m"} 1196.0
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="bigscience/bloom-560m"} 218152.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="bigscience/bloom-560m"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="bigscience/bloom-560m"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="bigscience/bloom-560m"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="bigscience/bloom-560m"} 112.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="bigscience/bloom-560m"} 211.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="bigscience/bloom-560m"} 211.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="bigscience/bloom-560m"} 213.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="bigscience/bloom-560m"} 214.0
vllm:time_to_first_token_seconds_bucket{le="0.25",model_name="bigscience/bloom-560m"} 214.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="0.75",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="1.0",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="2.5",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="5.0",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="7.5",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="10.0",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_count{model_name="bigscience/bloom-560m"} 215.0
vllm:time_to_first_token_seconds_sum{model_name="bigscience/bloom-560m"} 4.80142879486084
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_bucket{le="0.01",model_name="bigscience/bloom-560m"} 6053.0
vllm:time_per_output_token_seconds_bucket{le="0.025",model_name="bigscience/bloom-560m"} 179463.0
vllm:time_per_output_token_seconds_bucket{le="0.05",model_name="bigscience/bloom-560m"} 217009.0
vllm:time_per_output_token_seconds_bucket{le="0.075",model_name="bigscience/bloom-560m"} 217181.0
vllm:time_per_output_token_seconds_bucket{le="0.1",model_name="bigscience/bloom-560m"} 217237.0
vllm:time_per_output_token_seconds_bucket{le="0.15",model_name="bigscience/bloom-560m"} 217450.0
vllm:time_per_output_token_seconds_bucket{le="0.2",model_name="bigscience/bloom-560m"} 217896.0
vllm:time_per_output_token_seconds_bucket{le="0.3",model_name="bigscience/bloom-560m"} 217896.0
vllm:time_per_output_token_seconds_bucket{le="0.4",model_name="bigscience/bloom-560m"} 217896.0
vllm:time_per_output_token_seconds_bucket{le="0.5",model_name="bigscience/bloom-560m"} 217896.0
vllm:time_per_output_token_seconds_bucket{le="0.75",model_name="bigscience/bloom-560m"} 217937.0
vllm:time_per_output_token_seconds_bucket{le="1.0",model_name="bigscience/bloom-560m"} 217937.0
vllm:time_per_output_token_seconds_bucket{le="2.5",model_name="bigscience/bloom-560m"} 217937.0
vllm:time_per_output_token_seconds_bucket{le="+Inf",model_name="bigscience/bloom-560m"} 217937.0
vllm:time_per_output_token_seconds_count{model_name="bigscience/bloom-560m"} 217937.0
vllm:time_per_output_token_seconds_sum{model_name="bigscience/bloom-560m"} 4562.833616018295
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_bucket{le="1.0",model_name="bigscience/bloom-560m"} 2.0
vllm:e2e_request_latency_seconds_bucket{le="2.5",model_name="bigscience/bloom-560m"} 2.0
vllm:e2e_request_latency_seconds_bucket{le="5.0",model_name="bigscience/bloom-560m"} 2.0
vllm:e2e_request_latency_seconds_bucket{le="10.0",model_name="bigscience/bloom-560m"} 5.0
vllm:e2e_request_latency_seconds_bucket{le="15.0",model_name="bigscience/bloom-560m"} 16.0
vllm:e2e_request_latency_seconds_bucket{le="20.0",model_name="bigscience/bloom-560m"} 40.0
vllm:e2e_request_latency_seconds_bucket{le="30.0",model_name="bigscience/bloom-560m"} 215.0
vllm:e2e_request_latency_seconds_bucket{le="40.0",model_name="bigscience/bloom-560m"} 215.0
vllm:e2e_request_latency_seconds_bucket{le="50.0",model_name="bigscience/bloom-560m"} 215.0
vllm:e2e_request_latency_seconds_bucket{le="60.0",model_name="bigscience/bloom-560m"} 215.0
vllm:e2e_request_latency_seconds_bucket{le="+Inf",model_name="bigscience/bloom-560m"} 215.0
vllm:e2e_request_latency_seconds_count{model_name="bigscience/bloom-560m"} 215.0
vllm:e2e_request_latency_seconds_sum{model_name="bigscience/bloom-560m"} 4567.635044813156
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="bigscience/bloom-560m"} 0.0
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="bigscience/bloom-560m"} 0.0
# HELP tgi_queue_size Current number of queued requests
# TYPE tgi_queue_size gauge
tgi_queue_size 0.0
# HELP tgi_batch_current_size Current batch size
# TYPE tgi_batch_current_size gauge
tgi_batch_current_size 0.0
# HELP tgi_batch_inference_duration Time taken for each forward-pass iteration (in seconds)
# TYPE tgi_batch_inference_duration histogram
tgi_batch_inference_duration_bucket{le="0.001",method="{'method': 'prefill'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.002",method="{'method': 'prefill'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.005",method="{'method': 'prefill'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.01",method="{'method': 'prefill'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.02",method="{'method': 'prefill'}"} 112.0
tgi_batch_inference_duration_bucket{le="0.05",method="{'method': 'prefill'}"} 211.0
tgi_batch_inference_duration_bucket{le="0.1",method="{'method': 'prefill'}"} 214.0
tgi_batch_inference_duration_bucket{le="0.2",method="{'method': 'prefill'}"} 214.0
tgi_batch_inference_duration_bucket{le="0.5",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="1.0",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="2.0",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="5.0",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="10.0",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="20.0",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="50.0",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_bucket{le="+Inf",method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_count{method="{'method': 'prefill'}"} 215.0
tgi_batch_inference_duration_sum{method="{'method': 'prefill'}"} 4.80142879486084
tgi_batch_inference_duration_bucket{le="0.001",method="{'method': 'next_token'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.002",method="{'method': 'next_token'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.005",method="{'method': 'next_token'}"} 0.0
tgi_batch_inference_duration_bucket{le="0.01",method="{'method': 'next_token'}"} 6053.0
tgi_batch_inference_duration_bucket{le="0.02",method="{'method': 'next_token'}"} 111723.0
tgi_batch_inference_duration_bucket{le="0.05",method="{'method': 'next_token'}"} 217009.0
tgi_batch_inference_duration_bucket{le="0.1",method="{'method': 'next_token'}"} 217237.0
tgi_batch_inference_duration_bucket{le="0.2",method="{'method': 'next_token'}"} 217896.0
tgi_batch_inference_duration_bucket{le="0.5",method="{'method': 'next_token'}"} 217896.0
tgi_batch_inference_duration_bucket{le="1.0",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_bucket{le="2.0",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_bucket{le="5.0",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_bucket{le="10.0",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_bucket{le="20.0",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_bucket{le="50.0",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_bucket{le="+Inf",method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_count{method="{'method': 'next_token'}"} 217937.0
tgi_batch_inference_duration_sum{method="{'method': 'next_token'}"} 4562.833616018295
# HELP tgi_request_input_length Request input length in tokens
# TYPE tgi_request_input_length histogram
tgi_request_input_length_bucket{le="32.0"} 0.0
tgi_request_input_length_bucket{le="64.0"} 0.0
tgi_request_input_length_bucket{le="96.0"} 0.0
tgi_request_input_length_bucket{le="128.0"} 0.0
tgi_request_input_length_bucket{le="160.0"} 0.0
tgi_request_input_length_bucket{le="192.0"} 0.0
tgi_request_input_length_bucket{le="224.0"} 0.0
tgi_request_input_length_bucket{le="256.0"} 0.0
tgi_request_input_length_bucket{le="288.0"} 0.0
tgi_request_input_length_bucket{le="320.0"} 0.0
tgi_request_input_length_bucket{le="352.0"} 0.0
tgi_request_input_length_bucket{le="384.0"} 0.0
tgi_request_input_length_bucket{le="416.0"} 0.0
tgi_request_input_length_bucket{le="448.0"} 0.0
tgi_request_input_length_bucket{le="480.0"} 0.0
tgi_request_input_length_bucket{le="512.0"} 0.0
tgi_request_input_length_bucket{le="544.0"} 0.0
tgi_request_input_length_bucket{le="576.0"} 0.0
tgi_request_input_length_bucket{le="608.0"} 0.0
tgi_request_input_length_bucket{le="640.0"} 0.0
tgi_request_input_length_bucket{le="672.0"} 0.0
tgi_request_input_length_bucket{le="704.0"} 0.0
tgi_request_input_length_bucket{le="736.0"} 0.0
tgi_request_input_length_bucket{le="768.0"} 0.0
tgi_request_input_length_bucket{le="800.0"} 0.0
tgi_request_input_length_bucket{le="832.0"} 0.0
tgi_request_input_length_bucket{le="864.0"} 0.0
tgi_request_input_length_bucket{le="896.0"} 0.0
tgi_request_input_length_bucket{le="928.0"} 0.0
tgi_request_input_length_bucket{le="960.0"} 0.0
tgi_request_input_length_bucket{le="992.0"} 0.0
tgi_request_input_length_bucket{le="1024.0"} 0.0
tgi_request_input_length_bucket{le="1056.0"} 0.0
tgi_request_input_length_bucket{le="1088.0"} 0.0
tgi_request_input_length_bucket{le="1120.0"} 0.0
tgi_request_input_length_bucket{le="1152.0"} 0.0
tgi_request_input_length_bucket{le="1184.0"} 0.0
tgi_request_input_length_bucket{le="1216.0"} 0.0
tgi_request_input_length_bucket{le="1248.0"} 0.0
tgi_request_input_length_bucket{le="1280.0"} 0.0
tgi_request_input_length_bucket{le="1312.0"} 0.0
tgi_request_input_length_bucket{le="1344.0"} 0.0
tgi_request_input_length_bucket{le="1376.0"} 0.0
tgi_request_input_length_bucket{le="1408.0"} 0.0
tgi_request_input_length_bucket{le="1440.0"} 0.0
tgi_request_input_length_bucket{le="1472.0"} 0.0
tgi_request_input_length_bucket{le="1504.0"} 0.0
tgi_request_input_length_bucket{le="1536.0"} 0.0
tgi_request_input_length_bucket{le="1568.0"} 0.0
tgi_request_input_length_bucket{le="1600.0"} 0.0
tgi_request_input_length_bucket{le="1632.0"} 0.0
tgi_request_input_length_bucket{le="1664.0"} 0.0
tgi_request_input_length_bucket{le="1696.0"} 0.0
tgi_request_input_length_bucket{le="1728.0"} 0.0
tgi_request_input_length_bucket{le="1760.0"} 0.0
tgi_request_input_length_bucket{le="1792.0"} 0.0
tgi_request_input_length_bucket{le="1824.0"} 0.0
tgi_request_input_length_bucket{le="1856.0"} 0.0
tgi_request_input_length_bucket{le="1888.0"} 0.0
tgi_request_input_length_bucket{le="1920.0"} 0.0
tgi_request_input_length_bucket{le="1952.0"} 0.0
tgi_request_input_length_bucket{le="1984.0"} 0.0
tgi_request_input_length_bucket{le="2016.0"} 0.0
tgi_request_input_length_bucket{le="2048.0"} 0.0
tgi_request_input_length_bucket{le="+Inf"} 0.0
tgi_request_input_length_count 0.0
tgi_request_input_length_sum 0.0
# HELP tgi_request_generated_tokens Number of tokens generated for request
# TYPE tgi_request_generated_tokens histogram
tgi_request_generated_tokens_bucket{le="32.0"} 0.0
tgi_request_generated_tokens_bucket{le="64.0"} 0.0
tgi_request_generated_tokens_bucket{le="96.0"} 0.0
tgi_request_generated_tokens_bucket{le="128.0"} 0.0
tgi_request_generated_tokens_bucket{le="160.0"} 0.0
tgi_request_generated_tokens_bucket{le="192.0"} 0.0
tgi_request_generated_tokens_bucket{le="224.0"} 0.0
tgi_request_generated_tokens_bucket{le="256.0"} 0.0
tgi_request_generated_tokens_bucket{le="288.0"} 0.0
tgi_request_generated_tokens_bucket{le="320.0"} 0.0
tgi_request_generated_tokens_bucket{le="352.0"} 0.0
tgi_request_generated_tokens_bucket{le="384.0"} 0.0
tgi_request_generated_tokens_bucket{le="416.0"} 0.0
tgi_request_generated_tokens_bucket{le="448.0"} 0.0
tgi_request_generated_tokens_bucket{le="480.0"} 0.0
tgi_request_generated_tokens_bucket{le="512.0"} 0.0
tgi_request_generated_tokens_bucket{le="544.0"} 0.0
tgi_request_generated_tokens_bucket{le="576.0"} 0.0
tgi_request_generated_tokens_bucket{le="608.0"} 0.0
tgi_request_generated_tokens_bucket{le="640.0"} 0.0
tgi_request_generated_tokens_bucket{le="672.0"} 0.0
tgi_request_generated_tokens_bucket{le="704.0"} 0.0
tgi_request_generated_tokens_bucket{le="736.0"} 0.0
tgi_request_generated_tokens_bucket{le="768.0"} 0.0
tgi_request_generated_tokens_bucket{le="800.0"} 0.0
tgi_request_generated_tokens_bucket{le="832.0"} 0.0
tgi_request_generated_tokens_bucket{le="864.0"} 0.0
tgi_request_generated_tokens_bucket{le="896.0"} 0.0
tgi_request_generated_tokens_bucket{le="928.0"} 0.0
tgi_request_generated_tokens_bucket{le="960.0"} 0.0
tgi_request_generated_tokens_bucket{le="992.0"} 0.0
tgi_request_generated_tokens_bucket{le="1024.0"} 0.0
tgi_request_generated_tokens_bucket{le="1056.0"} 0.0
tgi_request_generated_tokens_bucket{le="1088.0"} 0.0
tgi_request_generated_tokens_bucket{le="1120.0"} 0.0
tgi_request_generated_tokens_bucket{le="1152.0"} 0.0
tgi_request_generated_tokens_bucket{le="1184.0"} 0.0
tgi_request_generated_tokens_bucket{le="1216.0"} 0.0
tgi_request_generated_tokens_bucket{le="1248.0"} 0.0
tgi_request_generated_tokens_bucket{le="1280.0"} 0.0
tgi_request_generated_tokens_bucket{le="1312.0"} 0.0
tgi_request_generated_tokens_bucket{le="1344.0"} 0.0
tgi_request_generated_tokens_bucket{le="1376.0"} 0.0
tgi_request_generated_tokens_bucket{le="1408.0"} 0.0
tgi_request_generated_tokens_bucket{le="1440.0"} 0.0
tgi_request_generated_tokens_bucket{le="1472.0"} 0.0
tgi_request_generated_tokens_bucket{le="1504.0"} 0.0
tgi_request_generated_tokens_bucket{le="1536.0"} 0.0
tgi_request_generated_tokens_bucket{le="1568.0"} 0.0
tgi_request_generated_tokens_bucket{le="1600.0"} 0.0
tgi_request_generated_tokens_bucket{le="1632.0"} 0.0
tgi_request_generated_tokens_bucket{le="1664.0"} 0.0
tgi_request_generated_tokens_bucket{le="1696.0"} 0.0
tgi_request_generated_tokens_bucket{le="1728.0"} 0.0
tgi_request_generated_tokens_bucket{le="1760.0"} 0.0
tgi_request_generated_tokens_bucket{le="1792.0"} 0.0
tgi_request_generated_tokens_bucket{le="1824.0"} 0.0
tgi_request_generated_tokens_bucket{le="1856.0"} 0.0
tgi_request_generated_tokens_bucket{le="1888.0"} 0.0
tgi_request_generated_tokens_bucket{le="1920.0"} 0.0
tgi_request_generated_tokens_bucket{le="1952.0"} 0.0
tgi_request_generated_tokens_bucket{le="1984.0"} 0.0
tgi_request_generated_tokens_bucket{le="2016.0"} 0.0
tgi_request_generated_tokens_bucket{le="2048.0"} 0.0
tgi_request_generated_tokens_bucket{le="+Inf"} 0.0
tgi_request_generated_tokens_count 0.0
tgi_request_generated_tokens_sum 0.0
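
As an aside, here is a quick way to sanity-check output like the above programmatically; the localhost:8000 address is only an assumption about where the metrics endpoint is served.

# Hedged example (not part of the PR): list the tgi_* metric families exposed
# by the server. The port below is an assumption.
import urllib.request
from prometheus_client.parser import text_string_to_metric_families

body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
for family in text_string_to_metric_families(body):
    if family.name.startswith("tgi_"):
        print(family.name, family.type)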

@prashantgupta24 (Member) left a comment

LGTM!

@joerunde merged commit 8c548e4 into main on Apr 18, 2024 (14 checks passed)
@joerunde deleted the tgis-metrics branch on Apr 18, 2024 at 21:02
@njhill (Contributor) left a comment

Belated review, but this looks great, thanks @joerunde!

Later we can perhaps make the TGIS metrics configurable, i.e. to be able to toggle back to regular vLLM metrics.

(diff excerpt the comment below refers to)

  from vllm.entrypoints.grpc.pb.generation_pb2 import (GenerationResponse,
                                                       Parameters, StopReason)


  def log_response(inputs: List[str], params: Parameters, prefix_id: str,
-                  response: GenerationResponse, times, kind_log: str,
-                  method_str: str, logger: logging.Logger):
+                  response: GenerationResponse, engine_response: RequestOutput,
Review comment (Contributor):

minor query: any reason for passing engine_response rather than engine_response.metrics here?

(diff excerpt: injecting the TGISStatLogger into the engine)

            vllm_stat_logger=vllm_stat_logger,
            max_sequence_len=self.config.max_model_len)
        # 🌶️🌶️🌶️ sneaky sneak
        self.engine.engine.stat_logger = tgis_stats_logger
Review comment (Contributor):

🌶️

@joerunde (Collaborator, Author) commented May 3, 2024

> Later we can perhaps make the TGIS metrics configurable, i.e. to be able to toggle back to regular vLLM metrics.

@njhill this PR shouldn't disable the regular vLLM metrics; it'll output both!
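
Purely as an illustration of that configurability idea (nothing here is in the PR; the DISABLE_TGIS_METRICS variable, the build_stat_logger helper, and the import path are invented or assumed), a future toggle along the lines @njhill suggests might look something like this.

# Hypothetical sketch of the configurability discussed above; the env var and
# helper are invented, and the TGISStatLogger import path is an assumption
# based on the tgis_utils/metrics.py location mentioned in the PR description.
import os

from tgis_utils.metrics import TGISStatLogger  # assumed import path

def build_stat_logger(vllm_stat_logger, max_sequence_len):
    if os.getenv("DISABLE_TGIS_METRICS", "false").lower() == "true":
        # Toggle back to only the regular vLLM StatLogger.
        return vllm_stat_logger
    return TGISStatLogger(vllm_stat_logger=vllm_stat_logger,
                          max_sequence_len=max_sequence_len)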
