
[Bug]: /metrics endpoint shows less information at latest (0.5.4) vllm docker container. #7782

Closed
kulievvitaly opened this issue Aug 22, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@kulievvitaly

Your current environment

Run in Docker. Example:

sudo docker run --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=2"' -p 5432:8000 --mount type=bind,source=/ssd_2/huggingface,target=/root/.cache/huggingface vllm/vllm-openai:v0.5.1 --model casperhansen/llama-3-70b-instruct-awq --dtype half --max_model_len 8000 -q awq --gpu-memory-utilization 0.94 --enable-prefix-caching

🐛 Describe the bug

The /metrics endpoint from the Docker image at version 0.5.1 contains all the data about the vLLM server, for example the 'gpu_cache_usage_perc' metric. Here is example output from the /metrics endpoint (the leading # has been removed from HELP/TYPE lines):

HELP python_gc_objects_collected_total Objects collected during gc
TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7902.0
python_gc_objects_collected_total{generation="1"} 4609.0
python_gc_objects_collected_total{generation="2"} 1618.0
HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
HELP python_gc_collections_total Number of times this generation was collected
TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 956.0
python_gc_collections_total{generation="1"} 85.0
python_gc_collections_total{generation="2"} 79.0
HELP python_info Python platform information
TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
HELP process_virtual_memory_bytes Virtual memory size in bytes.
TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.0322907136e+011
HELP process_resident_memory_bytes Resident memory size in bytes.
TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.303376896e+09
HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72432315514e+09
HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 103.68
HELP process_open_fds Number of open file descriptors.
TYPE process_open_fds gauge
process_open_fds 76.0
HELP process_max_fds Maximum number of open file descriptors.
TYPE process_max_fds gauge
process_max_fds 1.048576e+06
HELP vllm:cache_config_info information of cache_config
TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",enable_prefix_caching="True",gpu_memory_utilization="0.94",num_cpu_blocks="819",num_gpu_blocks="7093",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
HELP vllm:num_requests_running Number of requests currently running on GPU.
TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:num_requests_waiting Number of requests waiting to be processed.
TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:num_requests_swapped Number of requests swapped to CPU.
TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:prompt_tokens_total Number of prefill tokens processed.
TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:generation_tokens_total Number of generation tokens processed.
TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
TYPE vllm:time_to_first_token_seconds histogram
HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
TYPE vllm:time_per_output_token_seconds histogram
HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
TYPE vllm:e2e_request_latency_seconds histogram
HELP vllm:request_prompt_tokens Number of prefill tokens processed.
TYPE vllm:request_prompt_tokens histogram
HELP vllm:request_generation_tokens Number of generation tokens processed.
TYPE vllm:request_generation_tokens histogram
HELP vllm:request_params_best_of Histogram of the best_of request parameter.
TYPE vllm:request_params_best_of histogram
HELP vllm:request_params_n Histogram of the n request parameter.
TYPE vllm:request_params_n histogram
HELP vllm:request_success_total Count of successfully processed requests.
TYPE vllm:request_success_total counter
HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0

The /metrics endpoint from the Docker image at version 0.5.4 does not contain gpu_cache_usage_perc and the other important data. Here is example output from the /metrics endpoint (the leading # has been removed from HELP/TYPE lines):

HELP python_gc_objects_collected_total Objects collected during gc
TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 8639.0
python_gc_objects_collected_total{generation="1"} 4636.0
python_gc_objects_collected_total{generation="2"} 3129.0
HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
HELP python_gc_collections_total Number of times this generation was collected
TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 1260.0
python_gc_collections_total{generation="1"} 114.0
python_gc_collections_total{generation="2"} 8.0
HELP python_info Python platform information
TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
HELP process_virtual_memory_bytes Virtual memory size in bytes.
TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.3026189312e+010
HELP process_resident_memory_bytes Resident memory size in bytes.
TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.19036416e+08
HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72425326064e+09
HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 625.99
HELP process_open_fds Number of open file descriptors.
TYPE process_open_fds gauge
process_open_fds 140.0
HELP process_max_fds Maximum number of open file descriptors.
TYPE process_max_fds gauge
process_max_fds 1.048576e+06

The problem occurs in Docker with different models and on different hardware. Is it possible to get the gpu_cache_usage_perc and num_requests_* metrics on the latest Docker versions?
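
For reference, a quick check (a sketch assuming the 5432:8000 port mapping from the command above) that counts the vLLM-specific metric lines; it prints a positive count on v0.5.1 and 0 on an affected v0.5.4:

# Count vllm:* sample lines exposed by the endpoint.
curl -s http://localhost:5432/metrics | grep -c '^vllm:'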

kulievvitaly added the bug label on Aug 22, 2024
@robertgshaw2-redhat (Collaborator) commented Aug 22, 2024

@kulievvitaly - there is a bug in v0.5.4 re: Prometheus metrics. Sorry about that. As a workaround on this version, you can run with --disable-frontend-multiprocessing or downgrade to v0.5.3.
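
For example, the launch command from the environment section above, adapted as a sketch to the v0.5.4 image with the workaround flag appended (paths, ports, and GPU selection are the reporter's original values):

# Same command as in the environment section, with --disable-frontend-multiprocessing appended (workaround for v0.5.4).
sudo docker run --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=2"' -p 5432:8000 --mount type=bind,source=/ssd_2/huggingface,target=/root/.cache/huggingface vllm/vllm-openai:v0.5.4 --model casperhansen/llama-3-70b-instruct-awq --dtype half --max_model_len 8000 -q awq --gpu-memory-utilization 0.94 --enable-prefix-caching --disable-frontend-multiprocessing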

The issue is fixed in main with #7279, and we plan to release soon.

@kulievvitaly (Author)

> run with --disable-frontend-multiprocessing

Thank you. Confirmed: it fixes the bug. Will wait for the next release!

@robertgshaw2-redhat (Collaborator)

New release is out

@kulievvitaly (Author)

> New release is out

Confirmed, it is fixed in the 0.5.5 Docker image.
