
[Bug]: /metrics endpoint shows less information at latest (0.5.4) vllm docker container. #7782

Closed
kulievvitaly opened this issue Aug 22, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@kulievvitaly

Your current environment

Run in Docker. Example:

sudo docker run --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=2"' -p 5432:8000 --mount type=bind,source=/ssd_2/huggingface,target=/root/.cache/huggingface vllm/vllm-openai:v0.5.1 --model casperhansen/llama-3-70b-instruct-awq --dtype half --max_model_len 8000 -q awq --gpu-memory-utilization 0.94 --enable-prefix-caching

🐛 Describe the bug

The /metrics endpoint from the Docker image at version 0.5.1 contains all the data about the vLLM server, for example the 'gpu_cache_usage_perc' metric. Here is example output from the /metrics endpoint (the leading # has been removed from HELP/TYPE lines):

HELP python_gc_objects_collected_total Objects collected during gc
TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7902.0
python_gc_objects_collected_total{generation="1"} 4609.0
python_gc_objects_collected_total{generation="2"} 1618.0
HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
HELP python_gc_collections_total Number of times this generation was collected
TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 956.0
python_gc_collections_total{generation="1"} 85.0
python_gc_collections_total{generation="2"} 79.0
HELP python_info Python platform information
TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
HELP process_virtual_memory_bytes Virtual memory size in bytes.
TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.0322907136e+011
HELP process_resident_memory_bytes Resident memory size in bytes.
TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.303376896e+09
HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72432315514e+09
HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 103.68
HELP process_open_fds Number of open file descriptors.
TYPE process_open_fds gauge
process_open_fds 76.0
HELP process_max_fds Maximum number of open file descriptors.
TYPE process_max_fds gauge
process_max_fds 1.048576e+06
HELP vllm:cache_config_info information of cache_config
TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",enable_prefix_caching="True",gpu_memory_utilization="0.94",num_cpu_blocks="819",num_gpu_blocks="7093",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
HELP vllm:num_requests_running Number of requests currently running on GPU.
TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:num_requests_waiting Number of requests waiting to be processed.
TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:num_requests_swapped Number of requests swapped to CPU.
TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:prompt_tokens_total Number of prefill tokens processed.
TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:generation_tokens_total Number of generation tokens processed.
TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
TYPE vllm:time_to_first_token_seconds histogram
HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
TYPE vllm:time_per_output_token_seconds histogram
HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
TYPE vllm:e2e_request_latency_seconds histogram
HELP vllm:request_prompt_tokens Number of prefill tokens processed.
TYPE vllm:request_prompt_tokens histogram
HELP vllm:request_generation_tokens Number of generation tokens processed.
TYPE vllm:request_generation_tokens histogram
HELP vllm:request_params_best_of Histogram of the best_of request parameter.
TYPE vllm:request_params_best_of histogram
HELP vllm:request_params_n Histogram of the n request parameter.
TYPE vllm:request_params_n histogram
HELP vllm:request_success_total Count of successfully processed requests.
TYPE vllm:request_success_total counter
HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0
HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="casperhansen/llama-3-70b-instruct-awq"} 0.0

The /metrics endpoint from the Docker image at version 0.5.4 does not contain gpu_cache_usage_perc and the other important data. Here is example output from the /metrics endpoint (the leading # has been removed from HELP/TYPE lines):

HELP python_gc_objects_collected_total Objects collected during gc
TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 8639.0
python_gc_objects_collected_total{generation="1"} 4636.0
python_gc_objects_collected_total{generation="2"} 3129.0
HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
HELP python_gc_collections_total Number of times this generation was collected
TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 1260.0
python_gc_collections_total{generation="1"} 114.0
python_gc_collections_total{generation="2"} 8.0
HELP python_info Python platform information
TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
HELP process_virtual_memory_bytes Virtual memory size in bytes.
TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.3026189312e+010
HELP process_resident_memory_bytes Resident memory size in bytes.
TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.19036416e+08
HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72425326064e+09
HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 625.99
HELP process_open_fds Number of open file descriptors.
TYPE process_open_fds gauge
process_open_fds 140.0
HELP process_max_fds Maximum number of open file descriptors.
TYPE process_max_fds gauge
process_max_fds 1.048576e+06

The problem occurs in Docker with different models and on different hardware. Is it possible to get the gpu_cache_usage_perc and num_requests_* metrics on the latest Docker versions?
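
For reference, a quick check (a sketch assuming the 5432:8000 port mapping from the command above) that counts the vLLM-specific metric lines; it prints a positive count on v0.5.1 and 0 on an affected v0.5.4:

# Count vllm:* sample lines exposed by the endpoint.
curl -s http://localhost:5432/metrics | grep -c '^vllm:'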

kulievvitaly added the bug label on Aug 22, 2024
@robertgshaw2-redhat (Collaborator) commented Aug 22, 2024

@kulievvitaly - there is a bug in v0.5.4 re: Prometheus metrics. Sorry about that. As a workaround on this version, you can run with --disable-frontend-multiprocessing or downgrade to v0.5.3.
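
For example, the launch command from the environment section above, adapted as a sketch to the v0.5.4 image with the workaround flag appended (paths, ports, and GPU selection are the reporter's original values):

# Same command as in the environment section, with --disable-frontend-multiprocessing appended (workaround for v0.5.4).
sudo docker run --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=2"' -p 5432:8000 --mount type=bind,source=/ssd_2/huggingface,target=/root/.cache/huggingface vllm/vllm-openai:v0.5.4 --model casperhansen/llama-3-70b-instruct-awq --dtype half --max_model_len 8000 -q awq --gpu-memory-utilization 0.94 --enable-prefix-caching --disable-frontend-multiprocessing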

The issue is fixed in main with #7279, and we plan to release soon.

@kulievvitaly (Author)

> run with --disable-frontend-multiprocessing

Thank you. Confirmed: it fixes the bug. Will wait for the next release!

@robertgshaw2-redhat (Collaborator)

New release is out

@kulievvitaly (Author)

> New release is out

Confirmed, it is fixed in the 0.5.5 Docker image.
