
Port metrics from aioprometheus to prometheus_client #2730

Merged
simon-mo merged 10 commits into vllm-project:main from port-to-prometheus-client on Feb 25, 2024

Conversation

@hmellor (Collaborator) commented Feb 2, 2024

As mentioned in #2316 (comment), prometheus_client:

  • Is the official Prometheus Python client
  • Is more actively maintained than aioprometheus
  • Has >23x more stars than aioprometheus

This PR ports the metrics from aioprometheus to prometheus_client.

Notable changes:

  • The Prometheus metrics are no longer in the global scope; they are contained in a Metrics class, of which StatsLogger holds an instance.
  • The metrics in prometheus_client have their label names assigned at construction (no more global labels dictionary).
  • I then pass the values as kwargs using the chainable labels() method (see the sketch after this list).
  • Tests can now pass any parameter to the LLM() class via kwargs (this allows tests to lower gpu_memory_utilization on a case by case basis, which was useful when debugging issues with the metrics tests)
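Concretely, the pattern looks roughly like this (a minimal sketch, not the exact vLLM code; the gauge's metric name and the model_name label are illustrative assumptions, while the histogram matches the definition quoted later in this thread):

```python
from prometheus_client import Gauge, Histogram


class Metrics:
    def __init__(self, labelnames: list):
        # Label names are fixed at construction time, replacing the
        # global labels dictionary used with aioprometheus.
        self.gauge_scheduler_running = Gauge(
            "vllm:num_requests_running",
            "Number of requests currently running.",
            labelnames=labelnames)
        self.histogram_e2e_request_latency = Histogram(
            "vllm:e2e_request_latency_seconds",
            "Histogram of end to end request latency in seconds.",
            labelnames=labelnames,
            buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])


metrics = Metrics(labelnames=["model_name"])
# Label values are passed as kwargs through the chainable labels() method.
metrics.gauge_scheduler_running.labels(model_name="my-model").set(3)
metrics.histogram_e2e_request_latency.labels(model_name="my-model").observe(4.2)
```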

@simon-mo simon-mo self-assigned this Feb 2, 2024
@robertgshaw2-redhat (Collaborator) commented Feb 2, 2024

Thanks for doing this @hmellor! In general LGTM; I will try it out end-to-end with the Grafana dashboard I made this weekend.

@hmellor (Collaborator, Author) commented Feb 2, 2024

No problem @rib-2!

One thing I'm not sure about is the documentation of the metrics. The documentation page https://docs.vllm.ai/en/latest/serving/metrics.html doesn't contain the changes you made in your PR, so I think something may be broken (either that or the documentation is updated asynchronously and it just hasn't happened yet).

@robertgshaw2-redhat (Collaborator) commented:

I do not think the documentation is pointing at main (and the previous Prometheus stuff I added is not yet in the current PyPI version), but I am not sure.

"vllm:e2e_request_latency_seconds",
"Histogram of end to end request latency in seconds.",
buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])
class Metrics:
A reviewer (Collaborator) asked:

Is the class itself necessary?

@hmellor (Collaborator, Author) replied:

I needed to delay instantiating the Counters, Gauges, and Histograms until the StatsLogger is instantiated, so that they can be constructed with their label names known.

I opted for a class so that:

  • The code didn't need to be moved around very much
  • Each metric can be accessed like self.metrics.gauge_scheduler_running

Here are two alternatives:

  1. Replace the class with a function and return a dict. But then you'd have to access them like self.metrics["gauge_scheduler_running"], which doesn't seem as nice.
  2. Move all of the Counter, Gauge and Histogram instantiations into StatsLogger.__init__(). Then they can all be put directly into self.

Let me know what you prefer
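For reference, alternative 2 would look roughly like this (a minimal sketch; the real StatsLogger constructor takes more arguments, and the metric name is illustrative):

```python
from prometheus_client import Gauge


class StatsLogger:
    def __init__(self, labelnames: list):
        # Alternative 2: construct each metric directly on self instead
        # of grouping them in a separate Metrics class.
        self.gauge_scheduler_running = Gauge(
            "vllm:num_requests_running",
            "Number of requests currently running.",
            labelnames=labelnames)
        # ...and likewise for every other Counter, Gauge and Histogram.
```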

@ronensc (Contributor) left a comment:

@hmellor Thanks for the PR!

Another advantage of porting to the official Prometheus library is that we'll get the following metrics for free:

| Metric | Type | Description |
| --- | --- | --- |
| python_gc_objects_collected_total | counter | Objects collected during GC |
| python_gc_objects_uncollectable_total | counter | Uncollectable objects found during GC |
| python_gc_collections_total | counter | Number of times this generation was collected |
| python_info | gauge | Python platform information |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| process_resident_memory_bytes | gauge | Resident memory size in bytes |
| process_start_time_seconds | gauge | Start time of the process since Unix epoch in seconds |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| process_open_fds | gauge | Number of open file descriptors |
| process_max_fds | gauge | Maximum number of open file descriptors |
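These are provided by the collectors that prometheus_client installs in its default registry on import, so exposing that registry is enough to surface them; for example (a minimal sketch):

```python
from prometheus_client import REGISTRY, generate_latest

# The default registry already holds the platform, GC and process
# collectors, so scraping it yields the metrics listed above for free.
print(generate_latest(REGISTRY).decode())
```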

@hmellor hmellor requested a review from simon-mo February 21, 2024 21:23
@simon-mo (Collaborator) left a comment:

Is there a replacement for

```python
app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
```

I think this PR removed the HTTP metrics.

@simon-mo (Collaborator) commented:

Can you also pull the latest main branch and trigger the CI to run?

@hmellor (Collaborator, Author) commented Feb 22, 2024

> Is there a replacement for
>
> ```python
> app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
> ```
>
> I think this PR removed the HTTP metrics.

prometheus_client does automatically export some metrics, but only about the process, GC, and platform (https://prometheus.github.io/client_python/collector/). We could add something like MetricsMiddleware to vLLM using a custom collector, or add it directly to prometheus_client with a PR.
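A custom collector along those lines would look roughly like this (a sketch; HTTPMetricsCollector and its request counter are hypothetical and would still need to be driven by ASGI middleware):

```python
from prometheus_client.core import REGISTRY, CounterMetricFamily


class HTTPMetricsCollector:
    """Hypothetical stand-in for aioprometheus's MetricsMiddleware."""

    def __init__(self):
        # In a real implementation this counter would be incremented by
        # ASGI middleware wrapping the server.
        self.requests_total = 0

    def collect(self):
        # Called by prometheus_client on every scrape; yields freshly
        # built metric families.
        yield CounterMetricFamily(
            "http_requests_total",
            "Total HTTP requests handled by the server.",
            value=self.requests_total)


REGISTRY.register(HTTPMetricsCollector())
```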

@simon-mo (Collaborator) commented:

Hmm, it looks non-trivial to add a robust custom collector. Since we didn't document the HTTP metrics, I think it is fine for now.

@hmellor (Collaborator, Author) commented Feb 22, 2024

Here is the source of MetricsMiddleware, for future reference: https://github.com/claws/aioprometheus/blob/master/src/aioprometheus/asgi/middleware.py


@hmellor (Collaborator, Author) commented Feb 23, 2024

The test was still treating the metrics as global variables and trying to access them using the aioprometheus API.

I've updated the tests to access the metrics from inside the StatLogger (where they now live) using the prometheus_client API.
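For example, prometheus_client lets a test read a metric's current value back out of the registry (a minimal sketch; the metric and label names are illustrative):

```python
from prometheus_client import REGISTRY

# get_sample_value() returns the current value of a sample, or None if
# no sample with that name/labels combination has been recorded yet.
value = REGISTRY.get_sample_value(
    "vllm:num_requests_running", {"model_name": "my-model"})
assert value is not None
```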

@hmellor (Collaborator, Author) commented Feb 23, 2024

The next issue was that prometheus_client is stateful and stores registered collectors in prometheus_client.REGISTRY. When the test created >1 LLM (and therefore more than one Metrics) it raised an error stating that these metrics had already been registered. I didn't see this when I ran the tests locally because I don't have enough VRAM to run both tests in the same session (i.e. both tests pass individually).

My first attempt at a fix was to reset prometheus_client.REGISTRY in Metrics.__init__(), but that didn't work.

@hmellor (Collaborator, Author) commented Feb 23, 2024

This fix will unregister any vLLM collectors if they already exist in the registry.
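The approach looks roughly like this (a sketch; REGISTRY.unregister() is public API, but iterating _collector_to_names relies on a registry internal):

```python
import prometheus_client


def unregister_vllm_metrics() -> None:
    # Remove previously registered vLLM collectors so that a second LLM
    # (and therefore a second Metrics instance) in the same process can
    # register its metrics without a duplicate-registration error.
    for collector in list(prometheus_client.REGISTRY._collector_to_names):
        names = prometheus_client.REGISTRY._collector_to_names[collector]
        if any(name.startswith("vllm:") for name in names):
            prometheus_client.REGISTRY.unregister(collector)
```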

I also added kwargs to VllmRunner so that I could pass gpu_memory_utilization=0.4 in the metrics tests, which avoids OOM when running both tests in the same pytest session.

@hmellor hmellor requested a review from simon-mo February 23, 2024 22:47
@simon-mo simon-mo merged commit ef978fe into vllm-project:main Feb 25, 2024
22 checks passed
@hmellor hmellor deleted the port-to-prometheus-client branch March 4, 2024 09:40
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024
@hmellor hmellor mentioned this pull request Mar 12, 2024
@ywang96 (Member) commented Apr 26, 2024

@hmellor @simon-mo FYI, with the prometheus_client library I've seen all GET /metrics requests getting a 307 Temporary Redirect. It looks like Harry already asked about this in prometheus/client_python#1016, but I was just wondering what the plan is going forward.

@hmellor (Collaborator, Author) commented May 1, 2024

Currently, there is a PR open to revert back to aioprometheus (#4511). However, I am not a fan of this solution because that package is relatively stale and will slowly drift as the Prometheus server is updated. One notable omission is the Info metric, which we have recently been able to start using thanks to the switch to prometheus_client.

In the PR I have proposed a workaround that removes the redirect and allows us to continue using prometheus_client. I don't think it would be too hard to get this fixed properly upstream in prometheus_client (rather than with my workaround).
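For reference, the workaround is roughly as follows: mount prometheus_client's ASGI metrics app, then widen the mount's path regex so a bare GET /metrics matches directly instead of being redirected (a sketch; overriding path_regex leans on Starlette routing internals):

```python
import re

from fastapi import FastAPI
from prometheus_client import make_asgi_app
from starlette.routing import Mount

app = FastAPI()

# Mounting an ASGI sub-app at "/metrics" makes Starlette answer a bare
# GET /metrics with a 307 redirect to /metrics/. Widening the mount's
# path regex lets the bare path match directly, avoiding the redirect.
route = Mount("/metrics", make_asgi_app())
route.path_regex = re.compile("^/metrics(?P<path>.*)$")
app.routes.append(route)
```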
