
Port metrics from aioprometheus to prometheus_client #2730

Merged
simon-mo merged 10 commits into vllm-project:main from port-to-prometheus-client on Feb 25, 2024

Conversation

@hmellor (Collaborator) commented Feb 2, 2024

As mentioned in #2316 (comment), prometheus_client:

  • Is the official Prometheus Python client
  • Is more actively maintained than aioprometheus
  • Has >23x more stars than aioprometheus

This PR ports the metrics from aioprometheus to prometheus_client.

Notable changes:

  • The Prometheus metrics are no longer in the global scope; they are contained in a Metrics class, of which StatsLogger holds an instance.
  • The metrics in prometheus_client have their label names assigned at construction (no more global labels dictionary).
  • I then pass the values as kwargs using the chainable labels() method (see the sketch after this list).
  • Tests can now pass any parameter to the LLM() class via kwargs (this allows tests to lower gpu_memory_utilization on a case by case basis, which was useful when debugging issues with the metrics tests)
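Concretely, the pattern looks roughly like this (a minimal sketch, not the exact vLLM code; the gauge's metric name and the model_name label are illustrative assumptions, while the histogram matches the definition quoted later in this thread):

```python
from prometheus_client import Gauge, Histogram


class Metrics:
    def __init__(self, labelnames: list):
        # Label names are fixed at construction time, replacing the
        # global labels dictionary used with aioprometheus.
        self.gauge_scheduler_running = Gauge(
            "vllm:num_requests_running",
            "Number of requests currently running.",
            labelnames=labelnames)
        self.histogram_e2e_request_latency = Histogram(
            "vllm:e2e_request_latency_seconds",
            "Histogram of end to end request latency in seconds.",
            labelnames=labelnames,
            buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])


metrics = Metrics(labelnames=["model_name"])
# Label values are passed as kwargs through the chainable labels() method.
metrics.gauge_scheduler_running.labels(model_name="my-model").set(3)
metrics.histogram_e2e_request_latency.labels(model_name="my-model").observe(4.2)
```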

@simon-mo simon-mo self-assigned this Feb 2, 2024
@robertgshaw2-redhat (Collaborator) commented Feb 2, 2024

Thanks for doing this @hmellor! In general LGTM; I will try it out end-to-end with the Grafana dashboard I made this weekend.

@hmellor (Collaborator, Author) commented Feb 2, 2024

No problem @rib-2!

One thing I'm not sure about is the documentation of the metrics. The documentation page https://docs.vllm.ai/en/latest/serving/metrics.html doesn't contain the changes you made in your PR, so I think something may be broken (either that or the documentation is updated asynchronously and it just hasn't happened yet).

@robertgshaw2-redhat (Collaborator) commented:

I do not think the documentation is pointing at main (and the previous Prometheus stuff I added is not yet in the current PyPI version), but I am not sure.

"vllm:e2e_request_latency_seconds",
"Histogram of end to end request latency in seconds.",
buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])
class Metrics:
A reviewer (Collaborator) asked:

Is the class itself necessary?

@hmellor (Collaborator, Author) replied:

I needed to delay instantiating the Counters, Gauges, and Histograms until the StatsLogger is instantiated, so that they can be constructed with their label names known.

I opted for a class so that:

  • The code didn't need to be moved around very much
  • Each metric can be accessed like self.metrics.gauge_scheduler_running

Here are two alternatives:

  1. Replace the class with a function and return a dict. But then you'd have to access them like self.metrics["gauge_scheduler_running"], which doesn't seem as nice.
  2. Move all of the Counter, Gauge and Histogram instantiations into StatsLogger.__init__(). Then they can all be put directly into self.

Let me know what you prefer
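For reference, alternative 2 would look roughly like this (a minimal sketch; the real StatsLogger constructor takes more arguments, and the metric name is illustrative):

```python
from prometheus_client import Gauge


class StatsLogger:
    def __init__(self, labelnames: list):
        # Alternative 2: construct each metric directly on self instead
        # of grouping them in a separate Metrics class.
        self.gauge_scheduler_running = Gauge(
            "vllm:num_requests_running",
            "Number of requests currently running.",
            labelnames=labelnames)
        # ...and likewise for every other Counter, Gauge and Histogram.
```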

@ronensc (Contributor) left a comment:

@hmellor Thanks for the PR!

Another advantage of porting to the official Prometheus library is that we'll get the following metrics for free:

| Metric | Type | Description |
| --- | --- | --- |
| python_gc_objects_collected_total | counter | Objects collected during GC |
| python_gc_objects_uncollectable_total | counter | Uncollectable objects found during GC |
| python_gc_collections_total | counter | Number of times this generation was collected |
| python_info | gauge | Python platform information |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| process_resident_memory_bytes | gauge | Resident memory size in bytes |
| process_start_time_seconds | gauge | Start time of the process since Unix epoch in seconds |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| process_open_fds | gauge | Number of open file descriptors |
| process_max_fds | gauge | Maximum number of open file descriptors |
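These are provided by the collectors that prometheus_client installs in its default registry on import, so exposing that registry is enough to surface them; for example (a minimal sketch):

```python
from prometheus_client import REGISTRY, generate_latest

# The default registry already holds the platform, GC and process
# collectors, so scraping it yields the metrics listed above for free.
print(generate_latest(REGISTRY).decode())
```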

@hmellor hmellor requested a review from simon-mo February 21, 2024 21:23
@simon-mo (Collaborator) left a comment:

Is there a replacement for

```python
app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
```

I think this PR removed the HTTP metrics.

@simon-mo (Collaborator) commented:

Can you also pull the latest main branch and trigger the CI to run?

@hmellor (Collaborator, Author) commented Feb 22, 2024

> Is there a replacement for
>
> ```python
> app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
> ```
>
> I think this PR removed the HTTP metrics.

prometheus_client does automatically export some metrics, but only about the process, GC, and platform (https://prometheus.github.io/client_python/collector/). We could add something like MetricsMiddleware to vLLM using a custom collector, or add it directly to prometheus_client with a PR.
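A custom collector along those lines would look roughly like this (a sketch; HTTPMetricsCollector and its request counter are hypothetical and would still need to be driven by ASGI middleware):

```python
from prometheus_client.core import REGISTRY, CounterMetricFamily


class HTTPMetricsCollector:
    """Hypothetical stand-in for aioprometheus's MetricsMiddleware."""

    def __init__(self):
        # In a real implementation this counter would be incremented by
        # ASGI middleware wrapping the server.
        self.requests_total = 0

    def collect(self):
        # Called by prometheus_client on every scrape; yields freshly
        # built metric families.
        yield CounterMetricFamily(
            "http_requests_total",
            "Total HTTP requests handled by the server.",
            value=self.requests_total)


REGISTRY.register(HTTPMetricsCollector())
```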

@simon-mo (Collaborator) commented:

Hmm, it looks non-trivial to add a robust custom collector. Since we didn't document the HTTP metrics, I think it is fine for now.

@hmellor (Collaborator, Author) commented Feb 22, 2024

Here is the source of MetricsMiddleware, for future reference: https://github.com/claws/aioprometheus/blob/master/src/aioprometheus/asgi/middleware.py


@hmellor (Collaborator, Author) commented Feb 23, 2024

The test was still treating the metrics as global variables and trying to access them using the aioprometheus API.

I've updated the tests to access the metrics from inside the StatLogger (where they now live) using the prometheus_client API.
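For example, prometheus_client lets a test read a metric's current value back out of the registry (a minimal sketch; the metric and label names are illustrative):

```python
from prometheus_client import REGISTRY

# get_sample_value() returns the current value of a sample, or None if
# no sample with that name/labels combination has been recorded yet.
value = REGISTRY.get_sample_value(
    "vllm:num_requests_running", {"model_name": "my-model"})
assert value is not None
```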

@hmellor (Collaborator, Author) commented Feb 23, 2024

The next issue was that prometheus_client is stateful and stores registered collectors in prometheus_client.REGISTRY. When the test created >1 LLM (and therefore more than one Metrics) it raised an error stating that these metrics had already been registered. I didn't see this when I ran the tests locally because I don't have enough VRAM to run both tests in the same session (i.e. both tests pass individually).

My first attempt at a fix was to reset prometheus_client.REGISTRY in Metrics.__init__(), but that didn't work.

@hmellor (Collaborator, Author) commented Feb 23, 2024

This fix will unregister any vLLM collectors if they already exist in the registry.
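The approach looks roughly like this (a sketch; REGISTRY.unregister() is public API, but iterating _collector_to_names relies on a registry internal):

```python
import prometheus_client


def unregister_vllm_metrics() -> None:
    # Remove previously registered vLLM collectors so that a second LLM
    # (and therefore a second Metrics instance) in the same process can
    # register its metrics without a duplicate-registration error.
    for collector in list(prometheus_client.REGISTRY._collector_to_names):
        names = prometheus_client.REGISTRY._collector_to_names[collector]
        if any(name.startswith("vllm:") for name in names):
            prometheus_client.REGISTRY.unregister(collector)
```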

I also added kwargs to VllmRunner so that I could pass gpu_memory_utilization=0.4 in the metrics tests, which avoids OOM when running both tests in the same pytest session.

@hmellor hmellor requested a review from simon-mo February 23, 2024 22:47
@simon-mo simon-mo merged commit ef978fe into vllm-project:main Feb 25, 2024
22 checks passed
@hmellor hmellor deleted the port-to-prometheus-client branch March 4, 2024 09:40
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024
@hmellor hmellor mentioned this pull request Mar 12, 2024
@ywang96 (Member) commented Apr 26, 2024

@hmellor @simon-mo FYI, with the prometheus_client library I've seen all GET /metrics requests getting a 307 Temporary Redirect. It looks like Harry already asked about this in prometheus/client_python#1016, but I was just wondering what the plan is going forward.

@hmellor (Collaborator, Author) commented May 1, 2024

Currently, there is a PR open to revert back to aioprometheus (#4511). However, I am not a fan of this solution because that package is relatively stale and will slowly drift as the Prometheus server is updated. One notable omission is the Info metric, which we have recently been able to start using thanks to the switch to prometheus_client.

In the PR I have proposed a workaround that removes the redirect and allows us to continue using prometheus_client. I don't think it would be too hard to get this fixed properly upstream in prometheus_client (rather than with my workaround).
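For reference, the workaround is roughly as follows: mount prometheus_client's ASGI metrics app, then widen the mount's path regex so a bare GET /metrics matches directly instead of being redirected (a sketch; overriding path_regex leans on Starlette routing internals):

```python
import re

from fastapi import FastAPI
from prometheus_client import make_asgi_app
from starlette.routing import Mount

app = FastAPI()

# Mounting an ASGI sub-app at "/metrics" makes Starlette answer a bare
# GET /metrics with a 307 redirect to /metrics/. Widening the mount's
# path regex lets the bare path match directly, avoiding the redirect.
route = Mount("/metrics", make_asgi_app())
route.path_regex = re.compile("^/metrics(?P<path>.*)$")
app.routes.append(route)
```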
