Port metrics from `aioprometheus` to `prometheus_client` #2730
Conversation
Thanks for doing this @hmellor! In general LGTM, will try it out on the E2E with the Grafana dashboard I made this weekend.

No problem @rib-2! One thing I'm not sure about is the documentation of the metrics. The documentation page https://docs.vllm.ai/en/latest/serving/metrics.html doesn't contain the changes you made in your PR, so I think something may be broken (either that, or the documentation is updated asynchronously and it just hasn't happened yet).

I do not think the documentation is pointing at main (and the previous Prometheus stuff I added is not yet in the current PyPI version), but I am not sure.
"vllm:e2e_request_latency_seconds", | ||
"Histogram of end to end request latency in seconds.", | ||
buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0]) | ||
class Metrics: |
Is the class itself necessary?
I needed to delay the instantiation of the `Counter`s, `Gauge`s and `Histogram`s until instantiation of the `StatsLogger`, so that they can be constructed with their label names known.

I opted for a class so that:

- The code didn't need to be moved around very much
- Each metric can be accessed like `self.metrics.gauge_scheduler_running`

Here are two alternatives:

- Replace the class with a function and return a `dict`. But then you'd have to access them like `self.metrics["gauge_scheduler_running"]`, which doesn't seem as nice.
- Move all of the `Counter`, `Gauge` and `Histogram` instantiations into `StatsLogger.__init__()`. Then they can all be put directly into `self`.

Let me know what you prefer
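A minimal sketch of the pattern described above, assuming illustrative metric and label names (these are not vLLM's actual definitions): the `Counter`/`Gauge`/`Histogram` objects are only created inside `Metrics.__init__`, so construction is delayed until the label names are known.

```python
# Sketch only: metric names, label names and the private registry are
# illustrative assumptions, not vLLM's actual definitions.
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

class Metrics:
    def __init__(self, labelnames):
        # A private registry keeps this sketch isolated from the global one.
        self.registry = CollectorRegistry()
        self.gauge_scheduler_running = Gauge(
            "vllm:num_requests_running",
            "Number of requests currently running.",
            labelnames=labelnames, registry=self.registry)
        self.counter_prompt_tokens = Counter(
            "vllm:prompt_tokens",
            "Number of prompt tokens processed.",
            labelnames=labelnames, registry=self.registry)
        self.histogram_e2e_latency = Histogram(
            "vllm:e2e_request_latency_seconds",
            "Histogram of end to end request latency in seconds.",
            labelnames=labelnames, registry=self.registry,
            buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])

class StatsLogger:
    def __init__(self, labelnames):
        # Label names are only known here, hence the delayed construction.
        self.metrics = Metrics(labelnames)
```

With this shape a stat is recorded as e.g. `self.metrics.gauge_scheduler_running.labels(model_name=...).set(n)`.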
@hmellor Thanks for the PR!
Another advantage of porting to the official Prometheus library is that we'll get the following metrics for free:
| Metric | Type | Description |
|---|---|---|
| python_gc_objects_collected_total | counter | Objects collected during gc |
| python_gc_objects_uncollectable_total | counter | Uncollectable objects found during GC |
| python_gc_collections_total | counter | Number of times this generation was collected |
| python_info | gauge | Python platform information |
| process_virtual_memory_bytes | gauge | Virtual memory size in bytes |
| process_resident_memory_bytes | gauge | Resident memory size in bytes |
| process_start_time_seconds | gauge | Start time of the process since unix epoch in seconds |
| process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds |
| process_open_fds | gauge | Number of open file descriptors |
| process_max_fds | gauge | Maximum number of open file descriptors |
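These metrics come from the GC, platform and process collectors that `prometheus_client` registers on its global `REGISTRY` at import time (the `process_*` metrics are platform-dependent and may be absent on non-Linux systems). A quick way to confirm them:

```python
# Dump the default registry; prometheus_client registers its GC, platform
# and process collectors automatically at import time.
from prometheus_client import REGISTRY, generate_latest

text = generate_latest(REGISTRY).decode()
print("python_info" in text)  # the platform collector is available everywhere
```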
Is there a replacement for

```python
app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
```

I think this PR removed the HTTP metrics.
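For reference, a bare-bones substitute (a sketch under my own assumptions, not anything from this PR) could count requests in a small ASGI middleware backed by a `prometheus_client` `Counter`:

```python
# Hypothetical minimal stand-in for aioprometheus's MetricsMiddleware:
# count HTTP requests by method and path using prometheus_client.
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
http_requests = Counter("http_requests", "HTTP requests seen by the server",
                        ["method", "path"], registry=registry)

class CountingMiddleware:
    """ASGI middleware that increments a counter for each HTTP request."""
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            http_requests.labels(method=scope["method"],
                                 path=scope["path"]).inc()
        await self.app(scope, receive, send)
```

With FastAPI/Starlette this would be installed via `app.add_middleware(CountingMiddleware)`; latency histograms and status-code labels would also need the response path instrumented, which is where it stops being trivial.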
Can you also pull the latest main branch and trigger the CI to run?
Hmm, it looks non-trivial to add a robust custom collector. Since we didn't document the HTTP metrics, I think it is fine for now.
Here is the source of
The test was still treating the metrics as global variables and trying to access them directly. I've updated the tests to access the metrics from inside the `StatsLogger`.
The next issue was that the metrics were being registered more than once across tests. The fix I implemented was to reset the registry.
This fix will unregister any vLLM collectors if they already exist in the registry.
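A sketch of that kind of reset (this leans on `REGISTRY._collector_to_names`, a private attribute of `prometheus_client`'s `CollectorRegistry`, so treat it as an assumption rather than a supported API):

```python
# Sketch: drop any previously registered "vllm:"-prefixed collectors so
# re-instantiating the metrics does not raise a duplicate-name error.
from prometheus_client import REGISTRY

def unregister_vllm_collectors():
    # _collector_to_names maps collector objects to their metric names
    # (a private attribute of CollectorRegistry -- an assumption here,
    # not a public API).
    for collector, names in list(REGISTRY._collector_to_names.items()):
        if any(name.startswith("vllm:") for name in names):
            REGISTRY.unregister(collector)
```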
@hmellor @simon-mo FYI with the prometheus library I've seen all
Currently, there is a PR to revert back to `aioprometheus`. In that PR I have proposed a workaround that removes the redirect and allows us to continue using `prometheus_client`.
As mentioned in #2316 (comment), `prometheus_client` has some advantages over `aioprometheus`.

This PR ports the metrics from `aioprometheus` to `prometheus_client`. Notable changes:

- The metrics are now defined in a `Metrics` class, which `StatsLogger` contains an instance of.
- Metrics from `prometheus_client` have their label names assigned at construction (no more global `labels` dictionary); label values are assigned using the `labels()` method.
- Arguments can be passed to the `LLM()` class via `kwargs` (this allows tests to lower `gpu_memory_utilization` on a case-by-case basis, which was useful when debugging issues with the metrics tests).