Replies: 3 comments
-
Yes, that would be cool, similar to https://docs.vllm.ai/en/latest/serving/metrics.html
-
For use in production, the bigger hurdle to clear is continuous batched inference, which is what would allow true multi-user usage. At this point Prometheus could give some insight into performance on the hardware you're running on, but you might as well just scale based on CPU utilization instead of more detailed usage stats. If llama-cpp-python reaches the point where we can use it reliably for on-premise inference, I'll gladly look into sensible Prometheus statistics to report so we can use them in KEDA.
-
Any update?
-
Hi folks! I love this project and have it generally working in Docker (built locally). I've got some ideas for giving it more operational durability (e.g. the issue I posted previously about stopping the model if a client disconnects, which looks like it needs llama.cpp support first), but one thing I'm a fan of is Prometheus metrics, particularly when running servers inside a Kubernetes environment. You can leverage them for custom horizontal-pod-autoscaler rules to address scalability and multi-user load.
I'm thinking about things like "is there an actively running model query?", "how many requests have we fielded?", and "how many completions vs. embeddings?".
I have enough experience with Prometheus itself to write the code for this myself, but I'm looking for guidance on how the application flow works in the API server. I take it this is based on FastAPI, although my Python is a bit rough and I've never implemented server backends in it before. Where would you recommend someone look first for adding "metrics hooks" into the API request code?
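For context, here's roughly the kind of hook I mean, sketched against a plain FastAPI app with prometheus_client. The metric names, labels, and middleware placement are my guesses for illustration, not how llama-cpp-python's server is actually structured:

```python
# Sketch only: a FastAPI app instrumented with prometheus_client.
# Metric names and labels here are hypothetical examples.
from fastapi import FastAPI, Request
from prometheus_client import Counter, Gauge, make_asgi_app

app = FastAPI()

# Total requests, labelled by path so completions vs. embeddings
# can be separated later in PromQL.
REQUESTS_TOTAL = Counter(
    "llama_requests_total",
    "Total API requests served",
    ["path"],
)

# How many model queries are in flight right now.
IN_FLIGHT = Gauge(
    "llama_requests_in_flight",
    "Requests currently being processed",
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    # Note: this also counts scrapes of /metrics itself; filter by
    # path here if that matters for your dashboards.
    REQUESTS_TOTAL.labels(path=request.url.path).inc()
    IN_FLIGHT.inc()
    try:
        return await call_next(request)
    finally:
        IN_FLIGHT.dec()

# Expose metrics in Prometheus text format at /metrics.
app.mount("/metrics", make_asgi_app())
```

If there's a better seam in the server's request flow to hang these on (per-endpoint handlers, the model call itself, etc.), that's exactly the guidance I'm after.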