
[Feature] Add system metrics collected during evaluation to eval_output #280

Open

acere opened this issue May 24, 2024 · 4 comments


acere commented May 24, 2024

It would be useful to collect system metrics, e.g. latency, during the evaluation and to provide a summary in the evaluation output.

@Zhenshan-Jin

@acere Thanks for bringing it up. Just to clarify,

  1. by system metrics, e.g. latency, do you mean the latency to call the evaluation model like detoxify or the overall latency to evaluate the model?
  2. what is your use case to leverage these metrics?

Thank you!

@athewsey
Contributor

From my perspective, I'd like fmeval to help more with profiling model latency and cost - so the most interesting metrics to store would be:

  • The latency of the LLM under test, for each invocation in the dataset, and summarized (with mean, p50, p90, p99) for the overall evaluation
  • The number of input and output tokens, for models that report this in their response (e.g. Claude 3 on Bedrock reports usage.input_tokens and usage.output_tokens; other Bedrock models also provide token counts, but in different keys of the response)

It's important to consider and compare model quality in the context of cost to run and response latency when making selection decisions. Although these factors are workload-sensitive, fmeval is at least running a dataset of representative examples through the model at speed, so while it's no substitute for a dedicated performance test, it could give a very useful initial indication of trade-offs between output quality and speed/cost.
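
To illustrate the first bullet, a rough sketch of capturing and summarizing per-invocation latency could look like the following (the function names here are illustrative, not part of fmeval's API):

```python
import time
import statistics


def timed_invoke(invoke, prompt):
    """Call the model under test and return (output, latency_seconds).

    `invoke` is any callable that sends a prompt to the model under test;
    the name is illustrative, not part of fmeval's API.
    """
    start = time.perf_counter()
    output = invoke(prompt)
    return output, time.perf_counter() - start


def summarize_latencies(latencies):
    """Aggregate per-record latencies into the summary stats suggested above."""
    # statistics.quantiles(n=100) returns 99 cut points: index 49 ~ p50, 89 ~ p90, 98 ~ p99
    q = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "mean": statistics.mean(latencies),
        "p50": q[49],
        "p90": q[89],
        "p99": q[98],
    }
```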

@xiaoyi-cheng
Contributor

Thanks for your feedback! We will add it to our roadmap and prioritize it.


acere commented May 29, 2024

Exactly as @athewsey indicates.
Technical metrics such as latency, time to first token, and tokens per second (the latter two would require using a streaming interface for models that support it) are usually also part of a model evaluation. Some of these metrics are already provided in the response from the service; for example, Anthropic Claude on Amazon Bedrock returns the number of input and output tokens in usage, while others can be inferred by timing the request and response.
