
[Feature] Add system metrics collected during evaluation to eval_output #280

Open

acere opened this issue May 24, 2024 · 4 comments


acere commented May 24, 2024

It would be useful to collect system metrics, e.g. latency, during the evaluation and to provide a summary in the evaluation output.

@Zhenshan-Jin

@acere Thanks for bringing it up. Just to clarify,

  1. by system metrics, e.g. latency, do you mean the latency to call the evaluation model like detoxify or the overall latency to evaluate the model?
  2. what is your use case to leverage these metrics?

Thank you!

@athewsey
Contributor

From my perspective, I'd like fmeval to help more with profiling model latency and cost - so the most interesting metrics to store would be:

  • The latency of the LLM under test, for each invocation in the dataset, and summarized (with mean, p50, p90, p99) for the overall evaluation
  • The number of input and output tokens, for models that report this in their response (e.g. Claude 3 on Bedrock reports usage.input_tokens and usage.output_tokens; other Bedrock models also provide token counts, but in different keys of the response)

It's important to consider and compare model quality in the context of cost to run and response latency when making selection decisions. Although these factors are workload-sensitive, fmeval is at least running a dataset of representative examples through the model at speed, so while it's no substitute for a dedicated performance test, it could give a very useful initial indication of trade-offs between output quality and speed/cost.
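
To illustrate the first bullet, a rough sketch of capturing and summarizing per-invocation latency could look like the following (the function names here are illustrative, not part of fmeval's API):

```python
import time
import statistics


def timed_invoke(invoke, prompt):
    """Call the model under test and return (output, latency_seconds).

    `invoke` is any callable that sends a prompt to the model under test;
    the name is illustrative, not part of fmeval's API.
    """
    start = time.perf_counter()
    output = invoke(prompt)
    return output, time.perf_counter() - start


def summarize_latencies(latencies):
    """Aggregate per-record latencies into the summary stats suggested above."""
    # statistics.quantiles(n=100) returns 99 cut points: index 49 ~ p50, 89 ~ p90, 98 ~ p99
    q = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "mean": statistics.mean(latencies),
        "p50": q[49],
        "p90": q[89],
        "p99": q[98],
    }
```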

@xiaoyi-cheng
Contributor

Thanks for your feedback! We will add it to our roadmap and prioritize it.


acere commented May 29, 2024

Exactly as @athewsey indicates.
Technical metrics such as latency, time to first token, and tokens per second (the latter two would require using a streaming interface for models that support it) are usually also part of a model evaluation. Some of these metrics are already provided in the response from the service; for example, Anthropic Claude on Amazon Bedrock returns the number of input and output tokens in usage, while others can be inferred by timing the request and response.
