A compilation of performance metrics for Large Language Models.
A model card, including evaluation details about performance, has been released for the Llama 3 family of models.
Extracted from the website:
This document contains additional context on the settings and parameters for how we evaluated the Llama 3 pre-trained and instruct-aligned models.
It is very hard to argue against this being the reference of references for science-based benchmarking of AI/ML in general.
Extracted from the website:
Building trusted, safe, and efficient AI requires better systems for measurement and accountability. MLCommons’ collective engineering with industry and academia continually measures and improves the accuracy, safety, speed, and efficiency of AI technologies.