Extend benchmarks with MATH subset #28

Merged: 4 commits from wip-benchmark-math into main, Jan 21, 2025
Conversation

@cstub (Contributor) commented on Jan 20, 2025:

• Add MATH subset of m-ric/smol_agents_benchmark dataset to benchmarks
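For readers unfamiliar with the dataset, here is a minimal sketch of how the MATH subset could be loaded with the Hugging Face `datasets` library. The split name (`train`) and the `source` column are assumptions about the dataset schema, not taken from this PR:

```python
from datasets import load_dataset

# Hypothetical sketch: load the benchmark and keep only the MATH rows.
# The split name ("train") and the "source" column are assumptions
# about the dataset schema, not confirmed by this PR.
ds = load_dataset("m-ric/smol_agents_benchmark", split="train")
math_subset = ds.filter(lambda row: row["source"] == "MATH")
print(f"{len(math_subset)} MATH tasks")
```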
@cstub force-pushed the wip-benchmark-math branch from 1bb4f2e to 5fa652d on January 20, 2025, 21:03
@cstub changed the title from "Add MATH subset of m-ric/smol_agents_benchmark to benchmarks" to "Extend benchmarks with MATH subset" on Jan 20, 2025
@krasserm (Member) left a comment:

Looks good, only minor change requests. Great to have that in!


[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)
@krasserm (Member) commented:

Title in the plot doesn't mention m-ric/smol_agents_benchmark. Maybe something like

    freeact performance on
    m-ric/agents_medium_benchmark_2 (GAIA, GSM8K, SimpleQA)
    m-ric/smol_agents_benchmark (MATH)

or similar, maybe with better formatting/alignment?
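A minimal matplotlib sketch of how such a multi-line title could be set; the figure and the styling choices are placeholders, not the project's actual plotting code:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... plot the benchmark bars here ...

# Hypothetical multi-line title following the suggestion above;
# font size and alignment are arbitrary choices.
ax.set_title(
    "freeact performance on\n"
    "m-ric/agents_medium_benchmark_2 (GAIA, GSM8K, SimpleQA)\n"
    "m-ric/smol_agents_benchmark (MATH)",
    fontsize=10,
    loc="center",
)
plt.show()
```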


[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from this README should also go into docs/evaluation.md

```diff
 | deepseek-v3 | SimpleQA | exact_match  | 60.0 |
 | deepseek-v3 | SimpleQA | llm_as_judge | 67.5 |

-When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
+When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

 [<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)
```

@krasserm (Member) commented:

We also need to upload the new pre-generated outputs.

@krasserm (Member) left a comment:

LGTM

@cstub merged commit 41acaff into main on Jan 21, 2025 (9 checks passed).
@cstub deleted the wip-benchmark-math branch on January 21, 2025, 09:43.