Extend benchmarks with MATH subset #28

Merged: 4 commits from wip-benchmark-math into main, Jan 21, 2025
Conversation

@cstub (Contributor) commented on Jan 20, 2025:

• Add MATH subset of m-ric/smol_agents_benchmark dataset to benchmarks
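For readers unfamiliar with the dataset, here is a minimal sketch of how the MATH subset could be loaded with the Hugging Face `datasets` library. The split name (`train`) and the `source` column are assumptions about the dataset schema, not taken from this PR:

```python
from datasets import load_dataset

# Hypothetical sketch: load the benchmark and keep only the MATH rows.
# The split name ("train") and the "source" column are assumptions
# about the dataset schema, not confirmed by this PR.
ds = load_dataset("m-ric/smol_agents_benchmark", split="train")
math_subset = ds.filter(lambda row: row["source"] == "MATH")
print(f"{len(math_subset)} MATH tasks")
```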
@cstub force-pushed the wip-benchmark-math branch from 1bb4f2e to 5fa652d on January 20, 2025, 21:03
@cstub changed the title from "Add MATH subset of m-ric/smol_agents_benchmark to benchmarks" to "Extend benchmarks with MATH subset" on Jan 20, 2025
@krasserm (Member) left a comment:

Looks good, only minor change requests. Great to have that in!


[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)
@krasserm (Member) commented:

Title in the plot doesn't mention m-ric/smol_agents_benchmark. Maybe something like

    freeact performance on
    m-ric/agents_medium_benchmark_2 (GAIA, GSM8K, SimpleQA)
    m-ric/smol_agents_benchmark (MATH)

or similar, maybe with better formatting/alignment?
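A minimal matplotlib sketch of how such a multi-line title could be set; the figure and the styling choices are placeholders, not the project's actual plotting code:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... plot the benchmark bars here ...

# Hypothetical multi-line title following the suggestion above;
# font size and alignment are arbitrary choices.
ax.set_title(
    "freeact performance on\n"
    "m-ric/agents_medium_benchmark_2 (GAIA, GSM8K, SimpleQA)\n"
    "m-ric/smol_agents_benchmark (MATH)",
    fontsize=10,
    loc="center",
)
plt.show()
```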


[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from this README should also go into docs/evaluation.md

```diff
 | deepseek-v3 | SimpleQA | exact_match  | 60.0 |
 | deepseek-v3 | SimpleQA | llm_as_judge | 67.5 |

-When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
+When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

 [<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)
```

@krasserm (Member) commented:

We also need to upload the new pre-generated outputs.

@krasserm (Member) left a comment:

LGTM

@cstub merged commit 41acaff into main on Jan 21, 2025 (9 checks passed).
@cstub deleted the wip-benchmark-math branch on January 21, 2025, 09:43.