Extend benchmarks with MATH subset
* Add MATH subset of m-ric/smol_agents_benchmark dataset to benchmarks
cstub committed Jan 20, 2025
1 parent 1a6241f commit 5fa652d
Showing 6 changed files with 32 additions and 13 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -108,11 +108,11 @@ We [evaluated](evaluation) `freeact` using five state-of-the-art models:
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:

[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)

Binary file modified docs/eval/eval-plot.png
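For readers skimming the README diff above: the "MATH subset" is simply the rows of `m-ric/smol_agents_benchmark` whose `source` column equals `"MATH"`. A minimal loading sketch, mirroring the dataset setup this commit adds to `evaluation/evaluate.py` further down:

```python
import datasets

# Combine the existing benchmark with the MATH tasks, as evaluate.py now does.
dataset = datasets.concatenate_datasets(
    [
        datasets.load_dataset("m-ric/agents_medium_benchmark_2")["train"],
        datasets.load_dataset("m-ric/smol_agents_benchmark")["test"].filter(
            lambda example: example["source"] == "MATH"
        ),
    ]
)
```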
9 changes: 7 additions & 2 deletions evaluation/README.md
@@ -8,34 +8,39 @@ We evaluated `freeact` using five state-of-the-art models:
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:

[<img src="../docs/eval/eval-plot.png" alt="Performance">](../docs/eval/eval-plot.png)

| model | subset | eval_protocol | % correct |
|:---------------------------|:---------|:----------------|----------:|
| claude-3-5-sonnet-20241022 | GAIA | exact_match | **53.1** |
| claude-3-5-sonnet-20241022 | GSM8K | exact_match | **95.7** |
| claude-3-5-sonnet-20241022 | MATH | exact_match | **90.0** |
| claude-3-5-sonnet-20241022 | SimpleQA | exact_match | **57.5** |
| claude-3-5-sonnet-20241022 | SimpleQA | llm_as_judge | **72.5** |
| claude-3-5-haiku-20241022 | GAIA | exact_match | 31.2 |
| claude-3-5-haiku-20241022 | GSM8K | exact_match | 90.0 |
| claude-3-5-haiku-20241022 | MATH | exact_match | 76.0 |
| claude-3-5-haiku-20241022 | SimpleQA | exact_match | 52.5 |
| claude-3-5-haiku-20241022 | SimpleQA | llm_as_judge | 70.0 |
| gemini-2.0-flash-exp | GAIA | exact_match | 34.4 |
| gemini-2.0-flash-exp | GSM8K | exact_match | **95.7** |
| gemini-2.0-flash-exp | MATH | exact_match | 88.0 |
| gemini-2.0-flash-exp | SimpleQA | exact_match | 50.0 |
| gemini-2.0-flash-exp | SimpleQA | llm_as_judge | 65.0 |
| qwen2p5-coder-32b-instruct | GAIA | exact_match | 25.0 |
| qwen2p5-coder-32b-instruct | GSM8K | exact_match | **95.7** |
| qwen2p5-coder-32b-instruct | MATH | exact_match | 88.0 |
| qwen2p5-coder-32b-instruct | SimpleQA | exact_match | 52.5 |
| qwen2p5-coder-32b-instruct | SimpleQA | llm_as_judge | 65.0 |
| deepseek-v3 | GAIA | exact_match | 37.5 |
| deepseek-v3 | GSM8K | exact_match | 91.4 |
| deepseek-v3 | MATH | exact_match | 88.0 |
| deepseek-v3 | SimpleQA | exact_match | 60.0 |
| deepseek-v3 | SimpleQA | llm_as_judge | 67.5 |

When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):

[<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)

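As an aside, a hypothetical sketch of how the `% correct` column in the table above could be aggregated from per-example scores. The file name and column layout (`model`, `source`, `eval_protocol`, and a boolean `correct`) are assumptions for illustration, not the repository's actual scoring output:

```python
import pandas as pd

# Hypothetical per-example results file; one JSON record per evaluated task.
results = pd.read_json("results.jsonl", lines=True)

summary = (
    results.groupby(["model", "source", "eval_protocol"])["correct"]
    .mean()  # fraction of correct answers per (model, subset, protocol)
    .mul(100)
    .round(1)
    .rename("% correct")
    .reset_index()
)
print(summary.to_markdown(index=False))
```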
17 changes: 12 additions & 5 deletions evaluation/evaluate.py
@@ -37,7 +37,7 @@
If you are asked for a comma separated list, apply the above rules depending on whether the element to be put in the list is a number or a string.
"""

GSM8K_NORMALIZATION_PROMPT = """
GSM8K_MATH_NORMALIZATION_PROMPT = """
Finish your answer with the following template:
FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should only be a number. Don't use units such as $ or percent sign.
"""
@@ -53,6 +53,7 @@ class EvaluationSubset(StrEnum):
GSM8K = "GSM8K"
SIMPLEQA = "SimpleQA"
GAIA = "GAIA"
MATH = "MATH"


@app.command()
@@ -84,8 +85,14 @@ async def amain(

print(f"Output directory: {output_run_dir.absolute()}")

dataset = datasets.load_dataset("m-ric/agents_medium_benchmark_2")
dataset = dataset["train"]
dataset = datasets.concatenate_datasets(
[
datasets.load_dataset("m-ric/agents_medium_benchmark_2")["train"],
datasets.load_dataset("m-ric/smol_agents_benchmark")["test"].filter(
lambda example: example["source"] == "MATH"
),
]
)

if subset is not None:
        _subset = str(subset) # convert to string to avoid datasets warning
@@ -134,8 +141,8 @@ async def evaluate_agent(

source = example["source"]
try:
if source == "GSM8K":
normalization_prompt = GSM8K_NORMALIZATION_PROMPT
if source in ["GSM8K", "MATH"]:
normalization_prompt = GSM8K_MATH_NORMALIZATION_PROMPT
elif source == "GAIA":
normalization_prompt = GAIA_NORMALIZATION_PROMPT
elif source == "SimpleQA":
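The hunk above stops right after `_subset` is derived, so the actual filtering code is not shown. A hypothetical continuation (an assumption, not part of this diff) illustrating how the new `MATH` member of `EvaluationSubset` would restrict the concatenated dataset:

```python
# Hypothetical continuation of the subset handling shown above (not in the diff):
if subset is not None:
    _subset = str(subset)  # convert to string to avoid a datasets warning
    dataset = dataset.filter(lambda example: example["source"] == _subset)
```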
12 changes: 9 additions & 3 deletions evaluation/report.py
@@ -39,7 +39,13 @@ def performance(
figsize=(10, 6),
palette="Blues_d",
hue="source_protocol",
hue_order=["GAIA (exact_match)", "GSM8K (exact_match)", "SimpleQA (exact_match)", "SimpleQA (llm_as_judge)"],
hue_order=[
"GAIA (exact_match)",
"GSM8K (exact_match)",
"MATH (exact_match)",
"SimpleQA (exact_match)",
"SimpleQA (llm_as_judge)",
],
title=f"freeact performance on {benchmark_display_name}",
output_file=output_dir / "eval-plot.png",
legend_location="top",
@@ -85,8 +91,8 @@ def create_barplot(
ax.spines["top"].set_visible(False)

if legend_location == "top":
plt.title(title, pad=50)
plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.05), loc="center", ncol=2)
plt.title(title, pad=70)
plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.10), loc="center", ncol=2)
else:
plt.title(title)
plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")
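For context on the legend changes above: with `MATH (exact_match)` as a fifth hue category, the two-column legend above the axes grows taller, which is why the title padding and legend anchor increase. A standalone sketch (not part of the commit) using the `claude-3-5-sonnet-20241022` values from the results table; the plotting calls are plain seaborn/matplotlib and the frame layout is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical single-model frame; values taken from the results table above.
df = pd.DataFrame(
    {
        "model": ["claude-3-5-sonnet-20241022"] * 5,
        "source_protocol": [
            "GAIA (exact_match)",
            "GSM8K (exact_match)",
            "MATH (exact_match)",
            "SimpleQA (exact_match)",
            "SimpleQA (llm_as_judge)",
        ],
        "correct": [53.1, 95.7, 90.0, 57.5, 72.5],
    }
)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df, x="model", y="correct", hue="source_protocol", palette="Blues_d", ax=ax)
ax.spines["top"].set_visible(False)

# A taller legend block above the plot needs more title padding and a higher anchor.
ax.set_title("freeact performance", pad=70)
ax.legend(fontsize=10, bbox_to_anchor=(0.5, 1.10), loc="center", ncol=2)
fig.savefig("eval-plot.png", bbox_inches="tight")
```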
3 changes: 2 additions & 1 deletion evaluation/score.py
@@ -41,6 +41,7 @@ def score(
score_dataset(results_dir, "SimpleQA", EvalProtocol.LLM_AS_JUDGE),
score_dataset(results_dir, "SimpleQA", EvalProtocol.EXACT_MATCH),
score_dataset(results_dir, "GSM8K", EvalProtocol.EXACT_MATCH),
score_dataset(results_dir, "MATH", EvalProtocol.EXACT_MATCH),
]
)
all_dfs.append(df)
Expand Down Expand Up @@ -125,7 +126,7 @@ def is_correct(example, simpleqa_scorer: SimpleQAScorer, eval_protocol: EvalProt
question = str(example["question"])

match example["source"]:
case "GSM8K":
case "GSM8K" | "MATH":
return get_question_score_gsm8k(answer, true_answer)
case "SimpleQA" if eval_protocol == EvalProtocol.LLM_AS_JUDGE:
return simpleqa_scorer.score(question, answer, true_answer)
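The diff routes `MATH` answers through the same scorer as GSM8K but does not show `get_question_score_gsm8k` itself. As a rough, hypothetical illustration of numeric exact-match scoring against the `FINAL ANSWER` template from `evaluation/evaluate.py` (the function names and tolerance below are assumptions, not the repository's code):

```python
import re


def extract_final_number(answer: str) -> float | None:
    """Pull the number following 'FINAL ANSWER:', per the GSM8K/MATH
    normalization prompt in evaluation/evaluate.py."""
    match = re.search(r"FINAL ANSWER:\s*([-+]?\d[\d,]*\.?\d*)", answer)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def numeric_exact_match(answer: str, true_answer: str) -> bool:
    """Hypothetical stand-in for an exact-match check on numeric answers."""
    predicted = extract_final_number(answer)
    if predicted is None:
        return False
    try:
        return abs(predicted - float(str(true_answer).replace(",", ""))) < 1e-6
    except ValueError:
        return False
```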
