Extend benchmarks with MATH subset #28

Merged 4 commits on Jan 21, 2025
README.md (4 changes: 2 additions & 2 deletions)
@@ -108,11 +108,11 @@ We [evaluated](evaluation) `freeact` using five state-of-the-art models:
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:

[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (the only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)
Review comment (Member):
Title in the plot doesn't mention m-ric/smol_agents_benchmark. Maybe something like

freeact performance on
m-ric/agents_medium_benchmark_2 (GAIA, GSM8K, SimpleQA)
m-ric/smol_agents_benchmark (MATH)

or similar, maybe with a better formatting/alignment?
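
A minimal sketch of the suggested multi-line title, assuming it is still passed to matplotlib's `plt.title` as in `report.py`; the exact wording and alignment are left to the author:

```python
import matplotlib.pyplot as plt

# Hypothetical sketch of the suggested title; matplotlib renders embedded
# newlines in a title string as separate, centered lines.
title = "\n".join(
    [
        "freeact performance on",
        "m-ric/agents_medium_benchmark_2 (GAIA, GSM8K, SimpleQA)",
        "m-ric/smol_agents_benchmark (MATH)",
    ]
)
plt.title(title, pad=70)  # extra top padding, as in report.py, leaves room for the legend
```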

Review comment (Member):
The changes from this README should also go into docs/evaluation.md


Binary file modified docs/eval/eval-plot.png
docs/evaluation.md (4 changes: 2 additions & 2 deletions)
@@ -8,13 +8,13 @@ We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `freeact` using five state-of-the-art models:
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:

<figure markdown>
[![architecture](eval/eval-plot.png){ align="left" }](eval/eval-plot.png){target="_blank"}
</figure>

When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (the only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):

<figure markdown>
[![architecture](eval/eval-plot-comparison.png){ width="60%" align="left" }](eval/eval-plot-comparison.png){target="_blank"}
evaluation/README.md (11 changes: 8 additions & 3 deletions)
@@ -8,34 +8,39 @@ We evaluated `freeact` using five state-of-the-art models:
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)

The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:

[<img src="../docs/eval/eval-plot.png" alt="Performance">](../docs/eval/eval-plot.png)

| model | subset | eval_protocol | % correct |
|:---------------------------|:---------|:----------------|----------:|
| claude-3-5-sonnet-20241022 | GAIA | exact_match | **53.1** |
| claude-3-5-sonnet-20241022 | GSM8K | exact_match | **95.7** |
| claude-3-5-sonnet-20241022 | MATH | exact_match | **90.0** |
| claude-3-5-sonnet-20241022 | SimpleQA | exact_match | **57.5** |
| claude-3-5-sonnet-20241022 | SimpleQA | llm_as_judge | **72.5** |
| claude-3-5-haiku-20241022 | GAIA | exact_match | 31.2 |
| claude-3-5-haiku-20241022 | GSM8K | exact_match | 90.0 |
| claude-3-5-haiku-20241022 | MATH | exact_match | 76.0 |
| claude-3-5-haiku-20241022 | SimpleQA | exact_match | 52.5 |
| claude-3-5-haiku-20241022 | SimpleQA | llm_as_judge | 70.0 |
| gemini-2.0-flash-exp | GAIA | exact_match | 34.4 |
| gemini-2.0-flash-exp | GSM8K | exact_match | **95.7** |
| gemini-2.0-flash-exp | MATH | exact_match | 88.0 |
| gemini-2.0-flash-exp | SimpleQA | exact_match | 50.0 |
| gemini-2.0-flash-exp | SimpleQA | llm_as_judge | 65.0 |
| qwen2p5-coder-32b-instruct | GAIA | exact_match | 25.0 |
| qwen2p5-coder-32b-instruct | GSM8K | exact_match | **95.7** |
| qwen2p5-coder-32b-instruct | MATH | exact_match | 88.0 |
| qwen2p5-coder-32b-instruct | SimpleQA | exact_match | 52.5 |
| qwen2p5-coder-32b-instruct | SimpleQA | llm_as_judge | 65.0 |
| deepseek-v3 | GAIA | exact_match | 37.5 |
| deepseek-v3 | GSM8K | exact_match | 91.4 |
| deepseek-v3 | MATH | exact_match | 88.0 |
| deepseek-v3 | SimpleQA | exact_match | 60.0 |
| deepseek-v3 | SimpleQA | llm_as_judge | 67.5 |

When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (the only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):

[<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)

Review comment (Member):
We also need to upload the new pre-generated outputs.

@@ -107,7 +112,7 @@ python evaluation/evaluate.py \
--run-id deepseek-v3
```

Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18476491/evaluation-results-agents-3_medium_benchmark_2.zip).
Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18488186/evaluation-results-agents-4_medium_benchmark_2.zip).

## Analysis

evaluation/evaluate.py (17 changes: 12 additions & 5 deletions)
@@ -37,7 +37,7 @@
If you are asked for a comma separated list, apply the above rules depending on whether the element to be put in the list is a number or a string.
"""

GSM8K_NORMALIZATION_PROMPT = """
GSM8K_MATH_NORMALIZATION_PROMPT = """
Finish your answer with the following template:
FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should only be a number. Don't use units such as $ or percent sign.
"""
@@ -53,6 +53,7 @@ class EvaluationSubset(StrEnum):
GSM8K = "GSM8K"
SIMPLEQA = "SimpleQA"
GAIA = "GAIA"
MATH = "MATH"


@app.command()
@@ -84,8 +85,14 @@ async def amain(

print(f"Output directory: {output_run_dir.absolute()}")

dataset = datasets.load_dataset("m-ric/agents_medium_benchmark_2")
dataset = dataset["train"]
dataset = datasets.concatenate_datasets(
[
datasets.load_dataset("m-ric/agents_medium_benchmark_2")["train"],
datasets.load_dataset("m-ric/smol_agents_benchmark")["test"].filter(
lambda example: example["source"] == "MATH"
),
]
)

if subset is not None:
_subset = str(subset) # convert to string avoid datasets warning
@@ -134,8 +141,8 @@ async def evaluate_agent(

source = example["source"]
try:
if source == "GSM8K":
normalization_prompt = GSM8K_NORMALIZATION_PROMPT
if source in ["GSM8K", "MATH"]:
normalization_prompt = GSM8K_MATH_NORMALIZATION_PROMPT
elif source == "GAIA":
normalization_prompt = GAIA_NORMALIZATION_PROMPT
elif source == "SimpleQA":
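For reference, a self-contained sketch of the dataset combination introduced above, assuming the dataset names, splits, and `source` column shown in the diff; handy for checking how many examples each benchmark contributes:

```python
from collections import Counter

import datasets

# Combine the GAIA/GSM8K/SimpleQA benchmark with the MATH subset, mirroring evaluate.py.
combined = datasets.concatenate_datasets(
    [
        datasets.load_dataset("m-ric/agents_medium_benchmark_2")["train"],
        datasets.load_dataset("m-ric/smol_agents_benchmark")["test"].filter(
            lambda example: example["source"] == "MATH"
        ),
    ]
)

# Number of tasks per source (GAIA, GSM8K, SimpleQA, MATH).
print(Counter(combined["source"]))
```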
evaluation/report.py (12 changes: 9 additions & 3 deletions)
@@ -39,7 +39,13 @@ def performance(
figsize=(10, 6),
palette="Blues_d",
hue="source_protocol",
hue_order=["GAIA (exact_match)", "GSM8K (exact_match)", "SimpleQA (exact_match)", "SimpleQA (llm_as_judge)"],
hue_order=[
"GAIA (exact_match)",
"GSM8K (exact_match)",
"MATH (exact_match)",
"SimpleQA (exact_match)",
"SimpleQA (llm_as_judge)",
],
title=f"freeact performance on {benchmark_display_name}",
output_file=output_dir / "eval-plot.png",
legend_location="top",
@@ -85,8 +91,8 @@ def create_barplot(
ax.spines["top"].set_visible(False)

if legend_location == "top":
plt.title(title, pad=50)
plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.05), loc="center", ncol=2)
plt.title(title, pad=70)
plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.10), loc="center", ncol=2)
else:
plt.title(title)
plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")
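As a rough illustration of what the extended `hue_order` controls, here is a toy seaborn call; only `source_protocol`, `hue_order`, and the `Blues_d` palette come from the diff, while the column names `model` and `% correct` and the single-model data frame (scores taken from the Claude 3.5 Sonnet rows of the results table) are made up for the example:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data: one model, five source/protocol combinations.
df = pd.DataFrame(
    {
        "model": ["claude-3-5-sonnet-20241022"] * 5,
        "% correct": [53.1, 95.7, 90.0, 57.5, 72.5],
        "source_protocol": [
            "GAIA (exact_match)",
            "GSM8K (exact_match)",
            "MATH (exact_match)",
            "SimpleQA (exact_match)",
            "SimpleQA (llm_as_judge)",
        ],
    }
)

# hue_order fixes both the bar order and the legend order, so MATH always
# appears between GSM8K and SimpleQA regardless of the order in the data.
sns.barplot(
    data=df,
    x="model",
    y="% correct",
    hue="source_protocol",
    hue_order=[
        "GAIA (exact_match)",
        "GSM8K (exact_match)",
        "MATH (exact_match)",
        "SimpleQA (exact_match)",
        "SimpleQA (llm_as_judge)",
    ],
    palette="Blues_d",
)
plt.tight_layout()
plt.show()
```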
evaluation/score.py (3 changes: 2 additions & 1 deletion)
@@ -41,6 +41,7 @@ def score(
score_dataset(results_dir, "SimpleQA", EvalProtocol.LLM_AS_JUDGE),
score_dataset(results_dir, "SimpleQA", EvalProtocol.EXACT_MATCH),
score_dataset(results_dir, "GSM8K", EvalProtocol.EXACT_MATCH),
score_dataset(results_dir, "MATH", EvalProtocol.EXACT_MATCH),
]
)
all_dfs.append(df)
@@ -125,7 +126,7 @@ def is_correct(example, simpleqa_scorer: SimpleQAScorer, eval_protocol: EvalProt
question = str(example["question"])

match example["source"]:
case "GSM8K":
case "GSM8K" | "MATH":
return get_question_score_gsm8k(answer, true_answer)
case "SimpleQA" if eval_protocol == EvalProtocol.LLM_AS_JUDGE:
return simpleqa_scorer.score(question, answer, true_answer)
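With this change, MATH answers are routed through the same scorer as GSM8K. The actual `get_question_score_gsm8k` is not part of this diff; the following is a purely hypothetical illustration of the kind of numeric exact-match check it stands for:

```python
def numeric_exact_match(answer: str, true_answer: str, tol: float = 1e-6) -> bool:
    """Hypothetical numeric exact match: both strings must parse as numbers that agree within tol."""
    try:
        return abs(float(answer) - float(true_answer)) <= tol
    except (TypeError, ValueError):
        return False


# Example: "90", "90.0" and "90.000001" all match the reference answer "90".
assert numeric_exact_match("90.0", "90")
```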