gradion-ai · cstub · Jan 21, 2025 · Jan 20, 2025 · Jan 21, 2025 · Jan 21, 2025
diff --git a/README.md b/README.md
@@ -108,11 +108,11 @@ We [evaluated](evaluation) `freeact` using five state-of-the-art models:
 - Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
 - DeepSeek V3 (`deepseek-v3`)
 
-The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
+The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face; together they contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:
 
 [<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)
 
-When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
+When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (the only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):
 
 [<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)
 

diff --git a/docs/eval/eval-plot.png b/docs/eval/eval-plot.png
diff --git a/docs/evaluation.md b/docs/evaluation.md
@@ -8,13 +8,13 @@ We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `free
 - Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
 - DeepSeek V3 (`deepseek-v3`)
 
-The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
+The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face; together they contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:
 
 <figure markdown>
   [![architecture](eval/eval-plot.png){ align="left" }](eval/eval-plot.png){target="_blank"}
 </figure>
 
-When comparing our results with smolagents using Claude 3.5 Sonnet, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
+When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (the only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):
 
 <figure markdown>
   [![architecture](eval/eval-plot-comparison.png){ width="60%" align="left" }](eval/eval-plot-comparison.png){target="_blank"}

diff --git a/evaluation/README.md b/evaluation/README.md
@@ -8,34 +8,39 @@ We evaluated `freeact` using five state-of-the-art models:
 - Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
 - DeepSeek V3 (`deepseek-v3`)
 
-The evaluation was performed on the [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) dataset, developed by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face. It comprises selected tasks from GAIA, GSM8K, and SimpleQA:
+The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face; together they contain selected tasks from GAIA, GSM8K, SimpleQA, and MATH:
 
 [<img src="../docs/eval/eval-plot.png" alt="Performance">](../docs/eval/eval-plot.png)
 
 | model                      | subset   | eval_protocol   | % correct |
 |:---------------------------|:---------|:----------------|----------:|
 | claude-3-5-sonnet-20241022 | GAIA     | exact_match     |  **53.1** |
 | claude-3-5-sonnet-20241022 | GSM8K    | exact_match     |  **95.7** |
+| claude-3-5-sonnet-20241022 | MATH     | exact_match     |  **90.0** |
 | claude-3-5-sonnet-20241022 | SimpleQA | exact_match     |  **57.5** |
 | claude-3-5-sonnet-20241022 | SimpleQA | llm_as_judge    |  **72.5** |
 | claude-3-5-haiku-20241022  | GAIA     | exact_match     |      31.2 |
 | claude-3-5-haiku-20241022  | GSM8K    | exact_match     |      90.0 |
+| claude-3-5-haiku-20241022  | MATH     | exact_match     |      76.0 |
 | claude-3-5-haiku-20241022  | SimpleQA | exact_match     |      52.5 |
 | claude-3-5-haiku-20241022  | SimpleQA | llm_as_judge    |      70.0 |
 | gemini-2.0-flash-exp       | GAIA     | exact_match     |      34.4 |
 | gemini-2.0-flash-exp       | GSM8K    | exact_match     |  **95.7** |
+| gemini-2.0-flash-exp       | MATH     | exact_match     |      88.0 |
 | gemini-2.0-flash-exp       | SimpleQA | exact_match     |      50.0 |
 | gemini-2.0-flash-exp       | SimpleQA | llm_as_judge    |      65.0 |
 | qwen2p5-coder-32b-instruct | GAIA     | exact_match     |      25.0 |
 | qwen2p5-coder-32b-instruct | GSM8K    | exact_match     |  **95.7** |
+| qwen2p5-coder-32b-instruct | MATH     | exact_match     |      88.0 |
 | qwen2p5-coder-32b-instruct | SimpleQA | exact_match     |      52.5 |
 | qwen2p5-coder-32b-instruct | SimpleQA | llm_as_judge    |      65.0 |
 | deepseek-v3                | GAIA     | exact_match     |      37.5 |
 | deepseek-v3                | GSM8K    | exact_match     |      91.4 |
+| deepseek-v3                | MATH     | exact_match     |      88.0 |
 | deepseek-v3                | SimpleQA | exact_match     |      60.0 |
 | deepseek-v3                | SimpleQA | llm_as_judge    |      67.5 |
 
-When comparing our results with smolagents using `claude-3-5-sonnet-20241022`, we observed the following outcomes (evaluation conducted on 2025-01-07, reference data [here](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)):
+When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (the only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):
 
 [<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)
 
@@ -107,7 +112,7 @@ python evaluation/evaluate.py \
     --run-id deepseek-v3
 ```
 
-Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18476491/evaluation-results-agents-3_medium_benchmark_2.zip).
+Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18488186/evaluation-results-agents-4_medium_benchmark_2.zip).
 
 ## Analysis
 

diff --git a/evaluation/evaluate.py b/evaluation/evaluate.py
@@ -37,7 +37,7 @@
 If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
 """
 
-GSM8K_NORMALIZATION_PROMPT = """
+GSM8K_MATH_NORMALIZATION_PROMPT = """
 Finish your answer with the following template:
 FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should only be a number. Don't use units such as $ or percent sign.
 """
@@ -53,6 +53,7 @@ class EvaluationSubset(StrEnum):
     GSM8K = "GSM8K"
     SIMPLEQA = "SimpleQA"
     GAIA = "GAIA"
+    MATH = "MATH"
 
 
 @app.command()
@@ -84,8 +85,14 @@ async def amain(
 
     print(f"Output directory: {output_run_dir.absolute()}")
 
-    dataset = datasets.load_dataset("m-ric/agents_medium_benchmark_2")
-    dataset = dataset["train"]
+    dataset = datasets.concatenate_datasets(
+        [
+            datasets.load_dataset("m-ric/agents_medium_benchmark_2")["train"],
+            datasets.load_dataset("m-ric/smol_agents_benchmark")["test"].filter(
+                lambda example: example["source"] == "MATH"
+            ),
+        ]
+    )
 
     if subset is not None:
         _subset = str(subset)  # convert to string avoid datasets warning
@@ -134,8 +141,8 @@ async def evaluate_agent(
 
         source = example["source"]
         try:
-            if source == "GSM8K":
-                normalization_prompt = GSM8K_NORMALIZATION_PROMPT
+            if source in ["GSM8K", "MATH"]:
+                normalization_prompt = GSM8K_MATH_NORMALIZATION_PROMPT
             elif source == "GAIA":
                 normalization_prompt = GAIA_NORMALIZATION_PROMPT
             elif source == "SimpleQA":

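Review note on `evaluation/evaluate.py`: the hunk above routes `MATH` examples through the same normalization prompt as `GSM8K`. As the number of sources grows, the `if`/`elif` chain could be expressed as a lookup table. The following is only a sketch of that idea, not the code in this PR. `select_normalization_prompt`, `NORMALIZATION_PROMPTS`, and `SIMPLEQA_NORMALIZATION_PROMPT` are hypothetical names, the GAIA and SimpleQA prompt bodies are placeholders because the diff does not show them in full, and raising `ValueError` for an unknown source is an assumption about the desired behavior.

```python
# Sketch only: a table-driven alternative to the if/elif chain in evaluate_agent.
# The GSM8K/MATH prompt text is copied from the diff; the other two prompt
# bodies are placeholders, not the repository's actual prompts.

GSM8K_MATH_NORMALIZATION_PROMPT = """
Finish your answer with the following template:
FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should only be a number. Don't use units such as $ or percent sign.
"""
GAIA_NORMALIZATION_PROMPT = "<GAIA prompt, not shown in full in the diff>"
SIMPLEQA_NORMALIZATION_PROMPT = "<SimpleQA prompt, not shown in the diff>"  # hypothetical name

# Hypothetical lookup table: one entry per value of example["source"].
NORMALIZATION_PROMPTS = {
    "GSM8K": GSM8K_MATH_NORMALIZATION_PROMPT,
    "MATH": GSM8K_MATH_NORMALIZATION_PROMPT,  # MATH shares the numeric-answer prompt
    "GAIA": GAIA_NORMALIZATION_PROMPT,
    "SimpleQA": SIMPLEQA_NORMALIZATION_PROMPT,
}


def select_normalization_prompt(source: str) -> str:
    """Return the normalization prompt for a dataset source.

    Raising on unknown sources is an assumption made for this sketch.
    """
    try:
        return NORMALIZATION_PROMPTS[source]
    except KeyError:
        raise ValueError(f"Unknown evaluation source: {source!r}") from None
```

With this shape, adding a further subset would mean adding one dictionary entry rather than another branch.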
diff --git a/evaluation/report.py b/evaluation/report.py
@@ -39,7 +39,13 @@ def performance(
         figsize=(10, 6),
         palette="Blues_d",
         hue="source_protocol",
-        hue_order=["GAIA (exact_match)", "GSM8K (exact_match)", "SimpleQA (exact_match)", "SimpleQA (llm_as_judge)"],
+        hue_order=[
+            "GAIA (exact_match)",
+            "GSM8K (exact_match)",
+            "MATH (exact_match)",
+            "SimpleQA (exact_match)",
+            "SimpleQA (llm_as_judge)",
+        ],
         title=f"freeact performance on {benchmark_display_name}",
         output_file=output_dir / "eval-plot.png",
         legend_location="top",
@@ -85,8 +91,8 @@ def create_barplot(
     ax.spines["top"].set_visible(False)
 
     if legend_location == "top":
-        plt.title(title, pad=50)
-        plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.05), loc="center", ncol=2)
+        plt.title(title, pad=70)
+        plt.legend(fontsize=10, bbox_to_anchor=(0.5, 1.10), loc="center", ncol=2)
     else:
         plt.title(title)
         plt.legend(fontsize=10, bbox_to_anchor=(1.05, 0.5), loc="center left")

diff --git a/evaluation/score.py b/evaluation/score.py
@@ -41,6 +41,7 @@ def score(
                 score_dataset(results_dir, "SimpleQA", EvalProtocol.LLM_AS_JUDGE),
                 score_dataset(results_dir, "SimpleQA", EvalProtocol.EXACT_MATCH),
                 score_dataset(results_dir, "GSM8K", EvalProtocol.EXACT_MATCH),
+                score_dataset(results_dir, "MATH", EvalProtocol.EXACT_MATCH),
             ]
         )
         all_dfs.append(df)
@@ -125,7 +126,7 @@ def is_correct(example, simpleqa_scorer: SimpleQAScorer, eval_protocol: EvalProt
     question = str(example["question"])
 
     match example["source"]:
-        case "GSM8K":
+        case "GSM8K" | "MATH":
             return get_question_score_gsm8k(answer, true_answer)
         case "SimpleQA" if eval_protocol == EvalProtocol.LLM_AS_JUDGE:
             return simpleqa_scorer.score(question, answer, true_answer)
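Review note on `evaluation/score.py`: the hunk above scores `MATH` answers with `get_question_score_gsm8k`, whose body is not part of this diff. To make the consequence of that routing concrete, here is a minimal sketch of what a numeric exact-match comparison could look like. It is not the repository's implementation; `extract_last_number` and `numeric_exact_match` are hypothetical helpers, and the regular expression and tolerance are assumptions.

```python
import re
from typing import Optional

# Assumed number format: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_last_number(text: str) -> Optional[float]:
    """Return the last number found in text, ignoring thousands separators."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def numeric_exact_match(answer: str, true_answer: str, tol: float = 1e-9) -> bool:
    """Compare the last number in each string; False if either has none."""
    predicted = extract_last_number(answer)
    expected = extract_last_number(true_answer)
    if predicted is None or expected is None:
        return False
    return abs(predicted - expected) <= tol
```

Under a scheme like this, an answer written as a fraction such as `3/4` would not match a reference of `0.75`, which is consistent with the normalization prompt asking the model to reply with a plain number. Whether the actual scorer behaves the same way would need to be checked against `get_question_score_gsm8k`.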