πŸ… Open Agent Leaderboard

πŸ€— HF Leaderboard

πŸŽ‰ Updates

  • 2025/2/11: Added deepseek-r1:1.5b, a new dataset (MATH-500), and a new algorithm (ToT) to the leaderboard.
  • 2025/1/23: Added gpt-4o, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, Qwen2-1.5B-Instruct, Qwen2-0.5B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, and Internllm2_5-7B to the leaderboard.
  • 2025/1/07: The Open Agent Leaderboard was released.

πŸ“– Introduction

This project aims to provide a fair comparison of various agents by evaluating their performance on different datasets and LLMs. Built on top of the OmAgent framework, it allows for simple, quick, and accurate assessments of agents.

Supported benchmark datasets:

  • gsm8k
  • AQuA
  • MATH-500

Supported algorithms:

  • IO (Input-Output)
  • CoT (Chain-of-Thought)
  • SC-CoT (Self-Consistency Chain-of-Thought)
  • PoT (Program-of-Thought)
  • ReAct-Pro*
  • ToT (Tree-of-Thought)

Supported LLMs:

  • gpt-3.5-turbo
  • gpt-4o
  • Doubao-lite-32k
  • Qwen2.5-72B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen2-1.5B-Instruct
  • Qwen2-0.5B-Instruct
  • Llama-3.3-70B-Instruct
  • Llama-3.1-8B-Instruct
  • Internllm2_5-7B
  • deepseek-r1:1.5b

πŸ… Leaderboards

Math tasks

| Rank | Algorithm | LLM | Eval Date | Avg Score | gsm8k Score | gsm8k Cost($) | AQuA Score | AQuA Cost($) | MATH-500 Score | MATH-500 Cost($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 86.43 | 92.87 | 0.7195 | 86.22 | 0.0808 | 80.2 | 0.349 |
| 2 | SC-CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 84.3 | 93.86 | 5.9858 | 85.04 | 1.0348 | 74 | 3.1556 |
| 3 | SC-CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 83.85 | 95.07 | 6.2005 | 82.28 | 1.0756 | 74.2 | 3.2239 |
| 4 | CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 82.86 | 93.93 | 0.687 | 83.46 | 0.0927 | 71.2 | 0.3463 |
| 5 | CoT | gpt-4o | 2025/1/22 | 81.59 | 94.09 | 4.5367 | 82.68 | 1.0417 | 68 | 3.0569 |
| 6 | IO | Llama-3.3-70B-Instruct | 2025/1/22 | 81.45 | 92.27 | 0.4709 | 82.68 | 0.0798 | 69.4 | 0.2386 |
| 7 | IO | Qwen2.5-72B-Instruct | 2025/1/22 | 80.34 | 86.58 | 0.4899 | 84.25 | 0.0742 | 70.2 | 0.2506 |
| 8 | SC-CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 79.35 | 91.13 | 0 | 79.92 | 0 | 67 | 0 |
| 9 | CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 78.73 | 85.67 | 0 | 80.71 | 0 | 69.8 | 0 |
| 10 | ReAct-Pro* | Llama-3.3-70B-Instruct | 2025/1/22 | 77.12 | 87.64 | 10.1124 | 79.13 | 0.768 | 64.6 | 3.1806 |
| 11 | CoT | Doubao-lite-32k | 2025/1/7 | 77 | 89.31 | 0.0558 | 82.68 | 0.0066 | 59 | 0.0255 |
| 12 | ReAct-Pro* | Qwen2.5-72B-Instruct | 2025/1/22 | 74.43 | 87.26 | 10.5479 | 73.23 | 0.3177 | 62.8 | 3.4541 |
| 13 | SC-CoT | Doubao-lite-32k | 2025/1/7 | 72.52 | 87.26 | 0.2083 | 81.1 | 0.0519 | 49.2 | 0.1406 |
| 14 | PoT | Qwen2.5-72B-Instruct | 2025/1/22 | 71.58 | 92.34 | 0.7054 | 75.2 | 0.1645 | 47.2 | 0.233 |
| 15 | PoT | gpt-4o | 2025/1/22 | 71.5 | 93.1 | 4.2166 | 75.2 | 1.6087 | 46.2 | 1.5994 |
| 16 | SC-CoT | gpt-4o | 2025/1/22 | 70.44 | 90.3 | 31.0542 | 86.61 | 8.1485 | 34.4 | 19.6538 |
| 17 | ReAct-Pro* | Doubao-lite-32k | 2025/1/7 | 70.12 | 85.6 | 0.2512 | 77.56 | 0.0445 | 47.2 | 0.186 |
| 18 | ReAct-Pro* | Qwen2.5-7B-Instruct | 2025/1/22 | 68.69 | 82.87 | 0 | 74.41 | 0 | 48.8 | 0 |
| 19 | IO | gpt-4o | 2025/1/22 | 68.6 | 88.4 | 3.3463 | 75.59 | 1.1453 | 41.8 | 2.7907 |
| 20 | IO | Qwen2.5-7B-Instruct | 2025/1/22 | 65.13 | 57.24 | 0 | 78.74 | 0 | 59.4 | 0 |
| 21 | PoT | Llama-3.3-70B-Instruct | 2025/1/22 | 65.07 | 73.09 | 0.9736 | 79.53 | 0.1746 | 42.6 | 0.2839 |
| 22 | CoT | deepseek-r1:1.5b | 2025/1/23 | 63.9 | 70.66 | 0 | 71.65 | 0 | 49.4 | 0 |
| 23 | IO | Doubao-lite-32k | 2025/1/7 | 62.85 | 72.02 | 0.0354 | 79.13 | 0.0058 | 37.4 | 0.0187 |
| 24 | PoT | Doubao-lite-32k | 2025/1/7 | 61.29 | 79.61 | 0.0576 | 71.65 | 0.0147 | 32.6 | 0.0144 |
| 25 | ToT | Qwen2.5-72B-Instruct | 2025/1/22 | 60.26 | 88.88 | 23.5911 | 81.1 | 3.7389 | 10.8 | 9.0421 |
| 26 | CoT | gpt-3.5-turbo | 2025/1/7 | 59.84 | 78.7 | 0.6788 | 61.02 | 0.0957 | 39.8 | 0.3189 |
| 27 | CoT | Internllm2_5-7B | 2025/1/22 | 59.02 | 77.71 | 0 | 52.76 | 0 | 46.6 | 0 |
| 28 | IO | deepseek-r1:1.5b | 2025/1/22 | 58.95 | 64.14 | 0 | 68.9 | 0 | 43.8 | 0 |
| 29 | ToT | Llama-3.3-70B-Instruct | 2025/1/22 | 58.79 | 91.89 | 20.8753 | 83.07 | 2.9404 | 1.4 | 8.2699 |
| 30 | ToT | gpt-4o | 2025/1/22 | 58.61 | 91.13 | 86.8581 | 81.5 | 8.5295 | 3.2 | 40.8094 |
| 31 | SC-CoT | gpt-3.5-turbo | 2025/1/7 | 58.28 | 79.91 | 3.3938 | 66.14 | 0.7888 | 28.8 | 1.9764 |
| 32 | ReAct-Pro* | gpt-4o | 2025/1/22 | 58.26 | 63.31 | 39.0751 | 57.48 | 2.304 | 54 | 17.7735 |
| 33 | PoT | Qwen2.5-7B-Instruct | 2025/1/22 | 55.51 | 58.83 | 0 | 68.11 | 0 | 39.6 | 0 |
| 34 | PoT | gpt-3.5-turbo | 2025/1/7 | 55.04 | 76.88 | 0.6902 | 59.45 | 0.1748 | 28.8 | 0.168 |
| 35 | ReAct-Pro* | gpt-3.5-turbo | 2025/1/7 | 54.43 | 74.91 | 3.4633 | 64.57 | 0.4928 | 23.8 | 2.0406 |
| 36 | SC-CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 54.37 | 73.46 | 0 | 59.45 | 0 | 30.2 | 0 |
| 37 | CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 53.96 | 75.44 | 0 | 60.63 | 0 | 25.8 | 0 |
| 38 | SC-CoT | deepseek-r1:1.5b | 2025/2/10 | 50.8 | 55.34 | 0 | 59.06 | 0 | 38 | 0 |
| 39 | ReAct-Pro* | Llama-3.1-8B-Instruct | 2025/1/22 | 50.7 | 67.78 | 0 | 55.51 | 0 | 28.8 | 0 |
| 40 | IO | Llama-3.1-8B-Instruct | 2025/1/22 | 48.98 | 57.16 | 0 | 51.18 | 0 | 38.6 | 0 |
| 41 | ToT | gpt-3.5-turbo | 2025/1/7 | 44.94 | 67.93 | 9.1707 | 57.09 | 1.1513 | 9.8 | 5.2914 |
| 42 | ToT | Qwen2.5-7B-Instruct | 2025/1/22 | 42.52 | 72.21 | 0 | 53.94 | 0 | 1.4 | 0 |
| 43 | ToT | Llama-3.1-8B-Instruct | 2025/1/22 | 41.97 | 65.05 | 0 | 59.06 | 0 | 1.8 | 0 |
| 44 | ReAct-Pro* | deepseek-r1:1.5b | 2025/2/10 | 38.22 | 35.94 | 0 | 54.33 | 0 | 24.4 | 0 |
| 45 | CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 37.08 | 55.5 | 0 | 40.55 | 0 | 15.2 | 0 |
| 46 | PoT | Llama-3.1-8B-Instruct | 2025/1/22 | 33.56 | 38.67 | 0 | 36.61 | 0 | 25.4 | 0 |
| 47 | SC-CoT | Internllm2_5-7B | 2025/1/22 | 32.46 | 48.22 | 0 | 39.37 | 0 | 9.8 | 0 |
| 48 | IO | gpt-3.5-turbo | 2025/1/7 | 31.34 | 37.83 | 0.3328 | 38.98 | 0.038 | 17.2 | 0.2436 |
| 49 | PoT | Internllm2_5-7B | 2025/1/22 | 29.94 | 38.21 | 0 | 36.61 | 0 | 15 | 0 |
| 50 | ReAct-Pro* | Internllm2_5-7B | 2025/1/22 | 29.75 | 33.51 | 0 | 40.94 | 0 | 14.8 | 0 |
| 51 | ToT | Doubao-lite-32k | 2025/1/7 | 28.1 | 37.83 | 0.8739 | 45.28 | 0.0881 | 1.2 | 0.2371 |
| 52 | IO | Internllm2_5-7B | 2025/1/22 | 27.35 | 11.6 | 0 | 47.64 | 0 | 22.8 | 0 |
| 53 | CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 25.07 | 35.94 | 0 | 33.07 | 0 | 6.2 | 0 |
| 54 | PoT | deepseek-r1:1.5b | 2025/2/10 | 22.54 | 11.9 | 0 | 54.72 | 0 | 1 | 0 |
| 55 | ReAct-Pro* | Qwen2-1.5B-Instruct | 2025/1/22 | 19.55 | 24.87 | 0 | 25.59 | 0 | 8.2 | 0 |
| 56 | ToT | Internllm2_5-7B | 2025/1/22 | 18.96 | 20.85 | 0 | 35.83 | 0 | 0.2 | 0 |
| 57 | IO | Qwen2-1.5B-Instruct | 2025/1/22 | 17.6 | 16.68 | 0 | 29.13 | 0 | 7 | 0 |
| 58 | ToT | Qwen2-1.5B-Instruct | 2025/1/22 | 17.31 | 19.64 | 0 | 31.5 | 0 | 0.8 | 0 |
| 59 | PoT | Qwen2-1.5B-Instruct | 2025/1/22 | 16.67 | 18.5 | 0 | 30.71 | 0 | 0.8 | 0 |
| 60 | ToT | deepseek-r1:1.5b | 2025/2/10 | 16.11 | 23.12 | 0 | 24.8 | 0 | 0.4 | 0 |
| 61 | IO | Qwen2-0.5B-Instruct | 2025/1/22 | 14.83 | 14.71 | 0 | 27.17 | 0 | 2.6 | 0 |
| 62 | SC-CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 13.06 | 11.75 | 0 | 23.62 | 0 | 3.8 | 0 |
| 63 | ReAct-Pro* | Qwen2-0.5B-Instruct | 2025/1/22 | 10.76 | 7.66 | 0 | 24.02 | 0 | 0.6 | 0 |
| 64 | ToT | Qwen2-0.5B-Instruct | 2025/1/22 | 9.97 | 0 | 0 | 29.92 | 0 | 0 | 0 |
| 65 | PoT | Qwen2-0.5B-Instruct | 2025/1/22 | 8.98 | 9.63 | 0 | 17.32 | 0 | 0 | 0 |
| 66 | SC-CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 8.43 | 1.67 | 0 | 22.83 | 0 | 0.8 | 0 |

Evaluation details can be found in the Evaluation Details section and on the Hugging Face leaderboard.

  • IO (Input-Output) is the baseline method that directly prompts the model with the question and expects an answer without any intermediate reasoning steps. It represents the most basic way of using language models and serves as a reference point for evaluating the effectiveness of other algorithms.

  • ReAct-Pro*: We modified ReAct to ReAct-Pro, following the Reflexion repository. A comparison with the original ReAct repository can be found in the Comparison of ReAct with ReAct-Pro section.
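
  • Avg Score is the unweighted mean of the three per-dataset scores; for example, the rank-1 row works out to (92.87 + 86.22 + 80.2) / 3 ≈ 86.43.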

Leaderboard Visualization

πŸ› οΈ How to Install

  1. Clone the repository:

    git clone https://github.com/om-ai-lab/open-agent-leaderboard.git
    cd open-agent-leaderboard
  2. Install dependencies:

    pip install -r requirements.txt

πŸ—οΈ How to Evaluate Agents

Step 1. Implement your agent in the OmAgent repository

Navigate to the agent repository:

git clone https://github.com/om-ai-lab/OmAgent.git
cd OmAgent

Set up the environment:

pip install -e omagent-core

Implement your agent in the OmAgent repository; see the examples/cot folder for a reference implementation.

Step 2. Run inference in the OmAgent repository

Run the inference script (using CoT as an example):

cd examples/cot
python eval_demo.py --model_id your_model_id --dataset_name your_dataset_name --dataset_path your_dataset_path --output_path your_output_path --output_name your_output_name --cot_method your_cot_method
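
For instance, a hypothetical CoT run on gsm8k could look like the following. The flags are the ones shown above; every value here is a placeholder to adapt to your own setup:

    python eval_demo.py \
        --model_id gpt-3.5-turbo \
        --dataset_name gsm8k \
        --dataset_path data/gsm8k.jsonl \
        --output_path outputs \
        --output_name gsm8k_results_cot \
        --cot_method few_shot_cot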

Output Format

The output results are saved in JSON format and include the following fields:

  • id: The unique identifier of the sample.
  • question: The input question provided to the model.
  • last_output: The raw output generated by the model.
  • output_postprocess (optional): The processed output after cleansing.
  • ground_truth (optional): The correct answer for the sample.
  • prompt_tokens: The number of tokens in the input prompt.
  • completion_tokens: The number of tokens in the model's output.

Example of an output JSON file:

{
  "dataset": "gsm8k",
  "model_id": "gpt-3.5-turbo",
  "alg": "COT",
  "model_result": [
    {
      "id": 1,
      "question": "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today.....",
      "last_output": "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and uses 4 to bake muffins,...",
      "output_postprocess": "Paris",
      "ground_truth": "Paris",
      "prompt_tokens": 10,
      "completion_tokens": 5
    }
  ]
}
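
As a rough illustration of how such a results file can be scored, here is a minimal Python sketch. It is not the project's evaluator: it assumes the field names shown above, a placeholder file path, and plain exact-match comparison (the real evaluation may normalize answers first).

    import json

    # Placeholder path; point it at a results file in the format shown above.
    with open("outputs/gsm8k_results_cot.json") as f:
        results = json.load(f)["model_result"]

    # Exact-match scoring of the post-processed output against the ground truth.
    correct = sum(
        1 for r in results if r.get("output_postprocess") == r.get("ground_truth")
    )
    print(f"accuracy: {100 * correct / len(results):.2f}%")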

Step 3. Evaluate inference results

Run the main script to perform evaluations:

python main.py --dataset <dataset_name> --model <model_name> --method <method_name> --output_dir <output_directory>

Parameters

  • --random_seed: Random seed, default is 1.
  • --dataset: Dataset to use, options are aqua, gsm8k, math500.
  • --minibatch_size: Minibatch size, default is 1.
  • --max_num_worker: Maximum number of workers for the data loader, default is 4.
  • --model: Model used for decoding, options are gpt-4o-mini, gpt-4o, gpt-3.5-turbo.
  • --method: Method, options are zero_shot, zero_shot_cot, few_shot, few_shot_cot.
  • --cot_trigger_no: Trigger sentence number for chain of thought, default is 1.
  • --max_length: Maximum length of model output, default is 2048.
  • --max_length_direct: Maximum length of direct model answer, default is 32.
  • --limit_dataset_size: Whether to limit the test dataset size, default is 0 (no limit).
  • --output_dir: Output directory, default is ./outputs/.
  • --output_path: Output path, default is empty.
  • --agent: Agent used for the experiment, options are cot, pot, sc_cot, react.
  • --system_prompt: System prompt, default is empty.
  • --openai_api_key: OpenAI API key, default is empty.
  • --openai_url: OpenAI API URL, default is https://api.openai.com/v1.

Example

python main.py --output_path example/gsm8k_results_cot.json --dataset gsm8k --method few_shot_cot

Evaluation details

| Algorithm | Dataset | Eval Date | LLM | Score | Pass rate | X-shot | Parameters | Samples | Total input tokens | Average input tokens | Total output tokens | Average output tokens | All tokens | Cost($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IO | gsm8k | 2025/1/7 | gpt-3.5-turbo | 37.83 | 99.92 | 8 | | 1,319 | 546,990 | 415 | 39,563 | 30 | 586,553 | 0.3328 |
| IO | gsm8k | 2025/1/7 | Doubao-lite-32k | 72.02 | 99.92 | 8 | | 1,319 | 617,377 | 468 | 123,106 | 93 | 740,483 | 0.0354 |
| IO | gsm8k | 2025/1/22 | gpt-4o | 88.4 | 100 | 8 | | 1,319 | 542,416 | 411 | 199,030 | 151 | 741,446 | 3.3463 |
| IO | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 86.58 | 100 | 8 | | 1,319 | 555,340 | 421 | 313,720 | 238 | 869,060 | 0.4899 |
| IO | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 92.27 | 100 | 8 | | 1,319 | 583,916 | 443 | 251,359 | 191 | 835,275 | 0.4709 |
| IO | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 57.24 | 100 | 8 | | 1,319 | 596,229 | 452 | 291,684 | 221 | 887,913 | 0 |
| IO | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 57.16 | 99.55 | 8 | | 1,319 | 550,941 | 418 | 1,194,488 | 906 | 1,745,429 | 0 |
| IO | gsm8k | 2025/1/22 | Internllm2_5-7B | 11.6 | 97.95 | 8 | | 1,319 | 679,302 | 515 | 434,426 | 329 | 1,113,728 | 0 |
| IO | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 16.68 | 100 | 8 | | 1,319 | 568,530 | 431 | 168,466 | 128 | 736,996 | 0 |
| IO | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 14.71 | 100 | 8 | | 1,319 | 568,116 | 431 | 266,781 | 202 | 834,897 | 0 |
| IO | gsm8k | 2025/1/22 | deepseek-r1:1.5b | 64.14 | 99.62 | 8 | | 1,319 | 561,935 | 426 | 921,116 | 698 | 1,483,051 | 0 |
| ReAct-Pro* | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 | 8 | max_steps=10 | 1,319 | 6,506,164 | 4,933 | 140,122 | 106 | 6,646,286 | 3.4633 |
| ReAct-Pro* | gsm8k | 2025/1/7 | Doubao-lite-32k | 85.6 | 99.62 | 8 | max_steps=10 | 1,319 | 5,862,016 | 4,444 | 136,623 | 104 | 5,998,639 | 0.2512 |
| ReAct-Pro* | gsm8k | 2025/1/22 | gpt-4o | 63.31 | 99.55 | 8 | max_steps=10 | 1,319 | 14,411,173 | 10,926 | 304,714 | 231 | 14,715,887 | 39.0751 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 87.26 | 100 | 8 | max_steps=10 | 1,319 | 18,160,983 | 13,769 | 549,454 | 417 | 18,710,437 | 10.5479 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 87.64 | 99.92 | 8 | max_steps=10 | 1,319 | 17,038,928 | 12,918 | 898,936 | 682 | 17,937,864 | 10.1124 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 82.87 | 100 | 8 | max_steps=10 | 1,319 | 14,355,752 | 10,884 | 495,162 | 375 | 14,850,914 | 0 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 67.78 | 98.56 | 8 | max_steps=10 | 1,319 | 21,044,978 | 15,955 | 1,790,789 | 1,358 | 22,835,767 | 0 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Internllm2_5-7B | 33.51 | 97.95 | 8 | max_steps=10 | 1,319 | 30,120,070 | 22,836 | 5,549,919 | 4,208 | 35,669,989 | 0 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 24.87 | 80.21 | 8 | max_steps=10 | 1,319 | 9,133,603 | 6,925 | 694,398 | 526 | 9,828,001 | 0 |
| ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 7.66 | 95.22 | 8 | max_steps=10 | 1,319 | 52,431,343 | 39,751 | 2,961,268 | 2,245 | 55,392,611 | 0 |
| ReAct-Pro* | gsm8k | 2025/2/10 | deepseek-r1:1.5b | 35.94 | 99.62 | 8 | max_steps=10 | 1,319 | 19,299,381 | 14,632 | 4,919,696 | 3,730 | 24,219,077 | 0 |
| PoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 76.88 | 99.24 | 8 | | 1,319 | 1,090,418 | 827 | 96,662 | 73 | 1,187,080 | 0.6902 |
| PoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 79.61 | 92.57 | 8 | | 1,319 | 1,170,038 | 887 | 118,017 | 89 | 1,288,055 | 0.0576 |
| PoT | gsm8k | 2025/1/22 | gpt-4o | 93.1 | 99.77 | 8 | | 1,319 | 1,101,672 | 835 | 146,240 | 111 | 1,247,912 | 4.2166 |
| PoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.34 | 99.39 | 8 | | 1,319 | 1,106,682 | 839 | 144,528 | 110 | 1,251,210 | 0.7054 |
| PoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 73.09 | 79.61 | 8 | | 1,319 | 1,126,025 | 854 | 601,019 | 456 | 1,727,044 | 0.9736 |
| PoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 58.83 | 70.51 | 8 | | 1,319 | 1,145,390 | 868 | 217,432 | 165 | 1,362,822 | 0 |
| PoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 38.67 | 55.42 | 8 | | 1,319 | 1,147,538 | 870 | 243,573 | 185 | 1,391,111 | 0 |
| PoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 38.21 | 48.9 | 8 | | 1,319 | 1,136,843 | 862 | 188,106 | 143 | 1,324,949 | 0 |
| PoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 18.5 | 31.01 | 8 | | 1,319 | 1,151,528 | 873 | 175,994 | 133 | 1,327,522 | 0 |
| PoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 9.63 | 16.91 | 8 | | 1,319 | 1,151,528 | 873 | 237,607 | 180 | 1,389,135 | 0 |
| PoT | gsm8k | 2025/2/10 | deepseek-r1:1.5b | 11.9 | 17.44 | 8 | | 1,319 | 1,138,872 | 863 | 815,637 | 618 | 1,954,509 | 0 |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 78.7 | 100 | 8 | | 1,319 | 953,242 | 723 | 134,799 | 102 | 1,088,041 | 0.6788 |
| CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 89.31 | 100 | 8 | | 1,319 | 1,042,095 | 790 | 159,725 | 121 | 1,201,820 | 0.0558 |
| CoT | gsm8k | 2025/1/22 | gpt-4o | 94.09 | 100 | 8 | | 1,319 | 948,668 | 719 | 216,498 | 164 | 1,165,166 | 4.5367 |
| CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.87 | 100 | 8 | | 1,319 | 1,005,119 | 762 | 271,133 | 206 | 1,276,252 | 0.7195 |
| CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 93.93 | 100 | 8 | | 1,319 | 990,168 | 751 | 228,497 | 173 | 1,218,665 | 0.687 |
| CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 85.67 | 100 | 8 | | 1,319 | 1,046,008 | 793 | 244,797 | 186 | 1,290,805 | 0 |
| CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 75.44 | 99.92 | 8 | | 1,319 | 990,168 | 751 | 258,161 | 196 | 1,248,329 | 0 |
| CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 77.71 | 99.7 | 8 | | 1,319 | 968,163 | 734 | 234,000 | 177 | 1,202,163 | 0 |
| CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 55.5 | 100 | 8 | | 1,319 | 1,032,818 | 783 | 185,707 | 141 | 1,218,525 | 0 |
| CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 35.94 | 99.92 | 8 | | 1,319 | 1,032,818 | 783 | 190,641 | 145 | 1,223,459 | 0 |
| CoT | gsm8k | 2025/1/23 | deepseek-r1:1.5b | 70.66 | 99.77 | 8 | | 1,319 | 1,011,714 | 767 | 1,078,911 | 818 | 2,090,625 | 0 |
| SC-CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 79.91 | 99.92 | 8 | temperature=1, path_num=5 | 1,319 | 2,740,652 | 2,078 | 1,348,960 | 1,023 | 4,089,612 | 3.3938 |
| SC-CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 87.26 | 99.92 | 8 | temperature=1, path_num=5 | 1,319 | 2,691,714 | 2,041 | 1,197,099 | 908 | 3,888,813 | 0.2083 |
| SC-CoT | gsm8k | 2025/1/22 | gpt-4o | 90.3 | 99.92 | 8 | temperature=1, path_num=5 | 1,319 | 3,590,336 | 2,722 | 2,207,837 | 1,674 | 5,798,173 | 31.0542 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 93.86 | 100 | 8 | temperature=1, path_num=5 | 1,319 | 8,136,223 | 6,168 | 2,481,785 | 1,882 | 10,618,008 | 5.9858 |
| SC-CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 95.07 | 100 | 8 | temperature=1, path_num=5 | 1,319 | 8,413,717 | 6,379 | 2,585,077 | 1,960 | 10,998,794 | 6.2005 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 91.13 | 100 | 8 | temperature=1, path_num=5 | 1,319 | 8,586,888 | 6,510 | 2,554,097 | 1,936 | 11,140,985 | 0 |
| SC-CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 73.46 | 99.55 | 8 | temperature=1, path_num=5 | 1,319 | 8,630,514 | 6,543 | 3,148,202 | 2,387 | 11,778,716 | 0 |
| SC-CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 48.22 | 98.41 | 8 | temperature=1, path_num=5 | 1,319 | 10,678,792 | 8,096 | 3,847,639 | 2,917 | 14,526,431 | 0 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 11.75 | 91.89 | 8 | temperature=1, path_num=5 | 1,319 | 9,066,115 | 6,873 | 3,345,827 | 2,537 | 12,411,942 | 0 |
| SC-CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 1.67 | 94.69 | 8 | temperature=1, path_num=5 | 1,319 | 11,019,864 | 8,355 | 5,445,856 | 4,129 | 16,465,720 | 0 |
| SC-CoT | gsm8k | 2025/2/10 | deepseek-r1:1.5b | 55.34 | 99.7 | 8 | temperature=1, path_num=5 | 1,319 | 14,540,096 | 11,024 | 11,245,769 | 8,526 | 25,785,865 | 0 |
| ToT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 67.93 | 99.7 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 15,920,037 | 12,070 | 807,138 | 612 | 16,727,175 | 9.1707 |
| ToT | gsm8k | 2025/1/7 | Doubao-lite-32k | 37.83 | 87.34 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 19,208,597 | 14,563 | 1,065,752 | 808 | 20,274,349 | 0.8739 |
| ToT | gsm8k | 2025/1/22 | gpt-4o | 91.13 | 100 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 29,445,237 | 22,324 | 1,324,498 | 1,004 | 30,769,735 | 86.8581 |
| ToT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 88.88 | 100 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 40,435,361 | 30,656 | 1,411,787 | 1,070 | 41,847,148 | 23.5911 |
| ToT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 91.89 | 100 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 35,096,810 | 26,609 | 1,932,877 | 1,465 | 37,029,687 | 20.8753 |
| ToT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 72.21 | 99.01 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 20,196,528 | 15,312 | 11,460,791 | 8,689 | 31,657,319 | 0 |
| ToT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 65.05 | 91.96 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 15,554,967 | 11,793 | 877,135 | 665 | 16,432,102 | 0 |
| ToT | gsm8k | 2025/1/22 | Internllm2_5-7B | 20.85 | 70.13 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 11,768,118 | 8,922 | 1,410,011 | 1,069 | 13,178,129 | 0 |
| ToT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 19.64 | 77.26 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 12,124,248 | 9,192 | 634,439 | 481 | 12,758,687 | 0 |
| ToT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | - | - | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | - | - | - | - | - | - |
| ToT | gsm8k | 2025/2/10 | deepseek-r1:1.5b | 23.12 | 72.48 | 8 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319 | 2,738,244 | 2,076 | 683,242 | 518 | 3,421,486 | 0 |
| IO | AQuA | 2025/1/7 | gpt-3.5-turbo | 38.98 | 100 | 0 | | 254 | 25,701 | 101 | 16,770 | 66 | 42,471 | 0.038 |
| IO | AQuA | 2025/1/7 | Doubao-lite-32k | 79.13 | 100 | 0 | | 254 | 33,058 | 130 | 54,684 | 215 | 87,742 | 0.0058 |
| IO | AQuA | 2025/1/22 | gpt-4o | 75.59 | 97.24 | 0 | | 254 | 25,631 | 101 | 108,121 | 426 | 133,752 | 1.1453 |
| IO | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 84.25 | 99.61 | 0 | | 254 | 25,397 | 100 | 106,207 | 418 | 131,604 | 0.0742 |
| IO | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.68 | 99.21 | 0 | | 254 | 32,809 | 129 | 108,758 | 428 | 141,567 | 0.0798 |
| IO | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 78.74 | 98.43 | 0 | | 254 | 33,271 | 131 | 104,500 | 411 | 137,771 | 0 |
| IO | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 51.18 | 98.82 | 0 | | 254 | 26,459 | 104 | 106,647 | 420 | 133,106 | 0 |
| IO | AQuA | 2025/1/22 | Internllm2_5-7B | 47.64 | 90.94 | 0 | | 254 | 50,232 | 198 | 134,809 | 531 | 185,041 | 0 |
| IO | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 29.13 | 97.64 | 0 | | 254 | 27,937 | 110 | 43,110 | 170 | 71,047 | 0 |
| IO | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 27.17 | 98.82 | 0 | | 254 | 27,937 | 110 | 82,478 | 325 | 110,415 | 0 |
| IO | AQuA | 2025/1/22 | deepseek-r1:1.5b | 68.9 | 94.88 | 0 | | 254 | 26,667 | 105 | 325,100 | 1,280 | 351,767 | 0 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 61.02 | 93.7 | 0 | | 254 | 25,447 | 100 | 55,346 | 218 | 80,793 | 0.0957 |
| CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 82.68 | 97.24 | 0 | | 254 | 27,978 | 110 | 66,599 | 262 | 94,577 | 0.0066 |
| CoT | AQuA | 2025/1/22 | gpt-4o | 82.68 | 98.03 | 0 | | 254 | 25,123 | 99 | 97,894 | 385 | 123,017 | 1.0417 |
| CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 86.22 | 99.21 | 0 | | 254 | 25,143 | 99 | 118,146 | 465 | 143,289 | 0.0808 |
| CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 83.46 | 98.43 | 0 | | 254 | 32,555 | 128 | 131,834 | 519 | 164,389 | 0.0927 |
| CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 80.71 | 99.61 | 0 | | 254 | 33,017 | 130 | 116,719 | 460 | 149,736 | 0 |
| CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 60.63 | 100 | 0 | | 254 | 32,555 | 128 | 111,880 | 440 | 144,435 | 0 |
| CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 52.76 | 89.37 | 0 | | 254 | 26,610 | 105 | 100,910 | 397 | 127,520 | 0 |
| CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 40.55 | 98.82 | 0 | | 254 | 30,477 | 120 | 79,563 | 313 | 110,040 | 0 |
| CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 33.07 | 98.82 | 0 | | 254 | 30,477 | 120 | 86,862 | 342 | 117,339 | 0 |
| CoT | AQuA | 2025/1/23 | deepseek-r1:1.5b | 71.65 | 96.85 | 0 | | 254 | 26,413 | 104 | 306,659 | 1,207 | 333,072 | 0 |
| PoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 59.45 | 100 | 0 | | 254 | 225,162 | 886 | 41,492 | 163 | 266,654 | 0.1748 |
| PoT | AQuA | 2025/1/7 | Doubao-lite-32k | 71.65 | 96.85 | 0 | | 254 | 259,863 | 1,023 | 49,573 | 195 | 309,436 | 0.0147 |
| PoT | AQuA | 2025/1/22 | gpt-4o | 75.2 | 100 | 0 | | 254 | 222,717 | 877 | 105,191 | 414 | 327,908 | 1.6087 |
| PoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 75.2 | 100 | 0 | | 254 | 249,215 | 981 | 42,549 | 168 | 291,764 | 0.1645 |
| PoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.53 | 99.21 | 0 | | 254 | 240,735 | 948 | 69,064 | 272 | 309,799 | 0.1746 |
| PoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 68.11 | 100 | 0 | | 254 | 264,517 | 1,041 | 49,211 | 194 | 313,728 | 0 |
| PoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 36.61 | 96.85 | 0 | | 254 | 240,613 | 947 | 50,301 | 198 | 290,914 | 0 |
| PoT | AQuA | 2025/1/22 | Internllm2_5-7B | 36.61 | 98.82 | 0 | | 254 | 233,505 | 919 | 68,457 | 270 | 301,962 | 0 |
| PoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 30.71 | 96.46 | 0 | | 254 | 246,560 | 971 | 51,915 | 204 | 298,475 | 0 |
| PoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 17.32 | 92.13 | 0 | | 254 | 258,867 | 1,019 | 63,414 | 250 | 322,281 | 0 |
| PoT | AQuA | 2025/2/10 | deepseek-r1:1.5b | 54.72 | 97.24 | 0 | | 254 | 250,690 | 987 | 765,957 | 3,016 | 1,016,647 | 0 |
| SC-CoT | AQuA | 2025/1/22 | gpt-3.5-turbo | 66.14 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 482,192 | 1,898 | 365,143 | 1,438 | 847,335 | 0.7888 |
| SC-CoT | AQuA | 2025/1/22 | Doubao-lite-32k | 81.1 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 503,751 | 1,983 | 382,235 | 1,505 | 885,986 | 0.0519 |
| SC-CoT | AQuA | 2025/1/22 | gpt-4o | 86.61 | 98.82 | 0 | temperature=1, path_num=5 | 254 | 744,478 | 2,931 | 628,728 | 2,475 | 1,373,206 | 8.1485 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 85.04 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,051,218 | 4,139 | 784,451 | 3,088 | 1,835,669 | 1.0348 |
| SC-CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.28 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,135,251 | 4,469 | 772,673 | 3,042 | 1,907,924 | 1.0756 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 79.92 | 100 | 0 | temperature=1, path_num=5 | 254 | 1,098,280 | 4,324 | 747,052 | 2,941 | 1,845,332 | 0 |
| SC-CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 59.45 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 971,003 | 3,823 | 680,330 | 2,678 | 1,651,333 | 0 |
| SC-CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 39.37 | 98.03 | 0 | temperature=1, path_num=5 | 254 | 1,420,494 | 5,592 | 875,728 | 3,448 | 2,296,222 | 0 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 23.62 | 96.46 | 0 | temperature=1, path_num=5 | 254 | 1,034,362 | 4,072 | 740,973 | 2,917 | 1,775,335 | 0 |
| SC-CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 22.83 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 1,246,929 | 4,909 | 968,162 | 3,812 | 2,215,091 | 0 |
| SC-CoT | AQuA | 2025/2/10 | deepseek-r1:1.5b | 59.06 | 96.85 | 0 | temperature=1, path_num=5 | 254 | 2,547,772 | 10,031 | 3,254,939 | 12,815 | 5,802,711 | 0 |
| ReAct-Pro* | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 | 0 | max_steps=10 | 254 | 862,614 | 3,396 | 40,973 | 161 | 903,587 | 0.4928 |
| ReAct-Pro* | AQuA | 2025/1/7 | Doubao-lite-32k | 77.56 | 96.06 | 0 | max_steps=10 | 254 | 977,890 | 3,850 | 54,951 | 216 | 1,032,841 | 0.0445 |
| ReAct-Pro* | AQuA | 2025/1/22 | gpt-4o | 57.48 | 97.24 | 0 | max_steps=10 | 254 | 615,589 | 2,424 | 76,507 | 301 | 692,096 | 2.304 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 73.23 | 100 | 0 | max_steps=10 | 254 | 441,765 | 1,739 | 121,838 | 480 | 563,603 | 0.3177 |
| ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.13 | 99.61 | 0 | max_steps=10 | 254 | 1,119,143 | 4,406 | 243,236 | 958 | 1,362,379 | 0.768 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 74.41 | 99.21 | 0 | max_steps=10 | 254 | 564,165 | 2,221 | 131,679 | 518 | 695,844 | 0 |
| ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 55.51 | 96.85 | 0 | max_steps=10 | 254 | 3,764,723 | 14,822 | 576,098 | 2,268 | 4,340,821 | 0 |
| ReAct-Pro* | AQuA | 2025/1/22 | Internllm2_5-7B | 40.94 | 96.85 | 0 | max_steps=10 | 254 | 3,592,039 | 14,142 | 836,762 | 3,294 | 4,428,801 | 0 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 25.59 | 96.06 | 0 | max_steps=10 | 254 | 4,555,858 | 17,936 | 516,146 | 2,032 | 5,072,004 | 0 |
| ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 24.02 | 96.85 | 0 | max_steps=10 | 254 | 6,344,167 | 24,977 | 825,920 | 3,252 | 7,170,087 | 0 |
| ReAct-Pro* | AQuA | 2025/2/10 | deepseek-r1:1.5b | 54.33 | 96.46 | 0 | max_steps=10 | 254 | 10,578,715 | 41,648 | 3,866,326 | 15,222 | 14,445,041 | 0 |
| ToT | AQuA | 2025/1/7 | gpt-3.5-turbo | 57.09 | 99.61 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 1,850,767 | 7,286 | 150,629 | 593 | 2,001,396 | 1.1513 |
| ToT | AQuA | 2025/1/7 | Doubao-lite-32k | 45.28 | 74.02 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 1,850,249 | 7,284 | 150,301 | 592 | 2,000,550 | 0.0881 |
| ToT | AQuA | 2025/1/22 | gpt-4o | 81.5 | 99.21 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 2,347,538 | 9,242 | 266,069 | 1,048 | 2,613,607 | 8.5295 |
| ToT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 81.1 | 99.21 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 6,371,642 | 25,085 | 260,613 | 1,026 | 6,632,255 | 3.7389 |
| ToT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 83.07 | 100 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 4,735,188 | 18,642 | 480,660 | 1,892 | 5,215,848 | 2.9404 |
| ToT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 53.94 | 100 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 8,224,468 | 32,380 | 378,214 | 1,489 | 8,602,682 | 0 |
| ToT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 59.06 | 100 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 4,896,222 | 19,276 | 843,462 | 3,321 | 5,739,684 | 0 |
| ToT | AQuA | 2025/1/22 | Internllm2_5-7B | 35.83 | 99.61 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 4,263,136 | 16,784 | 471,424 | 1,856 | 4,734,560 | 0 |
| ToT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 31.5 | 98.82 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 6,058,022 | 23,850 | 192,680 | 759 | 6,250,702 | 0 |
| ToT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 29.92 | 100 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 8,100,085 | 31,890 | 600,196 | 2,363 | 8,700,281 | 0 |
| ToT | AQuA | 2025/2/10 | deepseek-r1:1.5b | 24.8 | 55.51 | 0 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254 | 605,028 | 2,382 | 189,484 | 746 | 794,512 | 0 |
| IO | MATH-500 | 2025/1/24 | gpt-3.5-turbo | 17.2 | 100 | 4 | | 500 | 154,881 | 310 | 110,744 | 221 | 265,625 | 0.2436 |
| IO | MATH-500 | 2025/1/24 | Doubao-lite-32k | 37.4 | 100 | 4 | | 500 | 166,870 | 334 | 144,860 | 290 | 311,730 | 0.0187 |
| IO | MATH-500 | 2025/1/22 | gpt-4o | 41.8 | 100 | 4 | | 500 | 153,832 | 308 | 240,615 | 481 | 394,447 | 2.7907 |
| IO | MATH-500 | 2025/1/24 | Qwen2.5-72B-Instruct | 70.2 | 100 | 4 | | 500 | 169,549 | 339 | 275,042 | 550 | 444,591 | 0.2506 |
| IO | MATH-500 | 2025/1/24 | Llama-3.3-70B-Instruct | 69.4 | 100 | 4 | | 500 | 155,879 | 312 | 267,337 | 535 | 423,216 | 0.2386 |
| IO | MATH-500 | 2025/1/24 | Qwen2.5-7B-Instruct | 59.4 | 100 | 4 | | 500 | 169,549 | 339 | 241,813 | 484 | 411,362 | 0 |
| IO | MATH-500 | 2025/1/24 | Llama-3.1-8B-Instruct | 38.6 | 100 | 4 | | 500 | 155,563 | 311 | 348,371 | 697 | 503,934 | 0 |
| IO | MATH-500 | 2025/1/24 | Internllm2_5-7B | 22.8 | 100 | 4 | | 500 | 201,883 | 404 | 266,005 | 532 | 467,888 | 0 |
| IO | MATH-500 | 2025/1/24 | Qwen2-1.5B-Instruct | 7 | 100 | 4 | | 500 | 158,777 | 318 | 255,101 | 510 | 413,878 | 0 |
| IO | MATH-500 | 2025/1/24 | Qwen2-0.5B-Instruct | 2.6 | 100 | 4 | | 500 | 159,049 | 318 | 270,281 | 541 | 429,330 | 0 |
| IO | MATH-500 | 2025/1/24 | deepseek-r1:1.5b | 43.8 | 100 | 4 | | 500 | 157,049 | 314 | 865,499 | 1,731 | 1,022,548 | 0 |
| CoT | MATH-500 | 2025/1/24 | gpt-3.5-turbo | 39.8 | 100 | 4 | | 500 | 329,381 | 659 | 102,815 | 206 | 432,196 | 0.3189 |
| CoT | MATH-500 | 2025/1/22 | Doubao-lite-32k | 59 | 100 | 4 | | 500 | 336,370 | 673 | 143,571 | 287 | 479,941 | 0.0255 |
| CoT | MATH-500 | 2025/1/24 | gpt-4o | 68 | 100 | 4 | | 500 | 329,332 | 659 | 223,356 | 447 | 552,688 | 3.0569 |
| CoT | MATH-500 | 2025/1/22 | Qwen2.5-72B-Instruct | 80.2 | 100 | 4 | | 500 | 338,549 | 677 | 280,466 | 561 | 619,015 | 0.349 |
| CoT | MATH-500 | 2025/1/24 | Llama-3.3-70B-Instruct | 71.2 | 100 | 4 | | 500 | 342,879 | 686 | 271,342 | 543 | 614,221 | 0.3463 |
| CoT | MATH-500 | 2025/1/24 | Qwen2.5-7B-Instruct | 69.8 | 100 | 4 | | 500 | 354,049 | 708 | 263,155 | 526 | 617,204 | 0 |
| CoT | MATH-500 | 2025/1/24 | Llama-3.1-8B-Instruct | 25.8 | 100 | 4 | | 500 | 342,879 | 686 | 282,689 | 565 | 625,568 | 0 |
| CoT | MATH-500 | 2025/1/24 | Internllm2_5-7B | 46.6 | 100 | 4 | | 500 | 332,883 | 666 | 213,891 | 428 | 546,774 | 0 |
| CoT | MATH-500 | 2025/1/24 | Qwen2-1.5B-Instruct | 15.2 | 100 | 4 | | 500 | 349,049 | 698 | 187,328 | 375 | 536,377 | 0 |
| CoT | MATH-500 | 2025/1/24 | Qwen2-0.5B-Instruct | 6.2 | 100 | 4 | | 500 | 349,049 | 698 | 200,139 | 400 | 549,188 | 0 |
| CoT | MATH-500 | 2025/1/24 | deepseek-r1:1.5b | 49.4 | 100 | 4 | | 500 | 341,549 | 683 | 857,580 | 1,715 | 1,199,129 | 0 |
| PoT | MATH-500 | 2025/2/10 | gpt-3.5-turbo | 28.8 | 83.8 | 4 | | 500 | 239,902 | 480 | 32,014 | 64 | 271,916 | 0.168 |
| PoT | MATH-500 | 2025/2/10 | Doubao-lite-32k | 32.6 | 68 | 4 | | 500 | 254,377 | 509 | 48,771 | 98 | 303,148 | 0.0144 |
| PoT | MATH-500 | 2025/2/10 | gpt-4o | 46.2 | 86.4 | 4 | | 500 | 241,357 | 483 | 99,603 | 199 | 340,960 | 1.5994 |
| PoT | MATH-500 | 2025/2/10 | Qwen2.5-72B-Instruct | 47.2 | 82.2 | 4 | | 500 | 242,549 | 485 | 170,823 | 342 | 413,372 | 0.233 |
| PoT | MATH-500 | 2025/2/10 | Llama-3.3-70B-Instruct | 42.6 | 80.2 | 4 | | 500 | 253,879 | 508 | 249,717 | 499 | 503,596 | 0.2839 |
| PoT | MATH-500 | 2025/2/10 | Qwen2.5-7B-Instruct | 39.6 | 74.4 | 4 | | 500 | 258,549 | 517 | 150,263 | 301 | 408,812 | 0 |
| PoT | MATH-500 | 2025/2/10 | Llama-3.1-8B-Instruct | 25.4 | 68.4 | 4 | | 500 | 253,879 | 508 | 208,392 | 417 | 462,271 | 0 |
| PoT | MATH-500 | 2025/2/10 | Internllm2_5-7B | 15 | 32.4 | 4 | | 500 | 247,883 | 496 | 120,826 | 242 | 368,709 | 0 |
| PoT | MATH-500 | 2025/2/10 | Qwen2-1.5B-Instruct | 0.8 | 2.2 | 4 | | 500 | 248,509 | 497 | 538,361 | 1,077 | 786,870 | 0 |
| PoT | MATH-500 | 2025/2/10 | Qwen2-0.5B-Instruct | 0 | 0 | 4 | | 500 | 253,549 | 507 | 183,653 | 367 | 437,202 | 0 |
| PoT | MATH-500 | 2025/2/10 | deepseek-r1:1.5b | 1 | 1.6 | 4 | | 500 | 245,549 | 491 | 785,518 | 1,571 | 1,031,067 | 0 |
| SC-CoT | MATH-500 | 2025/2/10 | gpt-3.5-turbo | 28.8 | 100 | 4 | temperature=1, path_num=5 | 500 | 1,381,818 | 2,764 | 856,994 | 1,714 | 2,238,812 | 1.9764 |
| SC-CoT | MATH-500 | 2025/2/10 | Doubao-lite-32k | 49.2 | 100 | 4 | temperature=1, path_num=5 | 500 | 1,507,651 | 3,015 | 963,159 | 1,926 | 2,470,810 | 0.1406 |
| SC-CoT | MATH-500 | 2025/2/10 | gpt-4o | 34.4 | 100 | 4 | temperature=1, path_num=5 | 500 | 1,986,584 | 3,973 | 1,468,739 | 2,937 | 3,455,323 | 19.6538 |
| SC-CoT | MATH-500 | 2025/2/10 | Qwen2.5-72B-Instruct | 74 | 100 | 4 | temperature=1, path_num=5 | 500 | 3,823,997 | 7,648 | 1,773,516 | 3,547 | 5,597,513 | 3.1556 |
| SC-CoT | MATH-500 | 2025/2/10 | Llama-3.3-70B-Instruct | 74.2 | 100 | 4 | temperature=1, path_num=5 | 500 | 3,959,492 | 7,919 | 1,759,247 | 3,518 | 5,718,739 | 3.2239 |
| SC-CoT | MATH-500 | 2025/2/10 | Qwen2.5-7B-Instruct | 67 | 100 | 4 | temperature=1, path_num=5 | 500 | 3,833,751 | 7,668 | 1,617,733 | 3,235 | 5,451,484 | 0 |
| SC-CoT | MATH-500 | 2025/2/10 | Llama-3.1-8B-Instruct | 30.2 | 100 | 4 | temperature=1, path_num=5 | 500 | 3,546,673 | 7,093 | 1,488,264 | 2,977 | 5,034,937 | 0 |
| SC-CoT | MATH-500 | 2025/2/10 | Internllm2_5-7B | 9.8 | 100 | 4 | temperature=1, path_num=5 | 500 | 4,193,296 | 8,387 | 1,645,170 | 3,290 | 5,838,466 | 0 |
| SC-CoT | MATH-500 | 2025/2/10 | Qwen2-1.5B-Instruct | 3.8 | 99 | 4 | temperature=1, path_num=5 | 500 | 3,832,429 | 7,665 | 1,737,013 | 3,474 | 5,569,442 | 0 |
| SC-CoT | MATH-500 | 2025/2/10 | Qwen2-0.5B-Instruct | 0.8 | 100 | 4 | temperature=1, path_num=5 | 500 | 4,448,663 | 8,897 | 2,413,393 | 4,827 | 6,862,056 | 0 |
| SC-CoT | MATH-500 | 2025/2/10 | deepseek-r1:1.5b | 38 | 100 | 4 | temperature=1, path_num=5 | 500 | 7,080,559 | 14,161 | 7,661,550 | 15,323 | 14,742,109 | 0 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | gpt-3.5-turbo | 23.8 | 100 | 4 | max_steps=10 | 500 | 3,708,461 | 7,417 | 124,253 | 249 | 3,832,714 | 2.0406 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Doubao-lite-32k | 47.2 | 100 | 4 | max_steps=10 | 500 | 4,234,620 | 8,469 | 154,046 | 308 | 4,388,666 | 0.186 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | gpt-4o | 54 | 100 | 4 | max_steps=10 | 500 | 5,834,537 | 11,669 | 318,718 | 637 | 6,153,255 | 17.7735 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Qwen2.5-72B-Instruct | 62.8 | 100 | 4 | max_steps=10 | 500 | 5,747,268 | 11,495 | 379,849 | 760 | 6,127,117 | 3.4541 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Llama-3.3-70B-Instruct | 64.6 | 100 | 4 | max_steps=10 | 500 | 5,223,611 | 10,447 | 418,268 | 837 | 5,641,879 | 3.1806 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Qwen2.5-7B-Instruct | 48.8 | 100 | 4 | max_steps=10 | 500 | 4,646,708 | 9,293 | 343,532 | 687 | 4,990,240 | 0 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Llama-3.1-8B-Instruct | 28.8 | 100 | 4 | max_steps=10 | 500 | 7,486,706 | 14,973 | 1,276,923 | 2,554 | 8,763,629 | 0 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Internllm2_5-7B | 14.8 | 100 | 4 | max_steps=10 | 500 | 11,831,496 | 23,663 | 2,354,609 | 4,709 | 14,186,105 | 0 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Qwen2-1.5B-Instruct | 8.2 | 100 | 4 | max_steps=10 | 500 | 8,430,774 | 16,862 | 556,287 | 1,113 | 8,987,061 | 0 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | Qwen2-0.5B-Instruct | 0.6 | 100 | 4 | max_steps=10 | 500 | 18,137,392 | 36,275 | 1,305,048 | 2,610 | 19,442,440 | 0 |
| ReAct-Pro* | MATH-500 | 2025/2/10 | deepseek-r1:1.5b | 24.4 | 100 | 4 | max_steps=10 | 500 | 20,729,970 | 41,460 | 9,447,378 | 18,895 | 30,177,348 | 0 |
| ToT | MATH-500 | 2025/2/10 | gpt-3.5-turbo | 9.8 | 100 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 9,711,244 | 19,422 | 290,523 | 581 | 10,001,767 | 5.2914 |
| ToT | MATH-500 | 2025/2/10 | Doubao-lite-32k | 1.2 | 94.2 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 5,338,500 | 10,677 | 226,000 | 452 | 5,564,500 | 0.2371 |
| ToT | MATH-500 | 2025/2/10 | gpt-4o | 3.2 | 100 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 14,881,985 | 29,764 | 360,447 | 721 | 15,242,432 | 40.8094 |
| ToT | MATH-500 | 2025/2/10 | Qwen2.5-72B-Instruct | 10.8 | 100 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 15,657,730 | 31,315 | 381,631 | 763 | 16,039,361 | 9.0421 |
| ToT | MATH-500 | 2025/2/10 | Llama-3.3-70B-Instruct | 1.4 | 69.8 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 14,099,500 | 28,199 | 570,000 | 1,140 | 14,669,500 | 8.2699 |
| ToT | MATH-500 | 2025/2/10 | Qwen2.5-7B-Instruct | 1.4 | 91.6 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 9,749,000 | 19,498 | 418,500 | 837 | 10,167,500 | 0 |
| ToT | MATH-500 | 2025/2/10 | Llama-3.1-8B-Instruct | 1.8 | 90.8 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 7,729,000 | 15,458 | 1,306,000 | 2,612 | 9,035,000 | 0 |
| ToT | MATH-500 | 2025/2/10 | Internllm2_5-7B | 0.2 | 99 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 7,515,000 | 15,030 | 835,500 | 1,671 | 8,350,500 | 0 |
| ToT | MATH-500 | 2025/2/10 | Qwen2-1.5B-Instruct | 0.8 | 97.2 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 4,408,000 | 8,816 | 127,000 | 254 | 4,535,000 | 0 |
| ToT | MATH-500 | 2025/2/10 | Qwen2-0.5B-Instruct | 0 | 96.2 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 5,590,500 | 11,181 | 406,000 | 812 | 5,996,500 | 0 |
| ToT | MATH-500 | 2025/2/10 | deepseek-r1:1.5b | 0.4 | 71.6 | 4 | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500 | 1,831,000 | 3,662 | 110,500 | 221 | 1,941,500 | 0 |

Default settings:

temperature = 0 (except for SC-CoT, which uses temperature = 1)

LLM prices:

  • gpt-3.5-turbo:
    • 0.5$/1M tokens (input)
    • 1.5$/1M tokens (output)
  • Doubao-lite-32k (1 USD = 7.3249 CNY):
    • 0.04096$/1M tokens (input)
    • 0.08200$/1M tokens (output)
  • gpt-4o-2024-08-06:
    • 2.50$/1M tokens (input)
    • 10$/1M tokens (output)
  • Qwen2.5-72B-Instruct and Llama-3.3-70B-Instruct:
  • Other open-source LLMs:
    • Deployed locally; please check the OmAgent repository for more information.
    • Cost is not considered in the leaderboard.
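
Given these prices, the Cost($) columns follow from straightforward token accounting. The Python sketch below shows the presumed formula, checked against the IO + gpt-3.5-turbo row on gsm8k (546,990 input and 39,563 output tokens, reported as $0.3328):

    # $ per 1M tokens (input, output), taken from the price list above.
    PRICES = {
        "gpt-3.5-turbo": (0.5, 1.5),
        "gpt-4o-2024-08-06": (2.50, 10.0),
        "Doubao-lite-32k": (0.04096, 0.08200),
    }

    def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

    print(round(run_cost("gpt-3.5-turbo", 546_990, 39_563), 4))  # 0.3328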

Pass Rate*: the percentage of predictions that are valid, where a prediction is valid if it is neither empty nor null.
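
In code, that validity check reduces to something like the following sketch (treating whitespace-only strings as empty is our assumption):

    def is_valid(prediction) -> bool:
        # Valid means neither null nor empty.
        return prediction is not None and str(prediction).strip() != ""

    def pass_rate(predictions) -> float:
        return 100.0 * sum(map(is_valid, predictions)) / len(predictions)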

Compare to original agent repositories

| Algorithm | Dataset | Eval Time | LLM | Framework | Score |
| --- | --- | --- | --- | --- | --- |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | Original repo | 79.23 |
| CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | OmAgent | 78.70 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 60.63 |
| CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 61.02 |
| PoT | gsm8k | 2025/1/7 | gpt-4o-mini | Original repo | 86.35 |
| PoT | gsm8k | 2025/1/7 | gpt-4o-mini | OmAgent | 88.25 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 35.04 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 34.25 |
| ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | Original repo | 28.00 |
| ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | OmAgent | 27.40 |

Note:

  • The original repo is the official repository of the agent implementation.
  • OmAgent is the implementation of the agent in this project.
  • There is no official implementation of SC-CoT.

Comparison of ReAct with ReAct-Pro

| Algorithm | Dataset | Eval Time | LLM | Score | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| ReAct | gsm8k | 2025/1/7 | gpt-3.5-turbo | 38.13 | 100.00 |
| ReAct-Pro | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 |
| ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | 34.25 | 97.64 |
| ReAct-Pro | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 |

πŸ”— Related works

Open Agent Leaderboard is built on top of the OmAgent repository.

πŸ™ Acknowledgments

We extend our deepest gratitude to the authors and contributors of the datasets gsm8k, AQuA, and MATH-500; of the agent algorithms CoT, SC-CoT, PoT, ReAct, and ToT; and of the LLMs gpt-3.5-turbo, Doubao-lite-32k, gpt-4o, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, Qwen2-1.5B-Instruct, Qwen2-0.5B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, Internllm2_5-7B, and deepseek-r1:1.5b.

⭐️ Citation

If you find our repository beneficial, please cite it:

@misc{open-agent-leaderboard,
    title={Open Agent Leaderboard},
    author={Om AI Lab},
    year={2025},
    publisher={GitHub},
    howpublished={\url{https://github.com/om-ai-lab/open-agent-leaderboard}}
}

πŸ”” Follow us

You can follow us on X and Discord for more updates and discussions.

🀝 Contributing

Feel free to submit issues and pull requests.

πŸ“ License

This project is licensed under the MIT License.