Paper: SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Authors: Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, et al.
Abstract: Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
Link: https://arxiv.org/abs/2409.07440
Reasoning: We start by examining the title and abstract for any mention of language models. The title mentions "Evaluating Agents" and "Tasks from Research," which does not directly indicate language models. However, the abstract explicitly mentions "Large Language Models (LLMs)" and discusses their capabilities and evaluation. This indicates that the paper is indeed focused on language models.
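The check described above (scan the title, then the abstract, for language-model mentions) could be approximated with a simple keyword filter. The sketch below is only illustrative; the pattern list and the `mentions_language_models` function are assumptions, not part of the paper or this classification pipeline.

```python
import re

# Hypothetical keyword patterns for detecting language-model-related papers.
LM_PATTERNS = [
    r"\blarge language models?\b",
    r"\bLLMs?\b",
    r"\blanguage models?\b",
]

def mentions_language_models(title: str, abstract: str) -> bool:
    """Return True if the title or abstract matches any language-model keyword."""
    text = f"{title} {abstract}"
    return any(re.search(p, text, flags=re.IGNORECASE) for p in LM_PATTERNS)

# Example: the title alone is inconclusive, but the abstract mentions LLMs.
title = "SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories"
abstract = "Given that Large Language Models (LLMs) have made significant progress in writing code, ..."
print(mentions_language_models(title, abstract))  # True
```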