Paper: SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Authors: Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, et al.
Abstract: Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
Link: https://arxiv.org/abs/2409.07440
Reasoning: We start by examining the title and abstract for any mention of language models. The title mentions "Evaluating Agents" and "Tasks from Research," which does not directly indicate language models. However, the abstract explicitly mentions "Large Language Models (LLMs)" and discusses their capabilities and evaluation. This indicates that the paper is indeed focused on language models.
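The check described above (scan the title, then the abstract, for language-model mentions) could be approximated with a simple keyword filter. The sketch below is only illustrative; the pattern list and the `mentions_language_models` function are assumptions, not part of the paper or this classification pipeline.

```python
import re

# Hypothetical keyword patterns for detecting language-model-related papers.
LM_PATTERNS = [
    r"\blarge language models?\b",
    r"\bLLMs?\b",
    r"\blanguage models?\b",
]

def mentions_language_models(title: str, abstract: str) -> bool:
    """Return True if the title or abstract matches any language-model keyword."""
    text = f"{title} {abstract}"
    return any(re.search(p, text, flags=re.IGNORECASE) for p in LM_PATTERNS)

# Example: the title alone is inconclusive, but the abstract mentions LLMs.
title = "SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories"
abstract = "Given that Large Language Models (LLMs) have made significant progress in writing code, ..."
print(mentions_language_models(title, abstract))  # True
```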