
New paper: SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research #28

Open
maykcaldas opened this issue Sep 12, 2024 · 0 comments


Paper: SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research

Authors: Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, et al.

Abstract: Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.

Link: https://arxiv.org/abs/2409.07440

Reasoning: We start by examining the title and abstract for any mention of language models. The title mentions "Evaluating Agents" and "Tasks from Research," which does not directly indicate language models. However, the abstract explicitly mentions "Large Language Models (LLMs)" and discusses their capabilities and evaluation. This indicates that the paper is indeed focused on language models.
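A minimal sketch of this kind of title/abstract keyword check, assuming a simple regex heuristic (the helper name and patterns below are illustrative, not the classifier actually used for this issue):

```python
import re

# Hypothetical patterns for deciding whether a paper mentions language models.
LM_PATTERNS = [
    r"\blarge language models?\b",
    r"\bLLMs?\b",
    r"\blanguage models?\b",
]

def mentions_language_models(title: str, abstract: str) -> bool:
    """Return True if the title or abstract mentions language models."""
    text = f"{title} {abstract}"
    return any(re.search(p, text, flags=re.IGNORECASE) for p in LM_PATTERNS)

title = "SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research"
abstract = "Given that Large Language Models (LLMs) have made significant progress ..."
print(mentions_language_models(title, abstract))  # True: the abstract mentions LLMs
```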
