[2310.06770] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? #908

Open
1 task
ShellLM opened this issue Aug 21, 2024 · 1 comment
Labels

AI-Agents: Autonomous AI agents using LLMs
code-generation: code generation models and tools like copilot and aider
Git-Repo: Source code repository like gitlab or gh
github: gh tools like cli, Actions, Issues, Pages
human-verified
llm-benchmarks: testing and benchmarking large language models
llm-evaluation: Evaluating Large Language Models performance and behavior through human-written evaluation sets
MachineLearning: ML Models, Training and Inference
Papers: Research papers
software-engineering: Best practice for software engineering

Comments

ShellLM (Collaborator) commented Aug 21, 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

"Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous."

Comments: Data, code, and leaderboard are available at this URL; ICLR 2024, this URL

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Cite as: arXiv:2310.06770 [cs.CL]

Suggested labels

None

@ShellLM added the AI-Agents, code-generation, Git-Repo, github, llm-benchmarks, llm-evaluation, MachineLearning, Papers, and software-engineering labels on Aug 21, 2024

ShellLM (Collaborator, Author) commented Aug 21, 2024

Related content

#812 similarity score: 0.89
#758 similarity score: 0.89
#333 similarity score: 0.87
#887 similarity score: 0.86
#650 similarity score: 0.85
#309 similarity score: 0.85

@irthomasthomas added the human-verified label on Aug 21, 2024