Add leaderboard
Ray Myers committed Apr 11, 2024
1 parent 95a430f commit bcabd6d
Showing 2 changed files with 45 additions and 0 deletions.
1 change: 1 addition & 0 deletions docusaurus.config.ts
@@ -71,6 +71,7 @@ const config: Config = {
// label: 'Tutorial',
// },
{to: '/blog', label: 'News', position: 'left'},
{to: '/leaderboards', label: 'Leaderboards', position: 'left'},
{
href: 'https://github.com/facebook/docusaurus',
label: 'GitHub',
44 changes: 44 additions & 0 deletions src/pages/leaderboards.md
@@ -0,0 +1,44 @@
---
title: Leaderboards
---

# Leaderboards

## SWE-bench
**The gold standard.** Released in September 2023 by Princeton NLP, SWE-bench is the most widely accepted measure of an agent's ability to resolve real-world GitHub issues in realistic codebases.



*Last checked: 2024-04-10*
| Rank | Agent | Score (full) | Score (lite) | Status | Group | License |
| ---- | -------------------- | ------------ | ------------ | ----------------- | ------------- | ----------- |
| 1 | [auto-code-rover](https://github.com/nus-apr/auto-code-rover) | - | 22.3% | Reported | APR@NUS | GPL-3 |
| 2 | [SWE-agent](https://swe-agent.com/) + GPT-4 | 12.29% | 17% | Official | Princeton NLP | MIT |
| 3 | Devin | 13.48% | - | Reported, sample | Cognition | Proprietary |




An "unassisted" score means the agent is told which files need to be modified.

## LiveCodeBench

[LiveCodeBench](https://livecodebench.github.io/leaderboard.html): "Holistic and Contamination Free Evaluation of Large Language Models for Code"

Tests the strength of models across four coding sub-tasks (a toy illustration of the execution-style tasks follows the list):

* Code Generation
* Self-Repair
* Test Output Prediction
* Code Execution
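
As a rough illustration of what the execution-style sub-tasks ask (this toy item is ours, not drawn from the benchmark), the model is shown a snippet and must predict its output without running it:

```python
# Toy item in the spirit of the execution-style sub-tasks (illustrative only,
# not an actual LiveCodeBench problem): the model sees the function and the
# call and must predict the printed output without executing the code.
def f(s: str) -> str:
    return "".join(ch for ch in s if ch.isalpha())[::-1]

print(f("a1b2c3"))  # expected answer: "cba"
```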

*Last checked: 2024-04-10*
* Proprietary Leaders: GPT-4-Turbo-2024-04-09, Claude-3-Opus
* Open Weight Leaders: [WizardCoder-33B-V1.1](https://huggingface.co/WizardLM/WizardCoder-33B-V1.1), [deepseek-coder-33b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct), [CodeLlama-34b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf)




## HumanEval

[HumanEval on Papers with Code](https://paperswithcode.com/sota/code-generation-on-humaneval): OpenAI's original 164-problem Python function-completion benchmark, introduced alongside Codex in 2021 and usually reported as pass@1.
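
HumanEval numbers are generally pass@k scores. For reference, a minimal sketch of the unbiased estimator from the Codex paper (n samples per problem, c of which pass the unit tests); the leaderboard figures themselves come from the respective papers, not from this snippet:

```python
# Unbiased pass@k estimator from "Evaluating Large Language Models Trained on
# Code" (Chen et al., 2021). Reference sketch only.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=53, k=1))  # 0.265 -> this problem contributes 26.5% to pass@1
```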
