SWE-bench
📣 New benchmark: CodeClash (website, GitHub) evaluates SWE agents on goals, not tasks.
📣 New: Meet mini, the 100-line AI agent that still gets 65% on SWE-bench Verified!
This organization contains the source code for several projects in the SWE-* open source ecosystem, including:
- SWE-bench, a benchmark for evaluating AI systems on real-world GitHub issues (see the data-loading sketch after this list).
- SWE-agent, a system that automatically solves GitHub issues using an LM agent.
- SWE-smith, a toolkit for generating SWE training data at scale.
- mini, an AI agent written in just 100 lines of code that scores >70% on SWE-bench Verified.
Also check out the supporting infrastructure for working with SWE-* projects.
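For a concrete sense of what a SWE-bench instance looks like, here is a minimal sketch of loading the Verified split from the Hugging Face Hub. The dataset name (princeton-nlp/SWE-bench_Verified) and the field names shown are assumptions drawn from the public dataset release, not something stated on this page:

```python
# Minimal sketch, assuming the public princeton-nlp/SWE-bench_Verified
# dataset on the Hugging Face Hub; field names may vary across releases.
from datasets import load_dataset

# SWE-bench Verified ships its human-validated instances as a "test" split.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["instance_id"])              # unique task ID, e.g. "<org>__<repo>-<number>"
print(example["repo"])                     # the source GitHub repository
print(example["problem_statement"][:200])  # the issue text an agent must resolve
```

Each instance pairs a repository snapshot with a real GitHub issue, so an agent is judged on whether its patch actually resolves the issue under the benchmark's tests.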
Pinned
- experiments (Public): Open-sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
Repositories
- SWE-bench (Public): A benchmark for evaluating AI systems on real-world GitHub issues.
- SWE-smith-envs (Public): Artifacts for building environments (Docker images) for repositories represented in SWE-smith.
- experiments (Public): Open-sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
- reading-list (Public)
- .github (Public)
- humanevalfix-results (Public archive): Evaluation data + results for SWE-agent inference on the HumanEvalFix task.