# scicode-bench/SciCode

A benchmark that challenges language models to code solutions for scientific problems.
This repo contains the evaluation code for the paper "SciCode: A Research Coding Benchmark Curated by Scientists"
[2025-02-17]: SciCode benchmark is available at HuggingFace Datasets!
[2025-02-01]: Results for DeepSeek-R1, DeepSeek-V3, and OpenAI o3-mini are added.
[2025-01-24]: SciCode has been integrated with inspect_ai for easier and faster model evaluations.
[2024-11-04]: Leaderboard is on! Check here. We have also added Claude Sonnet 3.5 (new) results.
[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.
[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.
[2024-08-22]: The SciCode benchmark has been successfully integrated into OpenCompass.
[2024-07-24]: We added the scientist-annotated background and support for the with-background evaluation setup.
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of 16 subdomains from 6 domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information, as well as scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only 7.7% of the problems in the most realistic setting. Broadly, SciCode reflects scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress toward becoming helpful assistants for scientists but also helps shed light on the future building and evaluation of scientific AI.
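As described above, each main problem factorizes into subproblems, and a main problem counts as resolved only when every one of its subproblems passes. A minimal sketch of that scoring rule (the data layout here is hypothetical, not SciCode's actual storage format):

```python
# Sketch of SciCode-style scoring: a main problem is resolved only if
# every one of its subproblems passes its tests.

def resolve_rates(results: dict[str, list[bool]]) -> tuple[float, float]:
    """results maps a main-problem id to pass/fail flags for its subproblems.

    Returns (main-problem resolve rate, subproblem resolve rate), in percent.
    """
    sub_flags = [passed for subs in results.values() for passed in subs]
    main_flags = [all(subs) for subs in results.values()]
    return (100 * sum(main_flags) / len(main_flags),
            100 * sum(sub_flags) / len(sub_flags))

# Hypothetical results for two main problems:
example = {
    "prob_1": [True, True, True],   # all subproblems pass -> main resolved
    "prob_2": [True, False, True],  # one failure -> main not resolved
}
main_rate, sub_rate = resolve_rates(example)  # 50.0 main, ~83.3 subproblem
```

This all-or-nothing rule is why the main-problem resolve rates in the leaderboard are so much lower than the subproblem rates.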
SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on (1) numerical methods, (2) simulation of systems, and (3) scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test an LM's science capability.
Models | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
---|---|---|
🥇 OpenAI o3-mini-low | 10.8 | 33.3 |
🥈 OpenAI o3-mini-high | 9.2 | 34.4 |
🥉 OpenAI o3-mini-medium | 9.2 | 33.0 |
OpenAI o1-preview | 7.7 | 28.5 |
Deepseek-R1 | 4.6 | 28.5 |
Claude3.5-Sonnet | 4.6 | 26.0 |
Claude3.5-Sonnet (new) | 4.6 | 25.3 |
Deepseek-v3 | 3.1 | 23.7 |
Deepseek-Coder-v2 | 3.1 | 21.2 |
GPT-4o | 1.5 | 25.0 |
GPT-4-Turbo | 1.5 | 22.9 |
OpenAI o1-mini | 1.5 | 22.2 |
Gemini 1.5 Pro | 1.5 | 21.9 |
Claude3-Opus | 1.5 | 21.5 |
Llama-3.1-405B-Chat | 1.5 | 19.8 |
Claude3-Sonnet | 1.5 | 17.0 |
Qwen2-72B-Instruct | 1.5 | 17.0 |
Llama-3.1-70B-Chat | 0.0 | 17.0 |
Mixtral-8x22B-Instruct | 0.0 | 16.3 |
Llama-3-70B-Chat | 0.0 | 14.6 |
SciCode has been integrated with inspect_ai for easier and faster model evaluation. Run the following steps:
- Clone this repository: `git clone git@github.com:scicode-bench/SciCode.git`
- Install the `scicode` package with `pip install -e .`
- Download the numeric test results and save them as `./eval/data/test_data.h5`
- Go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following commands:

```bash
cd eval/inspect_ai
export OPENAI_API_KEY=your-openai-api-key
inspect eval scicode.py --model openai/gpt-4o --temperature 0
```
💡 For more detailed information on using `inspect_ai`, see the `eval/inspect_ai` readme.
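The `test_data.h5` file downloaded above holds the gold numeric targets that generated code is checked against. The real harness lives under `eval/`, but the core idea of tolerance-based numeric comparison can be sketched as follows (the helper and the hard-coded values are illustrative, not SciCode's actual API or data):

```python
import math

# Illustrative only: SciCode's harness reads gold targets from
# ./eval/data/test_data.h5; here they are hard-coded for the sketch.
def all_close(got, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Element-wise tolerance comparison of two numeric sequences."""
    return len(got) == len(expected) and all(
        math.isclose(g, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for g, e in zip(got, expected)
    )

gold = [0.0, 1.0, 4.0, 9.0]                # stand-in for stored targets
generated = [x ** 2 for x in range(4)]     # stand-in for generated code's output
ok = all_close(generated, gold)            # passes the numeric check
```

Tolerance-based comparison matters here because floating-point results from simulations rarely match gold values bit-for-bit.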
Note that this is a deprecated way of evaluating models; using `inspect_ai` is the recommended way. Please use this method only if `inspect_ai` does not work for your needs. Run the first three steps in the above section, then run the following two commands:

- Run `eval/scripts/gencode.py` to generate new model outputs (see the `eval/scripts` readme for more information)
- Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests
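The two-script pipeline above boils down to: save each model completion, then execute it together with its unit tests and record whether the run succeeds. A minimal stand-in for that second step (the snippets and pass/fail convention are made up for illustration; the real scripts do more):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_with_tests(generated_code: str, test_code: str) -> bool:
    """Execute generated code plus its tests in a fresh interpreter.

    A zero exit status counts as a pass, mirroring the split between
    generating outputs and testing them in the two scripts above.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n" + test_code)
        proc = subprocess.run([sys.executable, str(script)],
                              capture_output=True, timeout=60)
        return proc.returncode == 0

# Hypothetical completions: one correct, one buggy.
good = run_with_tests("def square(x):\n    return x * x",
                      "assert square(3) == 9")
bad = run_with_tests("def square(x):\n    return x + x",
                     "assert square(3) == 9")
```

Running each candidate in a separate interpreter process keeps a crashing or hanging completion from taking down the whole evaluation run.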
More information, including a FAQ section, is provided on our website. If you have trouble reaching the website, please find the markdown source in its GitHub repository.
- Minyang Tian: mtian8@illinois.edu
- Eliu Huerta: elihu@anl.gov
- Hao Peng: haopeng@illinois.edu
```bibtex
@misc{tian2024scicode,
      title={SciCode: A Research Coding Benchmark Curated by Scientists},
      author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},
      year={2024},
      eprint={2407.13168},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```