KodCode-AI/kodcode (Public)


✨ A synthetic dataset generation framework that produces diverse coding questions and verifiable solutions, all in one framework

License

Apache-2.0 license

Folders and files

demo
paper
pipeline
seeds
trainer
.gitignore
.gitmodules
.python-version
Dockerfile.test
LICENSE
README.md
main.py
pyproject.toml
requirements.txt
uv.lock

🐱 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

KodCode is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and RL tuning.

🏆 KodCode has been accepted to ACL 2025 and received the Best Paper Award at DataWorld @ ICML 2025!

🕸️ Project Website - Discover the reasoning behind the name KodCode 🤨
📄 Technical Report - Discover the methodology and technical details behind KodCode
💾 GitHub Repo - Access the complete pipeline used to produce KodCode V1
🤗 HF Datasets: KodCode-V1 (for RL); KodCode-V1-SFT-R1 (for SFT)
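
The two dataset variants listed above could be loaded with the Hugging Face `datasets` library roughly as follows. This is a minimal sketch, not the project's documented loading code: the Hub namespace `KodCode`, the `train` split name, and the helper names `dataset_repo_id` and `load_kodcode` are assumptions made for illustration; only the dataset names themselves come from the list above.

```python
# Dataset names as given in the README, keyed by intended use.
DATASET_BY_PURPOSE = {
    "rl": "KodCode-V1",
    "sft": "KodCode-V1-SFT-R1",
}

# ASSUMPTION: the Hub namespace is not stated in the README; adjust if needed.
HUB_NAMESPACE = "KodCode"


def dataset_repo_id(purpose: str, namespace: str = HUB_NAMESPACE) -> str:
    """Map 'rl' or 'sft' (case-insensitive) to a '<namespace>/<dataset>' id."""
    key = purpose.lower()
    if key not in DATASET_BY_PURPOSE:
        raise ValueError(f"purpose must be one of {sorted(DATASET_BY_PURPOSE)}")
    return f"{namespace}/{DATASET_BY_PURPOSE[key]}"


def load_kodcode(purpose: str, split: str = "train"):
    """Load one variant; requires `pip install datasets` and network access.

    ASSUMPTION: a split named 'train' exists for both variants.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset(dataset_repo_id(purpose), split=split)
```

Calling `load_kodcode("rl")` would then fetch the RL variant, provided the assumed namespace and split are correct.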

Overview

Features

KodCode is a comprehensive pipeline designed to generate diverse, challenging, and verifiable synthetic datasets for coding tasks. Key features include:

Diverse Sources: Generate high-quality coding questions from multiple sources, including zero-shot generation, human-written assessment questions, code snippets, and technical documentation, all unified in a single framework!
Self-Verification: Generate verifiable solutions and tests for each coding question. Supports pytest and parallel execution.
Style Converter: Easily convert between different styles of coding questions.
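
To illustrate the self-verification idea, the sketch below writes a candidate solution and its tests to a temporary directory, runs pytest on them, and checks several pairs concurrently. It is not the repository's pipeline: the function names `passes_tests` and `verify_many`, the file names, and the timeout are hypothetical, and it uses Python's `ThreadPoolExecutor` in place of whatever parallel runner the project actually uses.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Default: run the test file through pytest (pytest must be installed).
PYTEST_ARGS = ("-m", "pytest", "-q")


def passes_tests(solution_code, test_code, runner_args=PYTEST_ARGS, timeout=60.0):
    """Write the pair to a temp dir, run the tests there, return True on exit code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names; tests are expected to `from solution import ...`.
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, *runner_args, "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hanging solution as a failure
        return result.returncode == 0


def verify_many(pairs, runner_args=PYTEST_ARGS, workers=4):
    """Check several (solution, test) pairs concurrently; result order matches input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: passes_tests(p[0], p[1], runner_args), pairs))
```

A question whose generated solution fails its own tests would come back as `False` and could be filtered out or regenerated. Threads are sufficient here because the actual work happens in child processes.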

Installation

1. Build code generation environment

Option 1: conda

git clone https://github.com/KodCode-AI/kodcode.git
cd kodcode
conda create -n kodcode python=3.10 -y
conda activate kodcode
pip install -r requirements.txt

Option 2: uv

git clone https://github.com/KodCode-AI/kodcode.git
cd kodcode
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

2. Build code execution environment

Option 1: Local

To run unit tests in parallel, you also need to install parallel. For example, on Ubuntu you can install parallel with:

sudo apt-get install parallel

Option 2: Docker

Please install the NVIDIA Container Toolkit first to enable GPU support.

We provide an off-the-shelf Docker image for running tests:

docker pull zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1
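
Once the image is pulled, tests could be run inside a container along these lines. This is only a sketch under stated assumptions: the README does not document how the image is meant to be invoked, so the mount point `/workspace`, the `python -m pytest -q` command, and the helper name `build_docker_test_command` are hypothetical. Only the image tag and the standard `docker run` flags (`--rm`, `--gpus`, `-v`, `-w`) are known.

```python
# Image tag taken from the pull command in the README.
IMAGE = "zcxu/kodcode-test-environment:python3.10-cuda12.4-v0.1"


def build_docker_test_command(host_dir, use_gpu=True, mount_point="/workspace"):
    """Build a `docker run` argument list that runs pytest on a mounted directory.

    ASSUMPTIONS: the container exposes `python` with pytest installed, and
    `mount_point` is an arbitrary path chosen for this example.
    """
    cmd = ["docker", "run", "--rm"]
    if use_gpu:
        # Requires the NVIDIA Container Toolkit on the host.
        cmd += ["--gpus", "all"]
    cmd += [
        "-v", f"{host_dir}:{mount_point}",  # share the test files with the container
        "-w", mount_point,                  # run from the mounted directory
        IMAGE,
        "python", "-m", "pytest", "-q",
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=False)`; building it as a list rather than a shell string avoids quoting problems with paths that contain spaces.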

Generate KodCode

Please refer to the pipeline for details.

Training

SFT: We used Llama-Factory to train the SFT checkpoint.

RL: Please refer to code-r1, which is based on verl, for RL training using KodCode datasets.

TODO

Repo Update

One-line command to generate KodCode
Integrate the test pipeline (i.e., pytest) for RL training. -> Supported in the forked code-r1 with the latest verl! Thank you Jiawei @ganler!!!
Implement dockerized execution for unit tests

Data Update

KodCode-Lite with 10K samples for light-weight RL training
KodCode-V1.1: Support stdin format with ~50K additional samples

🧐 Other Information

License: Please follow CC BY-NC 4.0.

Contact: For questions, suggestions, or feedback, please reach out to Zhangchen Xu, or raise an issue. We welcome your input and are committed to continuously improving KodCode to better serve the community.

📚 Citation

If you find the model, data, or code useful, please cite:

@article{xu2025kodcode,
  title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
  author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
  year={2025},
  eprint={2503.02951},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2503.02951},
}

About

kodcode-ai.github.io/
