Kodezi Chronos is a debugging-first language model that achieves state-of-the-art results on SWE-bench Lite (80.33%) and 67% real-world fix accuracy, over six times better than GPT-4. Built with Adaptive Graph-Guided Retrieval and Persistent Debug Memory. Model available Q1 2026 via Kodezi OS.


Introducing Kodezi Chronos-1

The World's First Debugging-First Language Model for Repository-Scale Code Understanding

arXiv · Model Access · Research · Benchmark · Leaderboard

Performance Badges

SWE-bench Lite · Debug Success Rate · Human Preference · Improvement over GPT-4.1 · Time Reduction

Key Achievements

  • 80.33% SWE-bench Lite
  • 67.3% Autonomous Debugging
  • 89% Human Preference
  • 40% Time Reduction

Chronos Architecture


Model Access Notice

Chronos is proprietary and available exclusively through Kodezi OS

| Timeline | Access | Details |
|---|---|---|
| Q4 2025 | Beta | Limited enterprise access |
| Q1 2026 | GA | Via Kodezi OS |

This repository contains the research paper, benchmarks, and evaluation results only.

Get Early Access · Read Paper · View Leaderboard · Documentation


🏅 State-of-the-Art Results

📈 SWE-bench Lite Performance

Industry-Standard Benchmark Results

| Rank | System | Success Rate | Instances | Lead | Year |
|---|---|---|---|---|---|
| 1 | Kodezi Chronos | 80.33% | 241/300 | +20.0pp | 2025 |
| 2 | ExpeRepair-v1.0 + Claude 4.5 Sonnet | 60.33% | 181/300 | - | 2025 |
| 3 | Claude 4.5 Sonnet (Bash Only) | ~14% | ~42/300 | -66.3pp | 2025 |
| 4 | Claude 4.1 Opus (Bash Only) | 14.2% | 43/300 | -66.1pp | 2025 |
| 5 | GPT-4.1 | 13.8% | 41/300 | -66.5pp | 2025 |
| 6 | Gemini 2.0 Pro | 13.4% | 40/300 | -67.0pp | 2025 |

20 percentage point absolute lead over second place

The Debugging Gap

General-Purpose Models: Code Generation vs Debugging Performance

| Model | SWE-bench Full (Code Gen) | SWE-bench Lite (Debugging) | Performance Gap |
|---|---|---|---|
| Claude 4.5 Sonnet | 72.7% | ~14% | -58.7pp |
| Claude 4.1 Opus | 72.5% | 14.2% | -58.3pp |
| Claude 4.1 Opus (Bash) | 67.60% | 14.2% | -53.4pp |
| GPT-4.1 | 54.6% | 13.8% | -40.8pp |
| Kodezi Chronos | N/A | 80.33% | Specialized |

Key Insight: Even models achieving 70%+ on code generation drop below 15% on debugging tasks, a gap of more than 50 percentage points. Chronos, purpose-built for debugging, achieves 80.33%, demonstrating that debugging requires specialized architectures, not just larger context windows.

Repository-Specific Results

SWE-bench Lite: Domain-Specific Performance

| Repository | Domain | Chronos Success | Instances | Significance |
|---|---|---|---|---|
| sympy | Symbolic Mathematics | 96.1% | 51/53 | Near-perfect mathematical reasoning |
| sphinx | Documentation Systems | 93.8% | 60/64 | Exceptional doc generation bugs |
| django | Web Frameworks | 90.4% | 104/115 | Complex framework debugging |
| Overall | Mixed Domains | 80.33% | 241/300 | State-of-the-art |

🔬 MRR Benchmark Results

📊 Overall Performance (5,000 Multi-Random Retrieval Scenarios - Sample Dataset of 500 Available)

| Metric | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Improvement |
|---|---|---|---|---|---|
| Debug Success Rate | 67.3% ± 2.1% | 13.8% | 14.2% | 15.0% | 4.5x |
| Root Cause Accuracy | 89%* | 12.3% ± 1.8% | 11.7% ± 2.0% | 15.8% ± 1.5% | 5.6-7.6x |
| Retrieval Precision | 92%* | 68% ± 2.3% | 67% ± 2.4% | 74% ± 1.8% | 1.2-1.4x |
| Retrieval Recall | 85% | 32% ± 2.1% | 34% ± 2.0% | 42% ± 1.9% | 2.0-2.7x |
| Avg Fix Iterations | 7.8 | 1-2 | 1-2 | 1-2 | More thorough |
| Time Reduction | 40% | - | - | - | 40% faster |

\*p < 0.001 compared to best baseline (two-tailed t-test, n=5,000) • Sample dataset (n=500) available now; full benchmark Q1 2026

🐛 Performance by Bug Category

| Bug Category | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Chronos Advantage |
|---|---|---|---|---|---|
| Syntax Errors | 94.2% | 82.3% | 79.8% | 85.1% | 1.1x |
| Logic Bugs | 72.8% | 12.1% | 10.7% | 15.3% | 6.0x |
| Concurrency Issues | 58.3% | 3.2% | 2.8% | 4.1% | 18.2x |
| Memory Problems | 61.7% | 5.7% | 4.3% | 6.9% | 10.8x |
| API Misuse | 79.1% | 18.9% | 16.2% | 22.4% | 4.2x |
| Performance Bugs | 65.4% | 7.4% | 6.1% | 9.8% | 8.8x |

📏 Repository Scale Performance

| Repository Size | Chronos Success | Best Baseline | Baseline Model | Improvement |
|---|---|---|---|---|
| <10K LOC | 71.2% ± 2.8% | 21.3% ± 3.5% | Gemini 2.0 Pro | 3.3x |
| 10K-100K LOC | 68.9% ± 2.5% | 14.7% ± 3.2% | Gemini 2.0 Pro | 4.7x |
| 100K-1M LOC | 64.3% ± 2.9% | 8.9% ± 2.8% | Gemini 2.0 Pro | 7.2x |
| >1M LOC | 59.7% ± 3.1% | 3.8% ± 1.9% | Gemini 2.0 Pro | 15.7x |

💡 Key Innovations

1. Debugging-First Architecture

  • Trained on 42.5M real debugging examples (not code completion)
  • Specialized for root cause analysis and multi-file patches
  • 89% root cause accuracy vs 15.8% best baseline
  • 7-layer architecture optimized for debugging workflows

2. Persistent Debug Memory (PDM)

  • Repository-specific learning from 15M+ debugging sessions
  • Improves from 35% → 65% success rate over time
  • Cross-session pattern recognition and learning
  • 87% cache hit rate for similar bugs
  • Temporal pattern learning across project lifecycles

3. Adaptive Graph-Guided Retrieval (AGR)

  • O(k log d) complexity with dynamic k-hop expansion
  • 92% precision, 85% recall on multi-file context
  • Handles unlimited repository scale intelligently
  • Multi-hop traversal with confidence-based termination
  • 3.8x faster than traditional retrieval methods
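The actual AGR algorithm is documented in `architecture/AGR_ALGORITHM.md`; the sketch below only illustrates the general shape of dynamic k-hop expansion with confidence-based termination (the scoring function and threshold are invented for the example, and no claim is made about the O(k log d) bound):

```python
from collections import deque

def adaptive_retrieve(graph, seeds, relevance, threshold=0.9, max_hops=3):
    """Illustrative k-hop expansion: grow the retrieved set one hop at a
    time, stopping early once accumulated relevance clears a threshold."""
    retrieved, frontier = set(seeds), deque(seeds)
    confidence = sum(relevance.get(n, 0.0) for n in seeds)
    for _hop in range(max_hops):
        if confidence >= threshold:      # confidence-based termination
            break
        next_frontier = deque()
        while frontier:
            node = frontier.popleft()
            for nbr in graph.get(node, []):
                if nbr not in retrieved:
                    retrieved.add(nbr)
                    confidence += relevance.get(nbr, 0.0)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return retrieved

# Tiny code graph: files as nodes, edges for imports/calls.
graph = {"bug.py": ["utils.py"], "utils.py": ["config.py"], "config.py": []}
scores = {"bug.py": 0.5, "utils.py": 0.3, "config.py": 0.2}
print(sorted(adaptive_retrieve(graph, ["bug.py"], scores)))
# ['bug.py', 'config.py', 'utils.py']
```

Because expansion stops as soon as confidence is high enough, a simple bug with a highly relevant seed file terminates at one hop, while a tangled dependency chain keeps expanding — the "adaptive" part of the name.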

4. Output-Optimized Design

  • Optimized for ~3K output tokens (fixes, tests, docs)
  • 47.2% output entropy density vs 12.8% for completion models
  • Designed for complex patch generation
  • Template-aware generation for consistency
  • Confidence-guided output strategy

5. Autonomous Debugging Loop

  • Average 7.8 iterations to successful fix
  • Propose → Test → Analyze → Refine cycles
  • 67.3% fully autonomous success rate
  • Execution sandbox with real-time feedback
  • Iterative refinement until validation succeeds
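The orchestration itself is proprietary, but the propose → test → analyze → refine cycle can be sketched generically (all callables below are stand-ins supplied by the caller, not Chronos APIs):

```python
def debug_loop(bug, propose, run_tests, refine, max_iters=10):
    """Generic propose -> test -> analyze -> refine loop. Returns the
    validated fix and the iteration count, or (None, max_iters) on failure."""
    fix = propose(bug)
    for iteration in range(1, max_iters + 1):
        passed, feedback = run_tests(fix)   # "test": execute in a sandbox
        if passed:
            return fix, iteration           # validated fix found
        fix = refine(fix, feedback)         # "analyze + refine" on feedback
    return None, max_iters                  # budget exhausted

# Toy usage: the candidate fix converges after a few refinements.
result, iters = debug_loop(
    bug="off-by-one",
    propose=lambda b: 0,
    run_tests=lambda f: (f >= 3, f"fix strength {f} too weak"),
    refine=lambda f, fb: f + 1,
)
print(result, iters)  # 3 4
```

The key property is that test feedback flows back into the next proposal; loops that stop after one or two attempts (the 1-2 iteration average reported for the baselines) never get this compounding benefit.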

🏗️ Architecture

Seven-Layer System Design

```
┌─────────────────────────────────────────────┐
│   7. Explainability Layer                   │  Human-readable root cause analysis
├─────────────────────────────────────────────┤
│   6. Execution Sandbox                      │  Isolated test validation
├─────────────────────────────────────────────┤
│   5. Persistent Debug Memory (PDM)          │  Repository-specific learning
├─────────────────────────────────────────────┤
│   4. Orchestration Controller               │  Autonomous debugging loop
├─────────────────────────────────────────────┤
│   3. Debug-Tuned LLM Core                   │  42.5M debugging examples
├─────────────────────────────────────────────┤
│   2. Adaptive Retrieval Engine (AGR)        │  Dynamic k-hop graph traversal
├─────────────────────────────────────────────┤
│   1. Multi-Source Input Layer               │  Code, logs, traces, tests, docs
└─────────────────────────────────────────────┘
```

Layer Descriptions

  1. Multi-Source Input Layer: Processes code, logs, traces, tests, docs simultaneously
  2. Adaptive Retrieval Engine (AGR): Dynamic k-hop graph traversal (92% precision)
  3. Debug-Tuned LLM Core: 42.5M debugging examples, not code completion
  4. Orchestration Controller: Autonomous debugging loop management
  5. Persistent Debug Memory (PDM): Repository-specific learning (35% → 65% improvement)
  6. Execution Sandbox: Isolated test validation environment
  7. Explainability Layer: Human-readable root cause analysis

View Detailed Architecture Documentation →


🧪 Benchmarks & Evaluation

📋 Available Benchmarks

| Benchmark | Type | Instances | Purpose | Results |
|---|---|---|---|---|
| SWE-bench Lite | Industry Standard | 300 | Real-world debugging | 80.33% |
| MRR Benchmark | Custom | 5,000 (500 sample) | Multi-random retrieval | 67.3% |
| Repository Scale | Custom | Varied | Large codebase testing | 59.7-71.2% |
| Bug Categories | Custom | 4,400+ | Bug type specialization | 58.3-94.2% |

🏆 SWE-bench Lite Evaluation Results

View Complete SWE-bench Lite Submission →

The evaluation directory contains:

  • README.md: Detailed submission results and methodology
  • metadata.yaml: Submission metadata and configuration
  • all_preds.jsonl: All 300 instance predictions
  • Kodezi Chronos-1.hybrid_eval.json: Complete evaluation metrics
  • logs/: Execution logs for all instances
  • results/: Per-instance results and analysis
  • trajs/: Debugging trajectories and fix attempts

🎯 Multi-Random Retrieval (MRR) Benchmark

MRR simulates real-world debugging complexity:

  • Spatial Distribution: Bug context scattered across 10-50 files
  • Temporal Dispersion: Relevant information from 3-12 months of history
  • Obfuscation Levels: Low/medium/high code complexity
  • 5,000 Scenarios: Comprehensive evaluation across languages (sample dataset of 500 available now, full benchmark Q1 2026)

| Metric | Chronos | GPT-4.1+RAG | Claude 4.1+VectorDB | Gemini 2.0+Graph |
|---|---|---|---|---|
| Precision@10 | 92% | 42.3% | 48.1% | 51.7% |
| Recall@10 | 85% | 31.7% | 36.2% | 41.8% |
| Fix Accuracy | 67.3% | 8.9% | 11.2% | 14.6% |
| Context Efficiency | 0.71 | 0.23 | 0.28 | 0.31 |
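Precision@10 and Recall@10 are the standard retrieval metrics; for reference, they can be computed like this (the file names and ground-truth set are illustrative only):

```python
def precision_recall_at_k(retrieved, relevant, k=10):
    """Precision@k: fraction of the top-k retrieved items that are relevant.
    Recall@k: fraction of all relevant items that appear in the top-k."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k), hits / len(relevant)

retrieved = ["a.py", "b.py", "c.py", "d.py"]   # ranked retrieval output
relevant = {"a.py", "c.py", "e.py"}            # ground-truth bug context
p, r = precision_recall_at_k(retrieved, relevant, k=4)
print(round(p, 2), round(r, 2))  # 0.5 0.67
```

High precision with lower recall (92% vs 85% for Chronos above) means the retrieved context is rarely wrong but can still miss some relevant files scattered across the repository.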

View Complete Benchmark Documentation →


📚 Research Paper

Published Research

Title: Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding

Authors: Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel, Yousuf Zaii

Institution: Kodezi Inc.

Publication: arXiv:2507.12482 (2025)

Paper Resources

| Resource | Description | Link |
|---|---|---|
| arXiv Paper | Official publication | View |
| Full Paper (Markdown) | Complete paper in markdown | View |
| 2025 Updates | Latest research findings | View |
| Abstract | Executive summary | View |
| Methodology | Research methodology | View |
| Related Work | Literature review | View |
| Future Work | Research directions | View |

Key Contributions

  1. Debugging-Specific Architecture: First LM trained specifically on debugging workflows (42.5M examples)
  2. Adaptive Graph-Guided Retrieval (AGR): Novel multi-hop retrieval with O(k log d) complexity
  3. Persistent Debug Memory (PDM): Cross-session learning system for repository-specific patterns
  4. Comprehensive Evaluation: 12,500 real-world bugs across multiple benchmarks
  5. State-of-the-Art Results: 80.33% on SWE-bench Lite (20pp lead over second place)

🚀 Getting Started

Prerequisites

```bash
# Python 3.8+ required
python --version

# Git for cloning
git --version
```

Quick Start: Running Benchmarks

```bash
# Clone the repository
git clone https://github.com/kodezi/chronos-research.git
cd chronos-research

# Install dependencies
pip install -r requirements.txt

# Run MRR benchmark on your model
python benchmarks/run_mrr_benchmark_2025.py \
  --model your_model \
  --scenarios 100   # Start with subset for testing

# Run full sample evaluation (500 scenarios available)
python benchmarks/run_mrr_benchmark_2025.py \
  --model your_model \
  --scenarios 500

# Analyze results
python benchmarks/analyze_results.py \
  --results_dir results/your_model
```

Model Access

The Chronos model is NOT included in this repository

This repository contains:

  • Research paper and documentation
  • Benchmark suite and evaluation framework
  • Performance results and analysis
  • Chronos model (proprietary - NOT included)

To access Chronos model:

| Access Method | Availability | Details |
|---|---|---|
| Kodezi OS | Q4 2025 (Beta) | Enterprise beta access |
| Kodezi OS | Q1 2026 (GA) | General availability |
| API Access | Q1 2026 | API endpoints |

Join Waitlist → | Contact Sales →


📁 Repository Structure

```
chronos-research/
│
├── benchmarks/                    # Benchmark Suite
│   ├── multi-random-retrieval/      # 5,000 scenario MRR benchmark (500 sample available)
│   ├── comprehensive_benchmarks/    # Extended test scenarios
│   ├── debug_categories/            # Bug type categorization (6 types)
│   ├── evaluation_metrics/          # Custom metrics implementation
│   ├── run_mrr_benchmark_2025.py    # Main benchmark runner
│   └── analyze_results.py           # Results analysis tools
│
├── evaluation/                    # Evaluation Results
│   └── lite/                        # SWE-bench Lite results (80.33%)
│       └── 20251111_kodezi_chronos_1/  # Official submission
│           ├── all_preds.jsonl      # All 300 predictions
│           ├── logs/                # 300+ execution logs
│           ├── results/             # Per-instance results
│           └── trajs/               # Debugging trajectories
│
├── paper/                         # Research Paper
│   ├── chronos-research.md          # Full paper (arXiv:2507.12482)
│   ├── chronos-research-2025.md     # 2025 updates
│   ├── abstract.md                  # Executive summary
│   ├── methodology.md               # Research methodology
│   └── figures/                     # Visualizations
│
├── architecture/                  # Architecture Documentation
│   ├── README.md                    # Architecture overview
│   ├── AGR_ALGORITHM.md             # Adaptive Graph-Guided Retrieval
│   ├── memory_engine.md             # Persistent Debug Memory (PDM)
│   └── debugging_loop.md            # Autonomous loop design
│
├── results/                       # Performance Data
│   ├── figures/                     # 15+ SVG visualizations
│   ├── ablation_studies/            # Component impact analysis
│   ├── case_studies/                # Real-world debugging examples
│   └── raw_data/                    # Benchmark outputs (CSV/JSON)
│
├── reference_implementations/     # Algorithm Reference Code
│   ├── algorithms/                  # AGR, PDM reference implementations
│   └── NOTICE.md                    # Proprietary notice
│
├── docs/                          # Documentation
│   ├── getting_started.md           # Quick start guide
│   ├── API_DOCUMENTATION.md         # API reference (Q1 2026)
│   ├── faq.md                       # Frequently asked questions
│   └── limitations.md               # Known constraints
│
├── LEADERBOARD.md                 # Performance rankings
├── CITATION.cff                   # Citation information (BibTeX)
├── CONTRIBUTING.md                # Contribution guidelines
├── LICENSE                        # MIT License + proprietary notice
└── requirements.txt               # Python dependencies
```

Key Directories:

  • benchmarks/: 5,000 scenario MRR benchmark (500 sample available), multi-language support, automated evaluation
  • evaluation/: SWE-bench Lite results (80.33%, 241/300 instances)
  • paper/: Complete research paper and documentation (arXiv:2507.12482)
  • architecture/: 7-layer system design, AGR/PDM documentation
  • results/: 12,500+ bug resolutions, visualizations, statistical analysis
  • reference_implementations/: Algorithm reference code (NOT the actual model)

🔬 Research Highlights

Training Dataset Composition

| Data Source | Volume | Description |
|---|---|---|
| Debugging Examples | 42.5M | Complete debugging workflows |
| GitHub Issues | 15M | Issues with verified fixes |
| Stack Traces | 8M | Error traces with resolutions |
| CI/CD Logs | 3M | Build and deployment debugging |
| Production Sessions | 2.5M | Real-world production bugs |
| Curated Benchmarks | 14M | Defects4J, SWE-bench, BugsInPy |

Total Training Data: 42.5M debugging-specific examples (not code completion)

AGR Performance by Depth

| Retrieval Strategy | Success Rate | Avg Time (s) | Use Case |
|---|---|---|---|
| k=1 hop | 58.2% | 12.3 | Simple bugs |
| k=2 hops | 72.4% | 18.7 | Multi-file bugs |
| k=3 hops | 83.1% | 24.5 | Complex dependencies |
| k=adaptive | 87.1% | 23.4 | Optimal strategy |
| Flat retrieval | 23.4% | 45.2 | Baseline comparison |

PDM Learning Curve

| Sessions | Success Rate | Token Efficiency | Memory Size |
|---|---|---|---|
| Initial | 35% | 1.0x | 0 GB |
| 100 sessions | 52% | 3.2x | 2.1 GB |
| 500 sessions | 65% | 7.3x | 8.7 GB |
| 1000+ sessions | 67% | 8.1x | 15.2 GB |

Key Insight: PDM enables continuous improvement through cross-session learning


📊 Detailed Performance

Language-Specific Performance

| Language | Chronos | GPT-4.1 | Claude 4.1 Opus | Gemini 2.0 Pro | Test Cases |
|---|---|---|---|---|---|
| Python | 68.7% ± 2.1% | 11.2% ± 2.8% | 10.3% ± 2.9% | 14.6% ± 2.6% | 1,823 bugs |
| JavaScript | 64.2% ± 2.3% | 7.8% ± 2.5% | 6.9% ± 2.6% | 10.1% ± 2.4% | 1,547 bugs |
| Java | 63.9% ± 2.2% | 6.3% ± 2.2% | 5.7% ± 2.3% | 9.2% ± 2.1% | 1,630 bugs |
| Go | 66.8% ± 2.4% | 9.1% ± 2.6% | 8.4% ± 2.7% | 12.3% ± 2.5% | 892 bugs |
| C++ | 61.2% ± 2.6% | 5.2% ± 2.1% | 4.8% ± 2.2% | 7.9% ± 2.0% | 1,108 bugs |
| Rust | 59.8% ± 2.7% | 4.1% ± 1.9% | 3.7% ± 2.0% | 6.3% ± 1.8% | 687 bugs |

Debugging Cycle Efficiency

| Iteration | Chronos Success | GPT-4.1 Success | Time Saved | Cumulative |
|---|---|---|---|---|
| 1st Attempt | 42.3% | 3.2% | -87% | 42.3% |
| 2nd Attempt | +16.4% (58.7%) | +1.9% (5.1%) | -83% | 58.7% |
| 3rd Attempt | +6.6% (65.3%) | +1.7% (6.8%) | -79% | 65.3% |
| 4th+ Attempts | +2.0% (67.3%) | +1.7% (8.5%) | -74% | 67.3% |

Note: Chronos performs more thorough iterations (7.8 avg) vs competitors (1-2 avg)
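As a sanity check on the table, the cumulative column is simply the running sum of the per-attempt gains:

```python
# Per-attempt gains for Chronos from the debugging cycle table above.
gains = [42.3, 16.4, 6.6, 2.0]

cumulative = []
total = 0.0
for g in gains:
    total = round(total + g, 1)   # round to one decimal, as reported
    cumulative.append(total)

print(cumulative)  # [42.3, 58.7, 65.3, 67.3]
```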

Context Window Efficiency

| Model | Context Size | Debug Success | Cost per Bug | Note |
|---|---|---|---|---|
| GPT-4.1 (32K) | 32K tokens | 7.2% | $5.53 | More context ≠ better debugging |
| Claude 4.1 (200K) | 200K tokens | 9.8% | $4.89 | Attention dilution at scale |
| Gemini 2.0 Pro (1M) | 1M tokens | 14.3% | $4.25 | Best traditional model |
| Chronos | Unlimited* | 71.2% | $1.36 | *Via intelligent retrieval |

Ablation Studies

| Configuration | Debug Success | Precision | Recall | Impact |
|---|---|---|---|---|
| Full Chronos | 67.3% | 92% | 85% | Complete system |
| w/o AGR (Flat Retrieval) | 28.7% | 42% | 31% | -56% (critical) |
| w/o PDM (Static Memory) | 40.1% | 67% | 58% | -39% (major) |
| w/o Orchestration Loop | 42.5% | 71% | 62% | -35% (major) |
| w/o Multi-Code Association | 35.8% | 54% | 47% | -45% (critical) |
| w/o Execution Sandbox | 48.2% | 78% | 69% | -27% (significant) |

📖 Documentation

Core Documentation

| Getting Started | Architecture | Benchmarks | API Reference |
|---|---|---|---|
| Quick start guide | System design details | Evaluation methodology | Future API docs |

Performance & Analysis

| Performance | Case Studies | FAQ | Limitations |
|---|---|---|---|
| Detailed metrics | Real-world examples | Common questions | Known constraints |

Results & Rankings

| Leaderboard | Evaluation Results | Analysis | Benchmarks |
|---|---|---|---|
| Performance rankings | SWE-bench Lite | Statistical analysis | Full test suite |

🤝 Contributing

We welcome contributions to the evaluation framework and benchmarks!

How to Contribute

```bash
# 1. Fork and clone the repository
git clone https://github.com/[your-username]/chronos-research.git
cd chronos-research

# 2. Create a feature branch
git checkout -b feature/your-contribution

# 3. Make your changes
#    - Add new benchmarks
#    - Improve documentation
#    - Fix bugs in evaluation scripts

# 4. Run tests
python -m pytest tests/

# 5. Commit your changes
git add .
git commit -m "feat: description of your changes"

# 6. Push and create PR
git push origin feature/your-contribution
```

Contribution Guidelines

  • Add tests for new features
  • Follow existing code style
  • Update documentation
  • Add benchmarks for new capabilities
  • Include performance analysis

See CONTRIBUTING.md for detailed guidelines.


📝 Citation

If you use this research in your work, please cite:

```bibtex
@article{khan2025chronos,
  title={Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding},
  author={Khan, Ishraq and Chowdary, Assad and Haseeb, Sharoz and Patel, Urvish and Zaii, Yousuf},
  journal={arXiv preprint arXiv:2507.12482},
  year={2025},
  url={https://arxiv.org/abs/2507.12482},
  note={State-of-the-art: 80.33\% on SWE-bench Lite}
}
```

🏢 About Kodezi

Kodezi is building the future of autonomous software maintenance. Our mission is to empower developers with AI that truly understands code at scale.

Our Products

| Product | Description | Availability |
|---|---|---|
| Kodezi Code Web-IDE | AI-powered web-based code editor with real-time debugging | Available Now |
| Kodezi Create | Generate full applications from natural language | Available Now |
| Kodezi CLI | Command-line interface for automated code analysis and fixes | Available Now |
| Kodezi OS | Autonomous software maintenance platform with Chronos integration | Q4 2025 (Beta) |
| Chronos | Debugging-first language model (80.33% SWE-bench Lite) | Via Kodezi OS |
| Enterprise API | API access for teams and enterprise deployment | Q1 2026 |

📧 Contact & Community

Connect With Us

Website · Paper · Twitter · LinkedIn · Email

For Enterprise

Sales: sales@kodezi.com · Support: support@kodezi.com · Partnerships: partnerships@kodezi.com


📄 License

© Kodezi Inc. All rights reserved. Use is subject to Kodezi's Terms of Service.

MIT License

Copyright (c) 2025 Kodezi Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

⚠️ Important Notice

This license applies ONLY to the research paper, benchmarks, evaluation frameworks, and documentation contained in this repository.

The Kodezi Chronos model itself is proprietary technology owned by Kodezi Inc. and is NOT included in this repository or covered by this license.

📦 What's Included Under MIT License

  • Research Paper: arXiv publication and markdown versions
  • Benchmark Suite: MRR and evaluation frameworks
  • Evaluation Results: SWE-bench Lite results and analysis
  • Documentation: Architecture docs, guides, and references
  • Reference Implementations: Algorithm reference code (NOT the actual model)

🔒 Proprietary Components

  • Chronos Model: NOT included in this repository
  • Kodezi OS Integration: Proprietary platform components
  • Production APIs: Enterprise deployment infrastructure

🚀 Chronos Model Access

The Chronos model is available exclusively through Kodezi OS:


Research & Resources

Join Waitlist → | Read Paper → | View Results → | Learn More →


Last Updated: November 2025 | Version: 2.0.0
