evaluation
Here are 2,047 public repositories matching this topic...
The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
- Updated Feb 20, 2026 - Python
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
- Updated Feb 20, 2026 - TypeScript
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
- Updated Feb 20, 2026 - Python
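As a rough illustration of the tracing side of tools like this, here is a minimal, generic span sketch in Python; it is not any platform's actual SDK, and `call_llm` is a hypothetical stand-in for a real model call.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """A minimal trace span: one timed step (LLM call, retrieval, tool use)."""
    name: str
    trace_id: str
    start: float = field(default_factory=time.time)
    end: float | None = None
    metadata: dict = field(default_factory=dict)


def traced(name, trace_id, fn, **metadata):
    """Run fn inside a span, record its duration, and emit the span."""
    span = Span(name=name, trace_id=trace_id, metadata=metadata)
    try:
        return fn()
    finally:
        span.end = time.time()
        # A real platform would export the span to a backend; here we just print it.
        print(f"[{span.trace_id}] {span.name}: {span.end - span.start:.3f}s {span.metadata}")


def call_llm(prompt: str) -> str:
    # Hypothetical model call, used only to make the sketch runnable.
    return f"echo: {prompt}"


trace_id = uuid.uuid4().hex
answer = traced("generate", trace_id, lambda: call_llm("What is RAG?"), model="demo")
print(answer)
```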
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
- Updated Feb 11, 2026 - Go
🤘 awesome-semantic-segmentation
- Updated May 8, 2021
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
- Updated Feb 20, 2026 - TypeScript
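promptfoo itself is driven by declarative YAML configs and a CLI; purely to illustrate the underlying idea (test cases with assertions over model outputs), here is a hedged Python sketch in which `call_model` is a hypothetical stand-in for a real provider call.

```python
# Declarative test cases: a prompt plus an assertion on the model's output.
TESTS = [
    {"prompt": "Reply with the word PONG.", "must_contain": "PONG"},
    {"prompt": "List three primary colors.", "must_contain": "red"},
]


def call_model(prompt: str) -> str:
    # Hypothetical stand-in for calling GPT, Claude, Gemini, Llama, etc.
    return "PONG" if "PONG" in prompt else "red, blue, yellow"


def run_suite(tests):
    """Run every test case and collect the failures."""
    failures = []
    for case in tests:
        output = call_model(case["prompt"])
        if case["must_contain"].lower() not in output.lower():
            failures.append((case["prompt"], output))
    return failures


if __name__ == "__main__":
    failed = run_suite(TESTS)
    print(f"{len(TESTS) - len(failed)}/{len(TESTS)} tests passed")
```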
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
- Updated Feb 20, 2026 - Python
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- Updated Feb 14, 2026 - Python
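The core of benchmark-style evaluation like this is a scoring loop over datasets; a minimal sketch follows, with a toy dataset and a hypothetical `ask_model` function rather than OpenCompass's real API.

```python
# Toy benchmark rows: question plus gold answer. Real suites span 100+ datasets.
DATASET = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]


def ask_model(model_name: str, question: str) -> str:
    # Hypothetical model call; canned answers keep the sketch runnable.
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "")


def accuracy(model_name: str, rows) -> float:
    """Exact-match accuracy of one model over one dataset."""
    correct = sum(
        1 for row in rows
        if ask_model(model_name, row["question"]).strip() == row["answer"]
    )
    return correct / len(rows)


for model in ("model-a", "model-b"):
    print(model, f"{accuracy(model, DATASET):.2%}")
```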
Next-generation AI agent optimization platform: CozeLoop addresses challenges in AI agent development by providing full-lifecycle management, from development, debugging, and evaluation to monitoring.
- Updated Feb 14, 2026 - Go
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
- Updated Feb 20, 2026 - TypeScript
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
- Updated Feb 20, 2026 - Python
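One of the pieces listed above, synthetic data generation, usually amounts to prompting a model to produce question/answer pairs grounded in source documents; below is a hedged sketch of that idea, with `generate` as a hypothetical LLM call (a real pipeline would also deduplicate and quality-filter the output).

```python
DOCS = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Python was created by Guido van Rossum and released in 1991.",
]


def generate(prompt: str) -> str:
    # Hypothetical LLM call; returns a fixed Q/A so the sketch runs offline.
    return "Q: When was it created?\nA: See the source passage."


def synthesize_pairs(docs):
    """Turn each document into one grounded question/answer pair."""
    pairs = []
    for doc in docs:
        raw = generate(f"Write one question and answer grounded in: {doc}")
        question, _, answer = raw.partition("\nA: ")
        pairs.append({
            "context": doc,
            "question": question.removeprefix("Q: "),
            "answer": answer,
        })
    return pairs


for pair in synthesize_pairs(DOCS):
    print(pair)
```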
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
- Updated Dec 23, 2025 - Python
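Whatever the framework, RAG evaluation typically bottoms out in standard retrieval metrics such as recall@k and mean reciprocal rank; here is a minimal sketch of those two metrics over toy document ids (not AutoRAG's API).

```python
def recall_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0


def reciprocal_rank(retrieved, relevant) -> float:
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: doc ids returned by a retriever vs. the gold relevant set.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}
print("recall@3:", recall_at_k(retrieved, relevant, k=3))  # 0.5
print("MRR:", reciprocal_rank(retrieved, relevant))        # ~0.333
```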
Python package for the evaluation of odometry and SLAM
- Updated Feb 11, 2026 - Python
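For odometry/SLAM evaluation, the headline number is usually the absolute trajectory error (ATE); a minimal NumPy sketch follows, assuming the estimated and ground-truth trajectories are already time-associated and expressed in the same frame (the alignment step is omitted, and this is not the package's own CLI or API).

```python
import numpy as np


def ate_rmse(estimated: np.ndarray, ground_truth: np.ndarray) -> float:
    """ATE as the RMSE of translational errors between matched poses."""
    errors = np.linalg.norm(estimated - ground_truth, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy 3D positions for three matched timestamps.
est = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.0, 0.1, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(f"ATE RMSE: {ate_rmse(est, gt):.3f} m")
```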
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
- Updated Feb 20, 2026 - TypeScript
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
- Updated Feb 20, 2026 - Python
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- Updated Feb 20, 2026 - Python
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
- Updated Jan 11, 2021 - Haskell
SuperCLUE: A comprehensive benchmark for general-purpose foundation models in Chinese
- Updated Feb 6, 2026