evaluation
Here are 1,801 public repositories matching this topic...
The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
- Updated Dec 18, 2025 - Python
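This description reads like MLflow's current tagline; assuming so, a minimal experiment-tracking sketch (the experiment name, parameter, and metric are illustrative placeholders, not from the project):

```python
# A minimal tracking sketch, assuming this entry is MLflow.
import mlflow

mlflow.set_experiment("agent-evals")          # illustrative experiment name
with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")  # configuration under test
    mlflow.log_metric("answer_accuracy", 0.87)  # evaluation result
```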
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
- Updated Dec 17, 2025 - TypeScript
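If this entry is Langfuse, tracing an LLM call can be as small as one decorator. A minimal sketch, assuming the langfuse Python SDK with LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY set in the environment (older 2.x releases export the decorator from langfuse.decorators instead):

```python
from langfuse import observe  # v2.x: from langfuse.decorators import observe

@observe()  # records this call as a trace in Langfuse
def answer(question: str) -> str:
    # ...call your LLM here; nested @observe functions become child spans
    return f"echo: {question}"

answer("What is retrieval-augmented generation?")
```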
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
- Updated Dec 18, 2025 - Python
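This description matches Comet's Opik; assuming its Python SDK (configured once via `opik configure`), a minimal tracing sketch:

```python
from opik import track

@track  # logs inputs, outputs, and timing as a trace
def summarize(text: str) -> str:
    # ...invoke an LLM here; nested tracked calls appear as child spans
    return text[:80]

summarize("Opik traces LLM apps, RAG pipelines, and agent workflows.")
```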
🤘 awesome-semantic-segmentation
- Updated May 8, 2021
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
- Updated Dec 18, 2025 - TypeScript
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
- Updated Dec 17, 2025 - Go
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!
- Updated Dec 18, 2025 - Python
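This reads like Unsloth; assuming its FastLanguageModel API, a minimal fine-tuning setup sketch (the checkpoint id and hyperparameters are illustrative):

```python
from unsloth import FastLanguageModel

# Load a quantized base model; the checkpoint id is a placeholder.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization to fit consumer GPUs
)
# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```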
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
- Updated Dec 17, 2025 - Python
Next-generation AI agent optimization platform: CozeLoop addresses challenges in AI agent development by providing full-lifecycle management capabilities, from development, debugging, and evaluation to monitoring.
- Updated Dec 18, 2025 - Go
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
- Updated Dec 18, 2025 - TypeScript
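The "one line of code" here appears to be Helicone's gateway integration: point the OpenAI SDK at the proxy and requests are logged automatically. A sketch, assuming the documented base URL and auth header:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # proxy instead of api.openai.com
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)
# Calls work as usual but are now captured for monitoring and evals.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```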
Easily build AI systems with Evals, RAG, Agents, fine-tuning, synthetic data, and more.
- Updated Dec 18, 2025 - Python
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
- Updated Nov 20, 2025 - Python
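A minimal sketch of AutoRAG's entry point as its README describes it (the file paths and YAML name are placeholders):

```python
from autorag.evaluator import Evaluator

evaluator = Evaluator(
    qa_data_path="qa.parquet",          # evaluation question/answer pairs
    corpus_data_path="corpus.parquet",  # documents to retrieve from
)
# Searches over candidate RAG pipeline configs and scores each one.
evaluator.start_trial("config.yaml")
```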
Python package for the evaluation of odometry and SLAM
- Updated Nov 12, 2025 - Python
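This is the typical shape of an absolute pose error (APE) computation with such a package. A sketch, assuming evo's Python API (evo.tools.file_interface, evo.core.metrics); the file paths are illustrative:

```python
from evo.tools import file_interface
from evo.core import metrics, sync

ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
est = file_interface.read_tum_trajectory_file("estimate.txt")
ref, est = sync.associate_trajectories(ref, est)  # match by timestamp
est.align(ref)  # Umeyama alignment before computing error

ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((ref, est))
print("APE RMSE:", ape.get_statistic(metrics.StatisticsType.rmse))
```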
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
- Updated Dec 17, 2025 - Python
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
- Updated Jan 11, 2021 - Haskell
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- Updated Dec 18, 2025 - Python
SuperCLUE: A Comprehensive Benchmark for General-Purpose Foundation Models in Chinese
- Updated Sep 8, 2025
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
- Updated Oct 1, 2024 - HTML