# evaluation

Here are 2,047 public repositories matching this topic...

mlflow

The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

  • Updated Feb 20, 2026
  • Python
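
To make the tracking workflow concrete, here is a minimal sketch using the MLflow Python API; the experiment name, parameters, and metric values are placeholders for the example, not part of the listing above.

```python
import mlflow

# Group related runs under a named experiment (name chosen for the example).
mlflow.set_experiment("llm-eval-demo")

with mlflow.start_run(run_name="baseline"):
    # Record the configuration of this evaluation run...
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.0)
    # ...and the scores produced by whatever evaluator you run.
    mlflow.log_metric("exact_match", 0.82)
    mlflow.log_metric("avg_latency_ms", 410.0)
```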
langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

  • Updated Feb 20, 2026
  • TypeScript
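
As a sketch of the OpenAI SDK integration mentioned above, the snippet below assumes Langfuse's documented OpenAI drop-in module and that Langfuse and OpenAI credentials (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, OPENAI_API_KEY) are already set in the environment; the model and prompt are placeholders.

```python
# Importing the OpenAI client through Langfuse traces each call automatically
# (assumes Langfuse and OpenAI credentials are set as environment variables).
from langfuse.openai import openai

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response.choices[0].message.content)
```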

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

  • Updated Feb 20, 2026
  • Python

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.

  • Updated Feb 11, 2026
  • Go

Supercharge Your LLM Application Evaluations 🚀

  • Updated Jan 31, 2026
  • Python

🤘 awesome-semantic-segmentation

  • Updated May 8, 2021

Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.

  • Updated Feb 20, 2026
  • TypeScript

Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!

  • Updated Feb 20, 2026
  • Python

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) across 100+ datasets.

  • Updated Feb 14, 2026
  • Python

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

  • Updated Feb 14, 2026
  • Go

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

  • Updated Feb 20, 2026
  • TypeScript
Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

  • Updated Feb 20, 2026
  • Python
AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

  • Updated Dec 23, 2025
  • Python
evo

Arbitrary expression evaluation for golang

  • Updated Mar 25, 2025
  • Go
agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

  • Updated Feb 20, 2026
  • TypeScript

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

  • Updated Feb 20, 2026
  • Python

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

  • Updated Feb 20, 2026
  • Python

SuperCLUE: A comprehensive benchmark for general-purpose Chinese large models | A Benchmark for Foundation Models in Chinese

  • Updated Feb 6, 2026


