evaluation
Here are 2,047 public repositories matching this topic...
The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
- Updated Feb 20, 2026 - Python
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
- Updated Feb 20, 2026 - TypeScript
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
- Updated Feb 20, 2026 - Python
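As a rough illustration of the tracing side of tools like this, here is a minimal, generic span sketch in Python; it is not any platform's actual SDK, and `call_llm` is a hypothetical stand-in for a real model call.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """A minimal trace span: one timed step (LLM call, retrieval, tool use)."""
    name: str
    trace_id: str
    start: float = field(default_factory=time.time)
    end: float | None = None
    metadata: dict = field(default_factory=dict)


def traced(name, trace_id, fn, **metadata):
    """Run fn inside a span, record its duration, and emit the span."""
    span = Span(name=name, trace_id=trace_id, metadata=metadata)
    try:
        return fn()
    finally:
        span.end = time.time()
        # A real platform would export the span to a backend; here we just print it.
        print(f"[{span.trace_id}] {span.name}: {span.end - span.start:.3f}s {span.metadata}")


def call_llm(prompt: str) -> str:
    # Hypothetical model call, used only to make the sketch runnable.
    return f"echo: {prompt}"


trace_id = uuid.uuid4().hex
answer = traced("generate", trace_id, lambda: call_llm("What is RAG?"), model="demo")
print(answer)
```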
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
- Updated Feb 11, 2026 - Go
🤘 awesome-semantic-segmentation
- Updated May 8, 2021
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
- Updated Feb 20, 2026 - TypeScript
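promptfoo itself is driven by declarative YAML configs and a CLI; purely to illustrate the underlying idea (test cases with assertions over model outputs), here is a hedged Python sketch in which `call_model` is a hypothetical stand-in for a real provider call.

```python
# Declarative test cases: a prompt plus an assertion on the model's output.
TESTS = [
    {"prompt": "Reply with the word PONG.", "must_contain": "PONG"},
    {"prompt": "List three primary colors.", "must_contain": "red"},
]


def call_model(prompt: str) -> str:
    # Hypothetical stand-in for calling GPT, Claude, Gemini, Llama, etc.
    return "PONG" if "PONG" in prompt else "red, blue, yellow"


def run_suite(tests):
    """Run every test case and collect the failures."""
    failures = []
    for case in tests:
        output = call_model(case["prompt"])
        if case["must_contain"].lower() not in output.lower():
            failures.append((case["prompt"], output))
    return failures


if __name__ == "__main__":
    failed = run_suite(TESTS)
    print(f"{len(TESTS) - len(failed)}/{len(TESTS)} tests passed")
```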
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
- Updated Feb 20, 2026 - Python
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- Updated Feb 14, 2026 - Python
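The core of benchmark-style evaluation like this is a scoring loop over datasets; a minimal sketch follows, with a toy dataset and a hypothetical `ask_model` function rather than OpenCompass's real API.

```python
# Toy benchmark rows: question plus gold answer. Real suites span 100+ datasets.
DATASET = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]


def ask_model(model_name: str, question: str) -> str:
    # Hypothetical model call; canned answers keep the sketch runnable.
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "")


def accuracy(model_name: str, rows) -> float:
    """Exact-match accuracy of one model over one dataset."""
    correct = sum(
        1 for row in rows
        if ask_model(model_name, row["question"]).strip() == row["answer"]
    )
    return correct / len(rows)


for model in ("model-a", "model-b"):
    print(model, f"{accuracy(model, DATASET):.2%}")
```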
Next-generation AI agent optimization platform: CozeLoop addresses challenges in AI agent development by providing full-lifecycle management, from development, debugging, and evaluation to monitoring.
- Updated Feb 14, 2026 - Go
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
- Updated Feb 20, 2026 - TypeScript
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
- Updated Feb 20, 2026 - Python
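One of the pieces listed above, synthetic data generation, usually amounts to prompting a model to produce question/answer pairs grounded in source documents; below is a hedged sketch of that idea, with `generate` as a hypothetical LLM call (a real pipeline would also deduplicate and quality-filter the output).

```python
DOCS = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Python was created by Guido van Rossum and released in 1991.",
]


def generate(prompt: str) -> str:
    # Hypothetical LLM call; returns a fixed Q/A so the sketch runs offline.
    return "Q: When was it created?\nA: See the source passage."


def synthesize_pairs(docs):
    """Turn each document into one grounded question/answer pair."""
    pairs = []
    for doc in docs:
        raw = generate(f"Write one question and answer grounded in: {doc}")
        question, _, answer = raw.partition("\nA: ")
        pairs.append({
            "context": doc,
            "question": question.removeprefix("Q: "),
            "answer": answer,
        })
    return pairs


for pair in synthesize_pairs(DOCS):
    print(pair)
```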
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
- Updated Dec 23, 2025 - Python
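Whatever the framework, RAG evaluation typically bottoms out in standard retrieval metrics such as recall@k and mean reciprocal rank; here is a minimal sketch of those two metrics over toy document ids (not AutoRAG's API).

```python
def recall_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0


def reciprocal_rank(retrieved, relevant) -> float:
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: doc ids returned by a retriever vs. the gold relevant set.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}
print("recall@3:", recall_at_k(retrieved, relevant, k=3))  # 0.5
print("MRR:", reciprocal_rank(retrieved, relevant))        # ~0.333
```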
Python package for the evaluation of odometry and SLAM
- Updated Feb 11, 2026 - Python
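For odometry/SLAM evaluation, the headline number is usually the absolute trajectory error (ATE); a minimal NumPy sketch follows, assuming the estimated and ground-truth trajectories are already time-associated and expressed in the same frame (the alignment step is omitted, and this is not the package's own CLI or API).

```python
import numpy as np


def ate_rmse(estimated: np.ndarray, ground_truth: np.ndarray) -> float:
    """ATE as the RMSE of translational errors between matched poses."""
    errors = np.linalg.norm(estimated - ground_truth, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy 3D positions for three matched timestamps.
est = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.0, 0.1, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(f"ATE RMSE: {ate_rmse(est, gt):.3f} m")
```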
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
- Updated Feb 20, 2026 - TypeScript
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
- Updated Feb 20, 2026 - Python
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- Updated Feb 20, 2026 - Python
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
- Updated Jan 11, 2021 - Haskell
SuperCLUE: A comprehensive benchmark for general-purpose foundation models in Chinese
- Updated Feb 6, 2026