llm-evaluation

Star

Here are 374 public repositories matching this topic...

Language:All

Filter by language

All374 Python198 Jupyter Notebook66 TypeScript33 HTML14 JavaScript5 CSS2 Go2 Rust2 Vue2 C#1

Sort:Most stars

Sort options

Most stars Fewest stars Most forks Fewest forks Recently updated Least recently updated

mlflow /mlflow

Star23.1k

The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

open-source machine-learning ai apache-spark evaluation ml openai agents observability model-management mlops mlflow agentops prompt-engineering ai-governance langchain llmops llm-evaluation

UpdatedNov 29, 2025
Python

langfuse /langfuse

Star18.8k

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

UpdatedNov 28, 2025
TypeScript

comet-ml /opik

Star16.2k

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

open-source playground evaluation openai hacktoberfest llm prompt-engineering hacktoberfest2025 langchain llmops llama-index llm-evaluation llm-observability

UpdatedNov 28, 2025
Python

confident-ai /deepeval

Star12.4k

The LLM Evaluation Framework

python hacktoberfest evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

UpdatedNov 28, 2025
Python

promptfoo /promptfoo

Star9.2k

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

UpdatedNov 29, 2025
TypeScript

Arize-ai /phoenix

Star7.8k

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

UpdatedNov 25, 2025
Jupyter Notebook

NVIDIA /garak

Star6.5k

the LLM vulnerability scanner

ai vulnerability-assessment security-scanners llm-security llm-evaluation

UpdatedNov 26, 2025
Python

jeinlee1991 /chinese-llm-benchmark

Star5.2k

ReLE评测：中文AI大模型能力评测（持续更新）：目前已囊括303个大模型，覆盖chatgpt、gpt-5、o4-mini、谷歌gemini-2.5、Claude4.5、智谱GLM-Z1、文心一言、qwen3-max、百川、讯飞星火、商汤senseChat、minimax等商用模型，以及kimi-k2、ernie4.5、minimax-M1、DeepSeek-R1-0528、deepseek-v3.2、qwen3-2507、llama4、GLM4.5、gemma3、mistral等开源大模型。不仅提供排行榜，也提供规模超200万的大模型缺陷库！方便广大社区研究分析、改进大模型。

artificial-intelligence llm-agent llm-evaluation agentic-ai

UpdatedNov 28, 2025

Giskard-AI /giskard-oss

Sponsor

Star5k

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

UpdatedNov 18, 2025
Python

Helicone /helicone

Star4.8k

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

UpdatedNov 27, 2025
TypeScript

Marker-Inc-Korea /AutoRAG

Star4.4k

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

UpdatedNov 20, 2025
Python

PacktPublishing /LLM-Engineers-Handbook

Star4.4k

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

aws rag mlops llm llmops genai fine-tuning-llm llm-evaluation ml-system-design

UpdatedMar 8, 2025
Python

Agenta-AI /agenta

Star3.4k

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

prompt-engineering prompt-management llm-tools llm-framework llm-playground llm-platform llm-evaluation rag-evaluation llm-monitoring llm-as-a-judge llm-observability llmops-platform

UpdatedNov 28, 2025
Python

truera /trulens

Star2.9k

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

UpdatedNov 25, 2025
Python

lmnr-ai /lmnr

Star2.4k

Laminar - open-source all-in-one platform for engineering AI products. Create data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.

rust open-source typescript ai monitoring analytics evaluation ts self-hosted rust-lang developer-tools agents observability aiops ai-observability llmops evals llm-evaluation llm-observability llm-workflow

UpdatedNov 27, 2025
TypeScript

msoedov /agentic_security

Star1.7k

Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪

agent-framework ai-red-team prompt-testing llm-security llm-vulnerabilities llm-evaluation llm-fuzzing llm-evaluation-framework llm-guardrails llm-scanner llm-jailbreaks llm-fuzzer llm-fuzzer-aggregator agent-security

UpdatedNov 27, 2025
Python

genieincodebottle /generative-ai

Star1.6k

Comprehensive resources on Generative AI, including a detailed roadmap, projects, use cases, interview preparation, and coding preparation.

mcp gemini interview-questions claude multimodal n8n n8n-workflow openai-api generative-ai langchain large-language-model genai llm-agent retrieval-augmented-generation llm-evaluation genai-usecase langgraph agentic-framework agentic-ai model-context-protocol

UpdatedNov 25, 2025
Jupyter Notebook

huggingface /aisheets

Star1.6k

Build, enrich, and transform datasets using AI models with no code

oss ai synthetic-data nocode llms llm-evaluation

UpdatedOct 23, 2025
TypeScript

microsoft /prompty

Star1.1k

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

promptengineering llms generative-ai llm-evaluation prompty

UpdatedNov 26, 2025
Python

cvs-health /uqlm

Star1.1k

UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

UpdatedNov 21, 2025
Python

Improve this page

Add a description, image, and links to thellm-evaluation topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with thellm-evaluation topic, visit your repo's landing page and select "manage topics."

Learn more

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly