llm-evaluation-framework
Here are 37 public repositories matching this topic...
The LLM Evaluation Framework
- Updated Nov 28, 2025 - Python
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
- Updated Nov 29, 2025 - TypeScript
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
- Updated Nov 27, 2025 - Python
The official evaluation suite and dynamic data release for MixEval.
- Updated Nov 10, 2024 - Python
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
- Updated Nov 26, 2025 - Python
Open-source testing platform & SDK for LLM and agentic applications. Define what your app should and shouldn't do in plain language, and Rhesis generates hundreds of test scenarios, runs them, and shows you where your app breaks before production. Built for cross-functional team collaboration.
- Updated Nov 28, 2025 - Python
MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
- Updated Dec 11, 2024 - Python
Python SDK for experimenting with, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
- Updated Feb 13, 2025 - Python
An easy Python package for running quick, basic QA evaluations. The package includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. It also supports prompting the OpenAI and Anthropic APIs. (A minimal sketch of the exact-match and F1 metrics follows this entry.)
- Updated Jul 18, 2025 - Python
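The entry above names exact match and token-level F1 among its metrics. For reference only, here is a minimal, self-contained sketch of the standard SQuAD-style definitions of those two metrics; it is not this package's API, and the normalization rules (lowercasing, stripping punctuation and articles) are an assumption.

```python
# Minimal sketch of standard QA metrics (exact match, token-level F1).
# Not the package's API; the normalization below is assumed, not taken from the repo.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 2))     # 0.67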
Develop reliable AI apps
- Updated Sep 2, 2025 - Python
Benchmarking Large Language Models for FHIR
- Updated Nov 19, 2025 - TypeScript
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
- Updated Jul 19, 2024 - Python
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
- Updated Oct 31, 2024 - Python
Realign is a testing and simulation framework for AI applications.
- Updated Dec 4, 2024 - Python
Open-source framework for evaluating AI agents
- Updated Nov 23, 2025 - Python
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
- Updated Oct 28, 2024 - Jupyter Notebook
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation. (A minimal pytest-style sketch follows this entry.)
- Updated May 6, 2025 - Jupyter Notebook
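As a rough illustration of the entry above (not that repository's actual code), here is a hedged sketch of folding an LLM evaluation into an ordinary pytest suite. `generate_answer` is a hypothetical stand-in for your app's LLM call, and the golden cases and substring assertion are illustrative only.

```python
# Hypothetical sketch: wiring an LLM evaluation into a pytest test suite.
# `generate_answer` is a stand-in for the LLM-backed app; the cases and the
# substring assertion are illustrative, not any listed framework's real API.
import pytest


def generate_answer(question: str) -> str:
    # Placeholder for the LLM-backed application under test.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(question, "")


GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]


@pytest.mark.parametrize("question,expected", GOLDEN_CASES)
def test_answer_contains_expected(question: str, expected: str) -> None:
    answer = generate_answer(question)
    # Deliberately loose check; real suites layer on semantic or LLM-judge metrics.
    assert expected.lower() in answer.lower()
```

Run with `pytest -q`; the same structure extends to CI by treating failed assertions as regressions.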
A measure of estimated confidence that outputs generated by large language models are not hallucinated.
- Updated Aug 6, 2025 - Python
Multilingual Evaluation Toolkits
- Updated Nov 7, 2024 - Python
TypeScript SDK for experimenting with, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
- Updated Jan 17, 2025 - TypeScript