llm-evaluation-framework
Here are 37 public repositories matching this topic...
The LLM Evaluation Framework
- Updated Nov 28, 2025 - Python
Test your prompts, agents, and RAG pipelines. AI red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
- Updated Nov 29, 2025 - TypeScript
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
- Updated Nov 27, 2025 - Python
The official evaluation suite and dynamic data release for MixEval.
- Updated Nov 10, 2024 - Python
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
- Updated Nov 26, 2025 - Python
Open-source testing platform & SDK for LLM and agentic applications. Define what your app should and shouldn't do in plain language, and Rhesis generates hundreds of test scenarios, runs them, and shows you where your app breaks before production. Built for cross-functional team collaboration.
- Updated Nov 28, 2025 - Python
MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
- Updated Dec 11, 2024 - Python
Python SDK for experimenting with, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
- Updated Feb 13, 2025 - Python
An easy Python package for running quick, basic QA evaluations. The package includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. It also supports prompting the OpenAI and Anthropic APIs. (A minimal sketch of the exact-match and F1 metrics follows this entry.)
- Updated Jul 18, 2025 - Python
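The entry above names exact match and token-level F1 among its metrics. For reference only, here is a minimal, self-contained sketch of the standard SQuAD-style definitions of those two metrics; it is not this package's API, and the normalization rules (lowercasing, stripping punctuation and articles) are an assumption.

```python
# Minimal sketch of standard QA metrics (exact match, token-level F1).
# Not the package's API; the normalization below is assumed, not taken from the repo.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 2))     # 0.67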
Develop reliable AI apps
- Updated Sep 2, 2025 - Python
Benchmarking Large Language Models for FHIR
- Updated Nov 19, 2025 - TypeScript
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
- Updated Jul 19, 2024 - Python
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
- Updated Oct 31, 2024 - Python
Realign is a testing and simulation framework for AI applications.
- Updated Dec 4, 2024 - Python
Open-source framework for evaluating AI agents
- Updated Nov 23, 2025 - Python
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
- Updated Oct 28, 2024 - Jupyter Notebook
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation. (A minimal pytest-style sketch follows this entry.)
- Updated May 6, 2025 - Jupyter Notebook
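As a rough illustration of the entry above (not that repository's actual code), here is a hedged sketch of folding an LLM evaluation into an ordinary pytest suite. `generate_answer` is a hypothetical stand-in for your app's LLM call, and the golden cases and substring assertion are illustrative only.

```python
# Hypothetical sketch: wiring an LLM evaluation into a pytest test suite.
# `generate_answer` is a stand-in for the LLM-backed app; the cases and the
# substring assertion are illustrative, not any listed framework's real API.
import pytest


def generate_answer(question: str) -> str:
    # Placeholder for the LLM-backed application under test.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(question, "")


GOLDEN_CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]


@pytest.mark.parametrize("question,expected", GOLDEN_CASES)
def test_answer_contains_expected(question: str, expected: str) -> None:
    answer = generate_answer(question)
    # Deliberately loose check; real suites layer on semantic or LLM-judge metrics.
    assert expected.lower() in answer.lower()
```

Run with `pytest -q`; the same structure extends to CI by treating failed assertions as regressions.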
A measure of estimated confidence that outputs generated by large language models are not hallucinated.
- Updated Aug 6, 2025 - Python
Multilingual Evaluation Toolkits
- Updated Nov 7, 2024 - Python
TypeScript SDK for experimenting with, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
- Updated Jan 17, 2025 - TypeScript