# llm-evaluation-framework

Here are 37 public repositories matching this topic...

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

  • Updated Nov 29, 2025
  • TypeScript
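
As a purely illustrative sketch of the declarative-config pattern this entry describes (not this repository's actual config schema or CLI), the idea is to map prompts and providers to assertions and run every combination. Providers are stubbed here so the script runs standalone:

```python
# Illustrative sketch of a declarative eval config (hypothetical schema,
# not this repo's actual format). Providers are stubbed so it is runnable.

def stub_gpt(prompt: str) -> str:
    return "Paris is the capital of France."

def stub_claude(prompt: str) -> str:
    return "The capital of France is Paris."

CONFIG = {
    "prompts": ["What is the capital of France?"],
    "providers": {"gpt-stub": stub_gpt, "claude-stub": stub_claude},
    # Each test asserts that the output contains an expected substring.
    "tests": [{"assert_contains": "Paris"}],
}

def run_evals(config: dict) -> None:
    for prompt in config["prompts"]:
        for name, provider in config["providers"].items():
            output = provider(prompt)
            for test in config["tests"]:
                passed = test["assert_contains"] in output
                print(f"{name}: {'PASS' if passed else 'FAIL'} ({prompt!r})")

if __name__ == "__main__":
    run_evals(CONFIG)
```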
agentic_security

Open-source testing platform & SDK for LLM and agentic applications. Define what your app should and shouldn't do in plain language, and Rhesis generates hundreds of test scenarios, runs them, and shows you where it breaks before production. Built for cross-functional teams to collaborate.

  • Updated Nov 28, 2025
  • Python

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

  • Updated Feb 13, 2025
  • Python

An easy Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.

  • Updated Jul 18, 2025
  • Python
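
The exact match and F1 metrics this entry names are the standard SQuAD-style QA metrics. A minimal, self-contained sketch of both (written generically; the package's own normalization and API may differ):

```python
# Standard SQuAD-style QA metrics: exact match and token-level F1.
# Generic sketch; the package above may normalize answers differently.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(f1_score("in Paris, France", "Paris"), 2))   # 0.5
```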

Benchmarking Large Language Models for FHIR

  • Updated Nov 19, 2025
  • TypeScript

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

  • Updated Jul 19, 2024
  • Python

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.

  • Updated Oct 31, 2024
  • Python

Realign is a testing and simulation framework for AI applications.

  • Updated Dec 4, 2024
  • Python

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

  • Updated Oct 28, 2024
  • Jupyter Notebook
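
For context, prediction-powered inference (the technique behind this paper) corrects a quantity estimated from many cheap model-predicted labels using the prediction error measured on a small human-labeled set. A generic sketch of the basic prediction-powered mean estimator, not this repository's code:

```python
# Prediction-powered mean estimate: cheap predictions on a large unlabeled
# set, corrected by the prediction error on a small human-labeled set.
# Generic illustration of the PPI idea, not this repository's code.
import random

random.seed(0)

def predict(x: float) -> float:
    return x + 0.3  # a biased, cheap "model judge"

unlabeled = [random.gauss(0.0, 1.0) for _ in range(10_000)]
labeled = [(x, x) for x in (random.gauss(0.0, 1.0) for _ in range(100))]

# theta_pp = mean(f(X_unlabeled)) - mean(f(X_labeled) - Y_labeled)
pred_mean = sum(predict(x) for x in unlabeled) / len(unlabeled)
bias = sum(predict(x) - y for x, y in labeled) / len(labeled)
theta_pp = pred_mean - bias

print(round(theta_pp, 3))  # near the true mean 0.0 despite the biased judge
```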

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

  • Updated May 6, 2025
  • Jupyter Notebook
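
One common way to realize this (a generic pattern, not this notebook's actual code) is to express evals as ordinary pytest cases, so they run in CI alongside the rest of the test suite. The model call is stubbed below so the example is self-contained:

```python
# Sketch of wiring LLM evals into a pytest suite (generic pattern, not this
# notebook's code). `call_model` is a stand-in for a real API call.
import pytest

def call_model(prompt: str) -> str:
    # Stubbed; replace with an actual LLM client call.
    return "4" if "2 + 2" in prompt else "I don't know."

EVAL_CASES = [
    ("What is 2 + 2?", "4"),
    ("What is 2 + 2? Answer with a digit.", "4"),
]

@pytest.mark.parametrize("prompt,expected", EVAL_CASES)
def test_llm_answers(prompt: str, expected: str):
    assert expected in call_model(prompt)
```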

A measure of the estimated confidence that outputs generated by large language models are free of hallucination.

  • Updated Aug 6, 2025
  • Python
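
One widely used confidence proxy, shown here only as an illustration and not necessarily this repository's method, is the mean token log-probability of the generated answer:

```python
# A common confidence proxy (not necessarily this repo's method): the mean
# token log-probability of the generated answer, mapped to (0, 1] via exp.
# Lower values mean the model was less sure of its own tokens.
import math

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Higher return value = more confident generation."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

confident = [-0.05, -0.10, -0.02]   # model assigned high probability
uncertain = [-2.30, -1.90, -2.70]   # model was guessing

print(round(mean_logprob_confidence(confident), 3))  # ~0.945
print(round(mean_logprob_confidence(uncertain), 3))  # ~0.100
```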

Multilingual Evaluation Toolkits

  • Updated Nov 7, 2024
  • Python

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

  • Updated Jan 17, 2025
  • TypeScript
