ai-evaluation
Here are 14 public repositories matching this topic...
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Updated Mar 20, 2025 - HTML
Ranking LLMs on agentic tasks
Updated Mar 12, 2025 - Jupyter Notebook
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Mar 22, 2025 - TypeScript
One click to open multiple AI sites and view their results side by side
Updated Jan 21, 2025
Benchmark evaluating LLMs on their ability to create and to resist disinformation. Includes testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Updated Mar 20, 2025
Code scanner to check for issues in prompts and LLM calls
Updated Mar 22, 2025 - Python
JudgeGPT - a research project on (fake) news evaluation
Updated Feb 25, 2025 - Python
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
Updated Oct 1, 2024 - Jupyter Notebook
RJafroc quick start for those already familiar with Windows JAFROC
Updated Dec 28, 2023 - TeX
Repository for the LWDA'24 presentation "Psychometric Profiling of GPT Models for Bias Exploration", featuring conference materials including the poster, paper, slides, and references.
Updated Sep 23, 2024 - TeX