llm-as-a-judge
Here are 49 public repositories matching this topic...
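The repositories below share the LLM-as-a-judge pattern: one model's output is scored by a second "judge" model against a rubric. As a rough orientation before the listing, here is a minimal sketch of that pattern, assuming the OpenAI Python client; the model name, rubric wording, and 1-5 scale are illustrative placeholders and are not taken from any repository listed here.

```python
# Minimal LLM-as-a-judge sketch (illustrative only; not from any listed repository).
# Assumes the OpenAI Python client; model name and rubric are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Score the ASSISTANT ANSWER
to the QUESTION on a 1-5 scale for correctness and helpfulness.
Reply with only the integer score.

QUESTION:
{question}

ASSISTANT ANSWER:
{answer}
"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to grade an answer; returns the 1-5 score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is 2 + 2?", "4"))  # expected: a high score such as 5
```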
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
- Updated Nov 5, 2025 - Python
Evaluate your LLM's response with Prometheus and GPT4 💯
- Updated Apr 25, 2025 - Python
Dingo: A Comprehensive AI Data Quality Evaluation Tool
- Updated Nov 6, 2025 - JavaScript
Inference-time scaling for LLMs-as-a-judge.
- Updated Nov 5, 2025 - Jupyter Notebook
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
- Updated Feb 26, 2025 - Python
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
- Updated Apr 17, 2025 - Python
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
- Updated Jun 25, 2024 - Python
The repository for a survey of bias and fairness in information retrieval (IR) with LLMs.
- Updated Sep 4, 2025
Solving Inequality Proofs with Large Language Models.
- Updated Nov 1, 2025 - Python
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
- Updated Jun 3, 2025 - Jupyter Notebook
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
- Updated Nov 3, 2025 - Python
A set of tools for creating synthetically generated data from documents.
- Updated Aug 15, 2025 - Python
Generative Universal Verifier as Multimodal Meta-Reasoner
- Updated Nov 3, 2025 - Python
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
- Updated Oct 23, 2024 - Python
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators".
- Updated Feb 16, 2024 - Jupyter Notebook
The official repository for the EMNLP 2024 paper "Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability".
- Updated Feb 23, 2025 - Python
Harnessing Large Language Models for Curated Code Reviews
- Updated Mar 19, 2025 - Python
MCP as a Judge is a behavioral MCP server that strengthens AI coding assistants by requiring explicit LLM evaluations.
- Updated Oct 27, 2025 - Python