llm-as-a-judge
Here are 49 public repositories matching this topic...
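The repositories below share the LLM-as-a-judge pattern: one model's output is scored by a second "judge" model against a rubric. As a rough orientation before the listing, here is a minimal sketch of that pattern, assuming the OpenAI Python client; the model name, rubric wording, and 1-5 scale are illustrative placeholders and are not taken from any repository listed here.

```python
# Minimal LLM-as-a-judge sketch (illustrative only; not from any listed repository).
# Assumes the OpenAI Python client; model name and rubric are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Score the ASSISTANT ANSWER
to the QUESTION on a 1-5 scale for correctness and helpfulness.
Reply with only the integer score.

QUESTION:
{question}

ASSISTANT ANSWER:
{answer}
"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to grade an answer; returns the 1-5 score."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is 2 + 2?", "4"))  # expected: a high score such as 5
```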
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
- Updated Nov 5, 2025 - Python
Evaluate your LLM's response with Prometheus and GPT4 💯
- Updated Apr 25, 2025 - Python
Dingo: A Comprehensive AI Data Quality Evaluation Tool
- Updated Nov 6, 2025 - JavaScript
Inference-time scaling for LLMs-as-a-judge.
- Updated Nov 5, 2025 - Jupyter Notebook
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
- Updated Feb 26, 2025 - Python
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
- Updated Apr 17, 2025 - Python
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
- Updated Jun 25, 2024 - Python
The repository for a survey of bias and fairness in information retrieval (IR) with LLMs.
- Updated Sep 4, 2025
Solving Inequality Proofs with Large Language Models.
- Updated Nov 1, 2025 - Python
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
- Updated Jun 3, 2025 - Jupyter Notebook
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
- Updated Nov 3, 2025 - Python
A set of tools for creating synthetically generated data from documents.
- Updated Aug 15, 2025 - Python
Generative Universal Verifier as Multimodal Meta-Reasoner
- Updated Nov 3, 2025 - Python
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
- Updated Oct 23, 2024 - Python
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators".
- Updated Feb 16, 2024 - Jupyter Notebook
The official repository for the EMNLP 2024 paper "Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability".
- Updated Feb 23, 2025 - Python
Harnessing Large Language Models for Curated Code Reviews
- Updated Mar 19, 2025 - Python
MCP as a Judge is a behavioral MCP server that strengthens AI coding assistants by requiring explicit LLM evaluations.
- Updated Oct 27, 2025 - Python