# llm-as-a-judge

Here are 49 public repositories matching this topic...

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

  • Updated Nov 5, 2025
  • Python

Evaluate your LLM's response with Prometheus and GPT4 💯

  • Updated Apr 25, 2025
  • Python

👩‍⚖️ Coding Agent-as-a-Judge

  • Updated May 14, 2025
  • Python

Inference-time scaling for LLMs-as-a-judge.

  • Updated Nov 5, 2025
  • Jupyter Notebook

CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)

  • Updated Jun 25, 2024
  • Python

Solving Inequality Proofs with Large Language Models.

  • Updated Nov 1, 2025
  • Python

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

  • Updated Jun 3, 2025
  • Jupyter Notebook
circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

  • Updated Nov 3, 2025
  • Python

A set of tools to create synthetically-generated data from documents

  • Updated Aug 15, 2025
  • Python

Generative Universal Verifier as Multimodal Meta-Reasoner

  • Updated Nov 3, 2025
  • Python

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

  • Updated Oct 23, 2024
  • Python

Code and data for Koo et al's ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

  • Updated Feb 16, 2024
  • Jupyter Notebook

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

  • Updated Feb 23, 2025
  • Python

Harnessing Large Language Models for Curated Code Reviews

  • Updated Mar 19, 2025
  • Python
mcp-as-a-judge

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations

  • Updated Oct 27, 2025
  • Python

Root Signals SDK

  • Updated Nov 5, 2025
  • Python


