The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.

judgmentlabs.ai/

License

Apache-2.0 license


Agent Behavior Monitoring (ABM)

Track and judge any agent behavior in online and offline setups. Set up Sentry-style alerts and analyze agent behaviors / topic patterns at scale!

[NEW] 🎆 Agent Reinforcement Learning

Train your agents with multi-turn reinforcement learning using Judgeval and Fireworks AI! Judgeval's ABM now integrates with Fireworks' Reinforcement Fine-Tuning (RFT) endpoint, supporting gpt-oss, qwen3, Kimi2, DeepSeek, and more.

Judgeval's agent monitoring infrastructure provides a simple harness for integrating GRPO into any Python agent, giving builders a quick way to try RL with minimal code changes to their existing agents!

    await trainer.train(
        agent_function=your_agent_function,  # entry point to your agent
        scorers=[RewardScorer()],  # custom scorer you define based on task criteria; acts as the reward
        prompts=training_prompts,  # tasks
    )

That's it! Judgeval automatically manages trajectory collection and reward tagging - your agent can learn from production data with minimal code changes.
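To make the role of the reward concrete, here is a minimal sketch of the group-relative advantage step that GRPO is generally described as using: rewards for several rollouts of the same prompt are normalized against the group's mean and standard deviation. This is an illustration only; `group_relative_advantages` and the sample rewards are hypothetical and do not reflect how Judgeval or Fireworks implement training internally.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Hypothetical helper for illustration; not part of the judgeval API.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four rollouts of one prompt, as a reward scorer might return.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that score above the group average get a positive advantage and are reinforced; those below the average get a negative one.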

👉 Check out the Wikipedia Racer notebook, where an agent learns to navigate Wikipedia using RL, to see Judgeval in action.

You can view and monitor training progress for free via the Judgment Dashboard.

Judgeval Overview

Judgeval is an open-source framework for agent behavior monitoring. It offers a toolkit to track and judge agent behavior in online and offline setups, so you can convert interaction data from production and test environments into improved agents. To get started, try running one of the notebooks below or dive deeper in our docs.

Our mission is to unlock the power of production data for agent development, enabling teams to improve their apps by catching real-time failures and optimizing over their users' preferences.

📚 Cookbooks

Try Out	Notebook	Description
RL	Wikipedia Racer	Train agents with reinforcement learning
Online ABM	Research Agent	Monitor agent behavior in production
Custom Scorers	HumanEval	Build custom evaluators for your agents
Offline Testing	[Get Started For Free]	Compare how different prompts, models, or agent configs affect performance across ANY metric

You can access our repo of cookbooks.

You can find a list of video tutorials for Judgeval use cases.

Why Judgeval?

🤖 Simple to run multi-turn RL: Optimize your agents with multi-turn RL without managing compute infrastructure or data pipelines. Just add a few lines of code to your existing agent code and train!

⚙️ Custom Evaluators: You are not restricted to monitoring with prefab scorers. Judgeval provides simple abstractions for custom Python scorers, supporting any LLM-as-a-judge rubrics/models and code-based scorers that integrate with our live agent-tracking infrastructure.

🚨 Production Monitoring: Run any custom scorer in a hosted, virtualized secure container to flag agent behaviors online in production. Get Slack alerts for failures and add custom hooks to address regressions before they impact users.

📊 Behavior/Topic Grouping: Group agent runs by behavior type or topic for deeper analysis. Drill down into subsets of users, agents, or use cases to reveal patterns of agent behavior.

🧪 Run experiments on your agents: Compare and test different prompts, models, or agent configs across customer segments. Measure which changes improve agent performance and reduce bad agent behaviors.
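As a rough illustration of the grouping and experiment ideas above, the sketch below averages scores per agent config and per topic. The run records, their field names, and the `mean_score_by` helper are all made up for this example; they are not Judgeval data structures or results.

```python
from collections import defaultdict


def mean_score_by(runs, key):
    """Average the "score" field of each run, grouped by the given key."""
    totals = defaultdict(lambda: [0.0, 0])  # key value -> [score sum, run count]
    for run in runs:
        bucket = totals[run[key]]
        bucket[0] += run["score"]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical scored runs; in practice these would come from your own exports.
runs = [
    {"config": "prompt_a", "topic": "billing", "score": 1.0},
    {"config": "prompt_a", "topic": "search", "score": 0.0},
    {"config": "prompt_b", "topic": "billing", "score": 1.0},
    {"config": "prompt_b", "topic": "search", "score": 1.0},
]

by_config = mean_score_by(runs, "config")
by_topic = mean_score_by(runs, "topic")
```

Grouping the same runs by different keys is what lets one dataset answer both "which config performs better?" and "which topic is the agent weakest on?".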

🛠️ Quickstart

Get started with Judgeval by installing our SDK using pip:

pip install judgeval

Ensure you have your JUDGMENT_API_KEY and JUDGMENT_ORG_ID environment variables set to connect to the Judgment Platform.

    export JUDGMENT_API_KEY=...
    export JUDGMENT_ORG_ID=...
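If you want your agent to fail fast when these variables are missing, a small pre-flight check like the following can help. This is a sketch using only the standard library; `missing_judgment_vars` is a hypothetical helper and not part of the judgeval SDK.

```python
import os

REQUIRED_VARS = ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID")


def missing_judgment_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Call `missing_judgment_vars()` at startup and raise an error if the returned list is non-empty, so misconfiguration surfaces before any agent code runs.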

If you don't have keys, create an account for free on the platform!

Start monitoring with Judgeval

    from judgeval.tracer import Tracer, wrap
    from judgeval.data import Example
    from judgeval.scorers import AnswerRelevancyScorer
    from openai import OpenAI

    judgment = Tracer(project_name="default_project")
    client = wrap(OpenAI())  # tracks all LLM calls


    @judgment.observe(span_type="tool")
    def format_question(question: str) -> str:
        # dummy tool
        return f"Question : {question}"


    @judgment.observe(span_type="function")
    def run_agent(prompt: str) -> str:
        task = format_question(prompt)
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": task}],
        )
        answer = response.choices[0].message.content  # text of the model's reply

        judgment.async_evaluate(  # trigger online monitoring
            scorer=AnswerRelevancyScorer(threshold=0.5),  # swap with any scorer
            example=Example(input=task, actual_output=answer),  # customize to your data
            model="gpt-5",
        )
        return answer


    run_agent("What is the capital of the United States?")

Running this code delivers monitoring results to your free platform account.

Customizable Scorers Over Agent Behavior

Judgeval's main strength is full customization of the scorers you can use for online monitoring. You are not restricted to single-prompt LLM judges or prefab scorers: if you can express your scorer in Python code, Judgeval can monitor it! Under the hood, Judgeval hosts your scorer in a virtualized secure container, enabling online monitoring for any scorer.

First, create a behavior scorer in a file called helpfulness_scorer.py:

    from judgeval.data import Example
    from judgeval.scorers.example_scorer import ExampleScorer


    # Define custom example class
    class QuestionAnswer(Example):
        question: str
        answer: str


    # Define a server-hosted custom scorer
    class HelpfulnessScorer(ExampleScorer):
        name: str = "Helpfulness Scorer"
        server_hosted: bool = True  # Enable server hosting

        async def a_score_example(self, example: QuestionAnswer):
            # Custom scoring logic for agent behavior
            # Can be an arbitrary combination of code and LLM calls
            if len(example.answer) > 10 and "?" not in example.answer:
                self.reason = "Answer is detailed and provides helpful information"
                return 1.0
            else:
                self.reason = "Answer is too brief or unclear"
                return 0.0
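Because the scoring rule in `a_score_example` is ordinary Python, it can be sanity-checked locally before uploading. The sketch below copies that rule into a standalone function so it runs without judgeval installed; `helpfulness_score` is a hypothetical helper introduced here for illustration only.

```python
def helpfulness_score(answer: str) -> tuple[float, str]:
    """Return (score, reason) using the same rule as HelpfulnessScorer."""
    if len(answer) > 10 and "?" not in answer:
        return 1.0, "Answer is detailed and provides helpful information"
    return 0.0, "Answer is too brief or unclear"


# A long, declarative answer passes; short answers and questions do not.
detailed = helpfulness_score("Washington, D.C. is the capital.")
brief = helpfulness_score("Paris")
question = helpfulness_score("What do you mean by capital?")
```

Keeping the rule in a pure function like this also makes it easy to unit test as the heuristic grows more complex.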

Then deploy your scorer to Judgment's infrastructure:

    echo "pydantic" > requirements.txt
    uv run judgeval upload_scorer helpfulness_scorer.py requirements.txt

Now you can instrument your agent with monitoring and online evaluation:

    from judgeval.tracer import Tracer, wrap
    from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer
    from openai import OpenAI

    judgment = Tracer(project_name="default_project")
    client = wrap(OpenAI())  # tracks all LLM calls


    @judgment.observe(span_type="tool")
    def format_task(question: str) -> str:
        # replace with your prompt engineering
        return f"Please answer the following question: {question}"


    @judgment.observe(span_type="tool")
    def answer_question(prompt: str) -> str:
        # replace with your LLM system calls
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


    @judgment.observe(span_type="function")
    def run_agent(question: str) -> str:
        task = format_task(question)
        answer = answer_question(task)

        # Add online evaluation with server-hosted scorer
        judgment.async_evaluate(
            scorer=HelpfulnessScorer(),
            example=QuestionAnswer(question=question, answer=answer),
            sampling_rate=0.9,  # Evaluate 90% of agent runs
        )
        return answer


    if __name__ == "__main__":
        result = run_agent("What is the capital of the United States?")
        print(result)

Congratulations! Your online eval results are now delivered to your platform account.

You can now run any online scorer in secure Firecracker microVMs with no latency impact on your applications.
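The `sampling_rate=0.9` argument in the example above is described as evaluating 90% of agent runs. One common way to get that behavior is an independent random draw per run, sketched below. This is an assumption about how such sampling could work, not a description of Judgeval's internals; `should_evaluate` is a hypothetical helper.

```python
import random


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether to evaluate one run, given a rate between 0.0 and 1.0."""
    # rng.random() is uniform on [0.0, 1.0), so a rate of 1.0 always evaluates
    # and a rate of 0.0 never does.
    return rng.random() < sampling_rate


rng = random.Random(0)  # seeded so the sketch is reproducible
evaluated = sum(should_evaluate(0.9, rng) for _ in range(10_000))
```

With per-run sampling the evaluated share only approaches 90% over many runs; any small batch can land noticeably above or below it.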

Judgeval is created and maintained by Judgment Labs.


Releases: 67 (latest: v0.23.2, Nov 29, 2025)

Contributors: 26
