
Reinforcement Learning for Pydantic AI Agents #2202

Open
@benomahony

Description


Idea: Your OpenTelemetry traces are perfect GRPO training data

Been playing around with pydantic-ai and realized something - you’ve already solved the hardest part of training agents with RL.

What you’ve already built

The OpenTelemetry integration is quietly brilliant for this. Every agent run gets traced with:

  • The full prompt/messages
  • The raw LLM response
  • Whether validation succeeded/failed
  • The specific validation errors if it failed

That’s literally a perfect GRPO training dataset sitting right there.
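For concreteness, here is a rough sketch of what one traced run could reduce to. The field names are illustrative placeholders, not the actual span attribute names the integration emits:

# Illustrative only: field names are placeholders for whatever attributes
# pydantic-ai's OpenTelemetry spans actually carry.
trace_record = {
    "prompt": [
        {"role": "system", "content": "Extract the invoice as JSON."},
        {"role": "user", "content": "Invoice #123, total $49.99, due 2024-07-01."},
    ],
    "response": '{"invoice_id": "123", "total": 49.99, "due": "2024-07-01"}',
    "validation_success": True,
    "validation_errors": None,  # would hold the Pydantic error details on failure
}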

The connection I’m seeing

There’s this whole trend of using schema validation as the reward signal for RL training:

  • OpenPipe ART - their production GRPO setup does exactly this
  • rLLM project - how they trained DeepSWE to 59% on SWE-Bench
  • dspy - has a fine-tuning optimiser implementation that looks very interesting

The pattern is always the same (a minimal sketch follows this list):

  1. Generate response
  2. Try to validate against schema
  3. Success = reward 1.0, failure = reward 0.0
  4. Train with GRPO to maximize validation success rate
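As a sketch, steps 2-3 amount to a try/except around schema validation. MyResultType below is just a stand-in for whatever Pydantic model your result_type points at:

from pydantic import BaseModel, ValidationError


class MyResultType(BaseModel):
    invoice_id: str
    total: float


def validation_reward(raw_response: str) -> float:
    """Reward 1.0 if the raw LLM output parses into the schema, else 0.0."""
    try:
        MyResultType.model_validate_json(raw_response)
        return 1.0
    except ValidationError:
        return 0.0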

What if…

What if there was just a simple utility to turn your existing OTel traces into training data?

from pydantic_ai.experimental import train_from_traces

# Point at your existing traces
traces = load_otel_traces("path/to/traces")  # or from your observability backend

# Train a LoRA adapter to improve validation success rate
adapter = train_from_traces(
    traces=traces,
    base_model="meta-llama/Llama-3.1-8B",
    target_schema=MyResultType,
    method="grpo",  # could support others later
)

# Use the improved model
agent = Agent(adapter, result_type=MyResultType)

Why this feels natural

You’re already doing the hard parts:

  • ✅ Structured output definition (result_type)
  • ✅ Validation with rich error details
  • ✅ Comprehensive tracing of everything
  • ✅ Clean abstractions for model swapping

The RL training bit is just: “hey, here’s 1000 examples of what worked and what didn’t, please get better at the thing that worked.”
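To make that concrete, here is a minimal agent in the same result_type style as the snippet above; the model name and prompt are placeholders. Each run is validated against the schema and, with tracing enabled, the prompt, raw response, and validation outcome all end up in the spans:

from pydantic import BaseModel
from pydantic_ai import Agent


class MyResultType(BaseModel):
    invoice_id: str
    total: float


agent = Agent("openai:gpt-4o", result_type=MyResultType)

# The run either returns a validated MyResultType or surfaces the
# validation errors; both outcomes are recorded in the trace.
result = agent.run_sync("Invoice #123, total $49.99.")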

Super low-lift starting point

It could literally just be a script (sketched below) that:

  1. Reads OTel traces from pydantic-ai agents
  2. Extracts (prompt, response, validation_success) tuples
  3. Runs basic GRPO training (plenty of open implementations)
  4. Outputs a LoRA adapter
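Here is a rough sketch of steps 1-2. The span attribute names ("prompt", "response", "validation_success") are placeholders for whatever the pydantic-ai OTel integration actually emits, and step 3 is left to any of the open GRPO implementations mentioned above:

import json
from pathlib import Path


def extract_training_tuples(trace_dir: str) -> list[tuple[str, str, bool]]:
    """Turn exported OTel trace JSON into (prompt, response, validation_success) tuples."""
    tuples = []
    for path in Path(trace_dir).glob("*.json"):
        for span in json.loads(path.read_text()).get("spans", []):
            attrs = span.get("attributes", {})
            if "validation_success" not in attrs:
                continue  # skip spans that aren't agent-run spans
            tuples.append(
                (attrs["prompt"], attrs["response"], bool(attrs["validation_success"]))
            )
    return tuples


# tuples = extract_training_tuples("path/to/traces")
# Feed the tuples into a GRPO trainer with reward = 1.0 if success else 0.0.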

No changes to the core framework needed. Just a bridge between your excellent tracing and the RL training world.

The cool part is it would work retroactively - any agent you’ve been running and tracing could potentially be improved just from its historical data.

Worth exploring? The data collection problem is already solved, which is usually the hard part.

Extensions:

Once the validation-based RL is implemented, we could look into using the evals framework as an input dataset for more nuanced rewards and further improvements towards agent outcomes.
