# Idea: Your OpenTelemetry traces are perfect GRPO training data
Been playing around with pydantic-ai and realized something - you’ve already solved the hardest part of training agents with RL.
## What you've already built
The OpenTelemetry integration is quietly brilliant for this. Every agent run gets traced with:
- The full prompt/messages
- The raw LLM response
- Whether validation succeeded/failed
- The specific validation errors if it failed
That’s literally a perfect GRPO training dataset sitting right there.
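For concreteness, one extracted example might look something like this (the field names are illustrative, not the actual span attributes pydantic-ai emits):

```python
from dataclasses import dataclass, field


@dataclass
class TraceExample:
    """One agent run reconstructed from its OTel spans (illustrative shape only)."""

    messages: list[dict]        # full prompt / message history sent to the LLM
    completion: str             # raw LLM response
    validation_passed: bool     # did result_type validation succeed?
    validation_errors: list[str] = field(default_factory=list)  # Pydantic error details on failure
```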
## The connection I'm seeing
There’s this whole trend of using schema validation as the reward signal for RL training:
- OpenPipe ART - their production GRPO setup does exactly this
- rLLM project - how they trained DeepSWE to 59% on SWE-Bench
- dspy - has a fine-tuning optimiser implementation that looks very interesting
The pattern is always the same:
- Generate response
- Try to validate against schema
- Success = reward 1.0, failure = reward 0.0
- Train with GRPO to maximize validation success rate
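In code, that reward is nothing more than a try/except around Pydantic validation. A minimal sketch (`MyResultType` stands in for whatever schema your agent targets):

```python
from pydantic import BaseModel, ValidationError


class MyResultType(BaseModel):
    """Stand-in for whatever result_type your agent declares."""

    name: str
    score: float


def schema_reward(completion: str) -> float:
    """Binary GRPO reward: 1.0 if the completion validates against the schema, else 0.0."""
    try:
        MyResultType.model_validate_json(completion)
        return 1.0
    except ValidationError:
        return 0.0
```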
## What if…
What if there was just a simple utility to turn your existing OTel traces into training data?
```python
from pydantic_ai.experimental import train_from_traces

# Point at your existing traces
traces = load_otel_traces("path/to/traces")  # or from your observability backend

# Train a LoRA adapter to improve validation success rate
adapter = train_from_traces(
    traces=traces,
    base_model="meta-llama/Llama-3.1-8B",
    target_schema=MyResultType,
    method="grpo",  # could support others later
)

# Use the improved model
agent = Agent(adapter, result_type=MyResultType)
```
## Why this feels natural
You’re already doing the hard parts:
- ✅ Structured output definition (`result_type`)
- ✅ Validation with rich error details
- ✅ Comprehensive tracing of everything
- ✅ Clean abstractions for model swapping
The RL training bit is just: “hey, here’s 1000 examples of what worked and what didn’t, please get better at the thing that worked.”
## Super low-lift starting point
Could literally just be a script that:
- Reads OTel traces from pydantic-ai agents
- Extracts (prompt, response, validation_success) tuples
- Runs basic GRPO training (plenty of open implementations)
- Outputs a LoRA adapter
No changes to the core framework needed. Just a bridge between your excellent tracing and the RL training world.
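A rough sketch of the extraction half of that script, with the caveat that the span attribute keys below (`gen_ai.prompt`, `gen_ai.completion`, `validation.passed`) are placeholders for whatever pydantic-ai actually records, and the export format is assumed to be JSON lines of spans:

```python
import json
from pathlib import Path


def extract_examples(trace_dir: str) -> list[dict]:
    """Pull (prompt, completion, validation_success) tuples out of exported spans."""
    examples = []
    for path in Path(trace_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            span = json.loads(line)
            attrs = span.get("attributes", {})
            if "gen_ai.completion" not in attrs:
                continue  # skip spans that aren't LLM calls
            examples.append(
                {
                    "prompt": attrs["gen_ai.prompt"],
                    "completion": attrs["gen_ai.completion"],
                    "validation_passed": attrs.get("validation.passed", False),
                }
            )
    return examples


examples = extract_examples("path/to/traces")
# From here, any open GRPO implementation will do: the prompts become the
# training dataset, schema_reward (above) scores each sampled completion,
# and the trainer writes out a LoRA adapter.
```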
The cool part is it would work retroactively - any agent you’ve been running and tracing could potentially be improved just from its historical data.
Worth exploring? The data collection problem is already solved, which is usually the hard part.
## Extensions
Once the validation-based RL is in place, we could look into using the evals framework as an input dataset for more nuanced rewards and improvements towards agent outcomes.
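Purely as an illustration (not the pydantic-evals API), a graded reward could blend schema validity with an eval score instead of the binary signal:

```python
def graded_reward(completion: str, eval_score: float) -> float:
    """Blend binary schema validity with a 0-1 eval score (illustrative weighting)."""
    valid = schema_reward(completion)  # 1.0 / 0.0 from the earlier sketch
    return valid * (0.5 + 0.5 * eval_score)  # invalid outputs still score 0
```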