
Reinforcement Learning for Pydantic AI Agents #2202

Open
@benomahony

Description


Idea: Your OpenTelemetry traces are perfect GRPO training data

Been playing around with pydantic-ai and realized something - you’ve already solved the hardest part of training agents with RL.

What you’ve already built

The OpenTelemetry integration is quietly brilliant for this. Every agent run gets traced with:

  • The full prompt/messages
  • The raw LLM response
  • Whether validation succeeded/failed
  • The specific validation errors if it failed

That’s literally a perfect GRPO training dataset sitting right there.
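For concreteness, here is a rough sketch of what one traced run could reduce to. The field names are illustrative placeholders, not the actual span attribute names the integration emits:

# Illustrative only: field names are placeholders for whatever attributes
# pydantic-ai's OpenTelemetry spans actually carry.
trace_record = {
    "prompt": [
        {"role": "system", "content": "Extract the invoice as JSON."},
        {"role": "user", "content": "Invoice #123, total $49.99, due 2024-07-01."},
    ],
    "response": '{"invoice_id": "123", "total": 49.99, "due": "2024-07-01"}',
    "validation_success": True,
    "validation_errors": None,  # would hold the Pydantic error details on failure
}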

The connection I’m seeing

There’s this whole trend of using schema validation as the reward signal for RL training:

  • OpenPipe ART - their production GRPO setup does exactly this
  • rLLM project - how they trained DeepSWE to 59% on SWE-Bench
  • dspy - has a fine-tuning optimiser implementation that looks very interesting

The pattern is always the same (a minimal sketch follows this list):

  1. Generate response
  2. Try to validate against schema
  3. Success = reward 1.0, failure = reward 0.0
  4. Train with GRPO to maximize validation success rate
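As a sketch, steps 2-3 amount to a try/except around schema validation. MyResultType below is just a stand-in for whatever Pydantic model your result_type points at:

from pydantic import BaseModel, ValidationError


class MyResultType(BaseModel):
    invoice_id: str
    total: float


def validation_reward(raw_response: str) -> float:
    """Reward 1.0 if the raw LLM output parses into the schema, else 0.0."""
    try:
        MyResultType.model_validate_json(raw_response)
        return 1.0
    except ValidationError:
        return 0.0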

What if…

What if there was just a simple utility to turn your existing OTel traces into training data?

from pydantic_ai.experimental import train_from_traces

# Point at your existing traces
traces = load_otel_traces("path/to/traces")  # or from your observability backend

# Train a LoRA adapter to improve validation success rate
adapter = train_from_traces(
    traces=traces,
    base_model="meta-llama/Llama-3.1-8B",
    target_schema=MyResultType,
    method="grpo",  # could support others later
)

# Use the improved model
agent = Agent(adapter, result_type=MyResultType)

Why this feels natural

You’re already doing the hard parts:

  • ✅ Structured output definition (result_type)
  • ✅ Validation with rich error details
  • ✅ Comprehensive tracing of everything
  • ✅ Clean abstractions for model swapping

The RL training bit is just: “hey, here’s 1000 examples of what worked and what didn’t, please get better at the thing that worked.”
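To make that concrete, here is a minimal agent in the same result_type style as the snippet above; the model name and prompt are placeholders. Each run is validated against the schema and, with tracing enabled, the prompt, raw response, and validation outcome all end up in the spans:

from pydantic import BaseModel
from pydantic_ai import Agent


class MyResultType(BaseModel):
    invoice_id: str
    total: float


agent = Agent("openai:gpt-4o", result_type=MyResultType)

# The run either returns a validated MyResultType or surfaces the
# validation errors; both outcomes are recorded in the trace.
result = agent.run_sync("Invoice #123, total $49.99.")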

Super low-lift starting point

It could literally just be a script (sketched below) that:

  1. Reads OTel traces from pydantic-ai agents
  2. Extracts (prompt, response, validation_success) tuples
  3. Runs basic GRPO training (plenty of open implementations)
  4. Outputs a LoRA adapter
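Here is a rough sketch of steps 1-2. The span attribute names ("prompt", "response", "validation_success") are placeholders for whatever the pydantic-ai OTel integration actually emits, and step 3 is left to any of the open GRPO implementations mentioned above:

import json
from pathlib import Path


def extract_training_tuples(trace_dir: str) -> list[tuple[str, str, bool]]:
    """Turn exported OTel trace JSON into (prompt, response, validation_success) tuples."""
    tuples = []
    for path in Path(trace_dir).glob("*.json"):
        for span in json.loads(path.read_text()).get("spans", []):
            attrs = span.get("attributes", {})
            if "validation_success" not in attrs:
                continue  # skip spans that aren't agent-run spans
            tuples.append(
                (attrs["prompt"], attrs["response"], bool(attrs["validation_success"]))
            )
    return tuples


# tuples = extract_training_tuples("path/to/traces")
# Feed the tuples into a GRPO trainer with reward = 1.0 if success else 0.0.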

No changes to the core framework needed. Just a bridge between your excellent tracing and the RL training world.

The cool part is it would work retroactively - any agent you’ve been running and tracing could potentially be improved just from its historical data.

Worth exploring? The data collection problem is already solved, which is usually the hard part.

Extensions:

Once the validation-based RL is implemented, we could look into using the evals framework as an input dataset for more nuanced rewards and further improvements towards agent outcomes.
