Build Better AI Agents with RL & Fine-Tuning (Kyle from OpenPipe) [AI Tinkerers - "One-Shot"]

Joe Heitzeberg — AI Tinkerers - "One-Shot"
June 18, 2025

Some teams assume the safest path to reliable AI agents is to keep upgrading to ever-larger models. Kyle Corbitt from OpenPipe shows otherwise: by fine-tuning a 14-billion-parameter open-source model with reinforcement learning (RL), it’s possible to cut error rates and actually outperform the best frontier models, o3 included, on challenging tasks, while also slashing latency from 5.5 s to 1 s.

Why this matters

The difference between an enterprise app that ships to real users and one that flops, or between a startup that reaches product-market fit and one that fails, could well rest on reliability, latency, and cost, especially for vertical agent and voice agent apps.

  • Reliability pain: 90 percent accuracy sounds good until your agent botches 1 in 10 emails.
  • Latency pain: Frontier models often push round-trip times past 5 seconds (or even more at peak times), which is unusable for voice or high-volume pipelines.
  • Cost pain: Every extra token on a premium API shows up on the invoice.

Corbitt walks us through a weekend project, detailed in the video and in this post, that tackles all three issues. Using the Enron email corpus as a sandbox, he built a natural-language email search agent, then fine-tuned it with RL against a reward function to hit 96 percent accuracy and 1 second latency.

“Even 90 percent felt low for simple factual queries, so we brought in fine-tuning and got to 96 percent—that’s a 60 percent drop in error rate.”
– Kyle
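
The arithmetic behind that quote is worth spelling out: going from 90 to 96 percent accuracy means the error rate falls from 10 percent to 4 percent. A minimal sketch of the calculation (the function name is ours, chosen for illustration):

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original errors that were eliminated."""
    err_before = 1.0 - acc_before  # 0.10 for 90 percent accuracy
    err_after = 1.0 - acc_after    # 0.04 for 96 percent accuracy
    return (err_before - err_after) / err_before


# 10 percent errors down to 4 percent errors: about 0.6, i.e. a 60 percent drop
print(round(relative_error_reduction(0.90, 0.96), 2))
```

Six accuracy points sounds modest, but measured against the errors that remained it is more than half of them gone.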

How it works (the short version)

Synthetic Q&A generation

  • Feed Gemini 2.5 Pro chunks of 20 emails.
  • Prompt it to write realistic questions, gold-standard answers, and a self-scored ‘realism’ rating.
  • Keep only samples with realism ≥ 0.9.
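
The chunking and filtering steps above can be sketched in a few lines of Python. This is an illustration, not the project's actual code: the `SyntheticSample` class and both helper functions are hypothetical names, and the call to Gemini itself is left out. Only the chunk size of 20 and the 0.9 threshold come from the post.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence

CHUNK_SIZE = 20          # from the post: 20-email chunks
REALISM_THRESHOLD = 0.9  # from the post: keep realism >= 0.9


@dataclass
class SyntheticSample:
    """One generated example (hypothetical structure)."""
    question: str
    answer: str
    realism: float  # self-scored by the generating model, 0.0 to 1.0


def chunk_emails(emails: Sequence[str], size: int = CHUNK_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive chunks of at most `size` emails."""
    for start in range(0, len(emails), size):
        yield emails[start:start + size]


def keep_realistic(samples: List[SyntheticSample],
                   threshold: float = REALISM_THRESHOLD) -> List[SyntheticSample]:
    """Drop samples the generator itself rated as unrealistic."""
    return [s for s in samples if s.realism >= threshold]
```

Each chunk would be sent to the generator model with a prompt asking for questions, answers, and a realism score; the filter then runs over whatever comes back.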

Reward function with partial credit

  • +0.1 if the agent’s search query surfaces the right email.
  • +0.1 more if it reads the right email.
  • Big bonus for the correct final answer, big penalty for hallucinating.
  • This scaffolding lets the model learn useful sub-skills before mastering the full task.
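
A reward function along these lines might look like the sketch below. The two 0.1 partial credits come from the post; the sizes of the "big bonus" and "big penalty" are not given, so the 1.0 and -1.0 values here are assumptions, as are the `RolloutSummary` structure and the choice to treat an "I don't know" answer as neutral.

```python
from dataclasses import dataclass
from typing import Optional

SEARCH_CREDIT = 0.1            # from the post
READ_CREDIT = 0.1              # from the post
CORRECT_BONUS = 1.0            # assumption: the post only says "big bonus"
HALLUCINATION_PENALTY = -1.0   # assumption: the post only says "big penalty"


@dataclass
class RolloutSummary:
    """What happened in one agent run (hypothetical structure)."""
    found_right_email: bool         # a search query surfaced the right email
    read_right_email: bool          # the agent opened the right email
    answer_correct: Optional[bool]  # None means the agent declined to answer


def reward(r: RolloutSummary) -> float:
    """Partial credit for sub-skills plus a large term for the final answer."""
    total = 0.0
    if r.found_right_email:
        total += SEARCH_CREDIT
    if r.read_right_email:
        total += READ_CREDIT
    if r.answer_correct is True:
        total += CORRECT_BONUS
    elif r.answer_correct is False:
        total += HALLUCINATION_PENALTY  # a confident wrong answer is penalized
    return total
```

Keeping the partial credits small relative to the final-answer term means the agent cannot score well just by searching and reading without ever answering correctly.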

Rollouts and GRPO training

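  • For each scenario, run six rollouts and score each one with the reward function; the spread of scores shows how hard the scenario is.
  • Update weights with GRPO, which reinforces rollouts that score above their group’s average and discourages those that score below it, using OpenPipe’s open-source Agent Reinforcement Trainer (ART).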
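
The core of GRPO is the group-relative advantage: each rollout's reward is compared with the mean of its own group, usually scaled by the group's standard deviation. The sketch below shows only that step in plain Python. It is not ART's API, and the function name and the `eps` value are our own choices.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each rollout relative to the other rollouts of the same scenario."""
    baseline = mean(rewards)  # how well the group did on average
    spread = pstdev(rewards)  # how much the rollouts disagreed
    # eps avoids division by zero when every rollout got the same reward
    return [(r - baseline) / (spread + eps) for r in rewards]


# Six rollouts of one scenario, scored by a reward function (made-up values)
advantages = group_advantages([1.2, 0.2, 0.1, 0.0, -1.0, 1.2])
```

Rollouts with a positive advantage are reinforced and those with a negative advantage are discouraged. If all six rollouts earn the same reward, every advantage is zero and the scenario contributes no learning signal, which is why scenarios that are far too easy or far too hard teach the model little.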

Tiny model, big result

  • Base: a 14B-parameter open model (fits on a single A100).
  • After RL: 96 percent correct, ~1 s average response, lower GPU bill.

Replicating these results at home

A great way to learn these techniques, of course, is to roll up your sleeves and try them at home. Kyle warns that even someone well-versed in Python and generative AI application building might need more than a weekend to reproduce his work, and suggests allocating several hours spread over a couple of weeks. That said, all of the code and steps are linked below in case you would like to speedrun things.

https://github.com/OpenPipe/email-deep-research - The project, called “Art•E(mail)”, trains AI models using reinforcement learning to search through email datasets and answer user queries in natural language. The goal is to create agents that could integrate with email providers (like Gmail plugins) to let users ask questions such as “what time does my wife’s flight arrive on Friday” and receive accurate answers based on their email content, using the Enron Email Dataset for training and benchmarking against commercial LLMs.

https://github.com/OpenPipe/ART - ART (Agent Reinforcement Trainer) is an open-source library that uses GRPO reinforcement learning to improve LLM performance in multi-turn agentic tasks by training models from their own experiences. It features a client-server architecture that lets you run agent workflows in your existing codebase while the backend handles the complex RL training loop, with examples ranging from game-playing agents (2048, Tic Tac Toe) to real-world applications like email retrieval.

Try it, tweak it, share it

Corbitt’s main lesson isn’t “use this exact model”; it’s “treat your agent like a new hire—give it feedback until it behaves.” With open tools and public data, you can prototype the loop this week, then swap in your own inbox, support tickets, or claim forms.

Where this fits in the fast-moving landscape

  • Frontier vs. focus: While Anthropic’s new Claude 4 Opus dazzles with benchmark wins, many production teams don’t need a frontier-scale model. RL fine-tuning lets you stay small without losing quality, and in a narrow domain it can deliver state-of-the-art results that may be unattainable with frontier models.
  • Open ecosystem: ART is Apache-2.0. No vendor lock-in, and your weights run anywhere—from on-prem clusters to consumer GPUs. For applications in many fields, including enterprise, pharmaceutical, and security, this could even be essential.

The next generation of agents will not just call tools, they will earn trust. Reinforcement learning is how we get there.
