Teaching an Agent From Outcomes: Reinforcement Learning for Multi-Step AI Processes

Prompt tuning and supervised fine-tuning both rely on examples of correct behaviour. For complex multi-step agent workflows, however, you often cannot hand-label every intermediate step; you can only say whether the overall outcome was good. Reinforcement learning is designed for that setting.

The labelling problem for multi-step agents

Imagine an agent that handles a customer support escalation: it reads a ticket, checks the account history, decides whether to look up additional context, drafts a response, and either sends it or routes to a human. At each step, the agent makes a decision. Some combinations of decisions lead to fast, correct resolutions. Others lead to slow, incorrect ones, or unnecessary human escalations.

How do you label this process for supervised fine-tuning? You would need an expert to review every intermediate step and mark it as correct or incorrect — in the context of all the steps before it. That is expensive, slow, and cognitively demanding. Worse, for many complex workflows, experts genuinely disagree about the right intermediate action. The right path through a multi-step process is often only visible in retrospect, once you know the outcome.

This is the problem reinforcement learning was designed for. Instead of requiring correct labels at every step, RL learns from outcomes: run the full episode, observe the result, score it, and update the policy to make better decisions next time. You need to be able to say whether the outcome was good — not whether each individual step was correct.
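
As a minimal sketch of that loop (not any particular framework's API), the code below runs one episode of a toy support workflow and attaches a single reward to the whole sequence of decisions. The names `Episode`, `run_episode`, `scripted_policy`, and `score_outcome` are hypothetical, and the scripted policy stands in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    actions: List[str]  # every decision the agent made, in order
    reward: float       # one score for the episode as a whole


def run_episode(
    policy: Callable[[str, List[str]], str],
    score_outcome: Callable[[str, List[str]], float],
    ticket: str,
    max_steps: int = 6,
) -> Episode:
    """Run the workflow to completion, then score only the final outcome."""
    actions: List[str] = []
    for _ in range(max_steps):
        action = policy(ticket, actions)
        actions.append(action)
        if action in ("send", "route_to_human"):  # terminal decisions
            break
    return Episode(actions, score_outcome(ticket, actions))


def scripted_policy(ticket: str, history: List[str]) -> str:
    # Stand-in for a model: follows a fixed plan, one action per step.
    plan = ["read_ticket", "check_account_history", "draft_response", "send"]
    return plan[len(history)]


def score_outcome(ticket: str, actions: List[str]) -> float:
    # Hypothetical outcome check: no intermediate step is graded on its own.
    resolved = actions[-1] == "send" and "draft_response" in actions
    return 1.0 if resolved else 0.0


episode = run_episode(scripted_policy, score_outcome, "refund request")
print(episode.actions, episode.reward)
```

Note that nothing in `score_outcome` says which intermediate action was right or wrong; the policy update has to work that out from many scored episodes.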

GRPO: how it works for language models

GRPO (Group Relative Policy Optimisation) is a reinforcement learning algorithm that is well suited to LLM-based agents. The key idea is straightforward: instead of scoring each attempt against an absolute standard, you score it against the other attempts the same policy made at the same task.

For each training step, the agent generates a group of rollouts: multiple independent attempts at the same task, starting from the same state. Each rollout is scored. The policy update then reinforces the rollouts that scored above the group average and discourages the ones that scored below it. The raw scores can be absolute; what drives the update is each rollout's advantage relative to its group.
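
The group comparison can be sketched in a few lines. This follows a common formulation of GRPO, in which each reward is centred on the group mean and divided by the group standard deviation; some variants skip the division. The function name and the `eps` parameter are illustrative, not taken from a specific library.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Turn the raw scores of one rollout group into group-relative advantages."""
    baseline = mean(rewards)  # the group average acts as the baseline
    spread = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout received the same score
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four rollouts of the same task: two scored well, two scored poorly.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Every rollout scored the same, so there is nothing to tell apart.
uniform = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

Under this formulation, above-average rollouts receive positive advantages and below-average ones negative advantages. A group whose rollouts all score the same yields zero advantages, so it contributes nothing to the update.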

This has a significant practical advantage: it does not require a value network, a separate model that estimates expected future reward at each step. For language models, a value network is typically comparable in size to the policy itself, which adds memory and compute, and it is difficult to train stably. GRPO sidesteps this. The group average serves as the baseline, so the relative comparison within the group provides the gradient signal directly, which makes the training loop simpler and lighter.

The result is a policy that learns from its own variance: when it sometimes handles a situation well and sometimes handles it poorly, GRPO identifies the difference and pushes the policy toward the better approach.

RULER: outcome evaluation without human labelling at scale

For RL to work, you need a way to score outcomes. For simple tasks — "did the code pass the tests?", "is the extracted value in the right format?" — scoring is trivial. For more complex tasks — "was this response helpful?", "did this escalation decision make sense?" — you need something more sophisticated.

RULER is an LLM-judge framework that evaluates completed episodes using a separate language model as the scorer. Rather than having a human review every outcome, a judge model — prompted with evaluation criteria, the task context, and the agent's output — assigns a score. This scales to thousands of episodes per training step without requiring human time.
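
To make the shape of an LLM judge concrete, here is a minimal sketch. It is not RULER's actual interface: `build_judge_prompt`, `parse_score`, and `judge_episode` are hypothetical helpers, the `SCORE:` reply format is an assumed convention, and `judge` is any callable that sends a prompt to a model and returns its text reply.

```python
import re
from typing import Callable


def build_judge_prompt(criteria: str, task_context: str, agent_output: str) -> str:
    """Assemble the three inputs the judge sees into one prompt."""
    return (
        "You are grading the outcome of an agent episode.\n\n"
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Task context:\n{task_context}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        "Reply with a single line of the form 'SCORE: <number between 0 and 1>'."
    )


def parse_score(reply: str) -> float:
    """Extract the numeric score and clamp it to the range [0, 1]."""
    match = re.search(r"SCORE:\s*([0-9]*\.?[0-9]+)", reply)
    if match is None:
        raise ValueError(f"judge reply contained no score: {reply!r}")
    return min(1.0, max(0.0, float(match.group(1))))


def judge_episode(
    judge: Callable[[str], str],
    criteria: str,
    task_context: str,
    agent_output: str,
) -> float:
    return parse_score(judge(build_judge_prompt(criteria, task_context, agent_output)))


# A stub stands in for the real judge model so the sketch runs offline.
def stub_judge(prompt: str) -> str:
    return "SCORE: 0.8"


score = judge_episode(
    stub_judge,
    criteria="The reply resolves the customer's issue accurately and politely.",
    task_context="Customer asks why they were charged twice this month.",
    agent_output="I have refunded the duplicate charge; it should arrive in 3-5 days.",
)
```

Raising on an unparseable reply, rather than defaulting to some score, keeps malformed judge output from silently entering the reward signal.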

The judge model is not infallible, but it does not need to be. It needs to be calibrated well enough that the gradient signal it produces pushes the agent policy in the right direction on average. In practice, a well-prompted judge on a well-defined task is sufficiently accurate to drive meaningful improvement. The occasional misscored episode is noise; the aggregate signal is real.

Crucially, you can validate the judge offline: run it against a set of examples where you know the correct score, measure its agreement with human raters, and adjust the prompt until it is well-calibrated. You do this once per task, not once per episode.
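
A minimal sketch of such an offline check, assuming you have a small set of episodes scored by both human raters and the judge on a 0 to 1 scale. The function name, the 0.5 pass threshold, and the example scores are all illustrative.

```python
from typing import List, Tuple


def judge_agreement(
    pairs: List[Tuple[float, float]], threshold: float = 0.5
) -> Tuple[float, float]:
    """Compare (human_score, judge_score) pairs.

    Returns the fraction of episodes where human and judge land on the same
    side of the pass threshold, and the mean absolute difference in score.
    """
    same_side = sum((h >= threshold) == (j >= threshold) for h, j in pairs)
    mean_abs_error = sum(abs(h - j) for h, j in pairs) / len(pairs)
    return same_side / len(pairs), mean_abs_error


# Illustrative data: (human score, judge score) for four episodes.
validation_set = [(1.0, 0.9), (0.0, 0.2), (1.0, 0.4), (0.0, 0.1)]
agreement, mae = judge_agreement(validation_set)
print(f"agreement={agreement:.2f}, mean abs error={mae:.2f}")
```

Here the judge disagrees with the human on the third episode, which is the kind of case worth reading before adjusting the judge prompt.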

OpenPipe ART: making this practical

OpenPipe ART (Agent Reinforcement Training) is the framework that makes GRPO-based RL training practical for Qwen and Llama-class models. It handles the training loop, rollout generation, reward computation, and policy update — the infrastructure that would otherwise take weeks to build and debug from scratch.

ART is designed for the specific case of fine-tuning open-weight models with reinforcement learning from outcome feedback. It integrates with the same model classes that LoRA fine-tuning targets, which means the RL-tuned policy can be initialised from a supervised fine-tuned checkpoint. In a full optimisation stack, fine-tuning and RL are not alternatives — they are sequential stages. You fine-tune to establish the baseline, then RL trains the policy on top of it.

What the policy learns to do

RL-trained agents develop behaviours that are genuinely difficult to capture in labelled examples. Because the training signal comes from complete episodes rather than individual steps, the policy learns multi-step strategies:

When to retry. A policy trained on outcome rewards learns that attempting a failed sub-task a second time — with a different approach — sometimes succeeds, and that the cost of the retry is worth it when the alternative is a failed episode.
When to escalate. The policy learns to recognise the signatures of situations where it is likely to fail, and to route those cases to human review before producing a bad output rather than after.
How to recover from partial failures. In multi-step workflows, an early mistake does not have to be fatal. An RL-trained policy learns recovery strategies — how to detect that something has gone wrong partway through and adapt the remaining steps accordingly.
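
One way such trade-offs can reach the policy is through the outcome score itself. The sketch below is an assumption for illustration, not a prescribed scheme: the `Outcome` fields and every numeric weight are hypothetical, chosen only so that a successful retry scores above an escalation, and an escalation scores above sending a wrong answer.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    resolved_correctly: bool  # the agent's own answer was right
    escalated: bool           # the agent routed the case to a human
    retries: int              # how many sub-task retries the episode used


def outcome_reward(
    outcome: Outcome,
    retry_cost: float = 0.05,
    escalation_value: float = 0.5,
    failure_penalty: float = -1.0,
) -> float:
    """Score a finished episode; all weights are illustrative assumptions."""
    if outcome.escalated:
        base = escalation_value  # safe, but it costs human time
    elif outcome.resolved_correctly:
        base = 1.0
    else:
        base = failure_penalty   # a wrong answer reached the customer
    return base - retry_cost * outcome.retries


retried_and_solved = outcome_reward(Outcome(True, False, retries=1))
escalated_early = outcome_reward(Outcome(False, True, retries=0))
sent_wrong_answer = outcome_reward(Outcome(False, False, retries=0))
```

With these weights, a retry that leads to a correct resolution is worth its cost, and escalating is preferable to sending a bad output, which matches the behaviours described above.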

These are the kinds of behaviours that make agents genuinely useful in production, rather than just impressive in demos. They require experience with complete episodes to learn. Supervised fine-tuning, however well executed, struggles to teach them, because labelled examples of correct intermediate behaviour do not capture the trade-offs and contingencies that only become visible across a full rollout.

What you get: a policy that compounds

The output of an RL training run is a policy — a model checkpoint — stored in your artifact registry. Like a fine-tuned checkpoint, it runs on your infrastructure and does not depend on an external API. Unlike a fine-tuned checkpoint, it was trained to make decisions across sequences of steps, not just to produce good individual outputs.

RL is the strategy that compounds most aggressively. Every production episode is a potential training signal. The system gets better the more it runs — not by accident, but by design.

This compounding dynamic is what makes RL fundamentally different from the other two tracks. Prompt tuning improves with each optimisation run you choose to execute. Fine-tuning improves when you collect new data and train a new checkpoint. RL can improve continuously, as long as production traffic is flowing, outcomes are being scored, and the training loop is running. Each interaction adds to the pool of episodes from which the next policy update will be computed.

In a production system at scale, this means the policy running your agent in six months can be materially better than the one you deployed at launch: not because anyone manually improved it, but because it has been learning from the episodes it has handled.

If you have a multi-step agent workflow where outcomes are easier to define than correct intermediate steps, RL is worth serious consideration. Get in touch to talk through whether your use case is a good fit and what the setup would involve.