Three RL modes, four environments, eight loss functions, ten graders and five recipes — with a teaching-metaphor API that opens agentic training to subject-matter experts.
Different agentic-training scenarios have different optimal patterns for orchestrating rollouts and training updates.
Pipelined per-step training with group-relative advantage centering. Best for GRPO-style training, basic agentic RL and math reasoning.
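A minimal sketch of group-relative advantage centering, assuming rewards are grouped by the prompt that produced them. The function name and the dictionary layout are illustrative, not part of the product API. Only mean-centering is shown; some GRPO variants also divide by the group standard deviation, which is omitted here.

```python
from statistics import mean

def group_relative_advantages(rewards_by_prompt):
    """Subtract each group's mean reward from every rollout in that group."""
    advantages = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        baseline = mean(rewards)  # per-group baseline, no learned critic
        advantages[prompt_id] = [r - baseline for r in rewards]
    return advantages

# Hypothetical rewards: four rollouts for each of two prompts.
groups = {"p1": [1.0, 0.0, 0.0, 1.0], "p2": [0.0, 0.0, 0.0, 1.0]}
adv = group_relative_advantages(groups)
```

Within each group the advantages sum to zero, so a rollout is reinforced only to the extent that it beat its siblings on the same prompt.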
Producer-consumer with a bounded staleness queue; stale samples discarded. Best for multi-agent RL, large-scale code RL and long-horizon tasks.
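The staleness rule can be sketched as follows, assuming each sample is tagged with the version of the policy that generated it and that staleness is measured in policy versions. `Sample`, `StalenessQueue` and `max_staleness` are hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sample:
    payload: str
    policy_version: int  # version of the policy that produced this rollout

class StalenessQueue:
    """FIFO buffer that drops samples older than max_staleness versions."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self._items = deque()
        self.discarded = 0

    def put(self, sample):
        self._items.append(sample)

    def get_batch(self, batch_size, current_version):
        batch = []
        while self._items and len(batch) < batch_size:
            sample = self._items.popleft()
            if current_version - sample.policy_version > self.max_staleness:
                self.discarded += 1  # too far off-policy: discard
                continue
            batch.append(sample)
        return batch
```

Producers call `put()` as rollouts finish; the trainer calls `get_batch()` with its current policy version, so the bound is enforced at consumption time.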
Parallel rollout workers stream into a queue; training begins when a minibatch fills, overlapping sampling and learning. Maximises GPU utilisation.
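A toy illustration of the streaming pattern using only the standard library: worker threads push rollouts into a shared queue, and the consumer triggers an update as soon as a minibatch is full rather than waiting for all workers to finish. The worker and trainer functions are stand-ins, and the update is only counted, not performed.

```python
import queue
import threading

def rollout_worker(worker_id, num_rollouts, out_queue):
    """Stand-in for a sampler: emits (worker_id, index) placeholders."""
    for i in range(num_rollouts):
        out_queue.put((worker_id, i))

def train_streaming(num_workers=4, rollouts_per_worker=5, minibatch_size=4):
    out_queue = queue.Queue()
    workers = [
        threading.Thread(target=rollout_worker, args=(w, rollouts_per_worker, out_queue))
        for w in range(num_workers)
    ]
    for t in workers:
        t.start()

    total = num_workers * rollouts_per_worker
    consumed, updates, minibatch = 0, 0, []
    while consumed < total:
        minibatch.append(out_queue.get())  # blocks until a rollout arrives
        consumed += 1
        if len(minibatch) == minibatch_size:
            updates += 1  # a real trainer would take a gradient step here
            minibatch.clear()

    for t in workers:
        t.join()
    return updates
```

With the default arguments, 20 rollouts arrive in some interleaved order and fill five minibatches of four.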
Each loss function is implemented with the technical details that matter for production training, not just textbook correctness.
| Loss | Description |
|---|---|
| Cross-entropy | Standard supervised loss. Used for SFT-style stages and reward-model training. |
| Label-smoothing | Cross-entropy with smoothed targets. Improves calibration on classification-style tasks. |
| Focal loss | Down-weights easy examples to focus learning on hard cases. Useful when reward is heavily skewed. |
| Importance-sampling | Reweights samples by the new/old policy probability ratio. Foundation for off-policy learning. |
| PPO | Clipped policy gradient (ε=0.2 default). The classical online RLHF loss. |
| CISPO | Clipped importance-sampled policy optimization with stop-gradient on the clipped ratio — used by recent frontier agentic systems. |
| DRO | KL-penalized REINFORCE. Useful when reward is well-shaped and policy gradient suffices. |
| Bradley-Terry | Pairwise preference loss. The foundation underlying DPO and reward-model training. |
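To make two of the rows above concrete, here is a scalar sketch of the PPO clipped objective (with the ε=0.2 default from the table) and the Bradley-Terry pairwise loss. These are the textbook per-sample forms written in plain Python for readability. A production implementation would operate on batched tensors and is not shown here. The function names are illustrative.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Negative clipped surrogate for one sample."""
    ratio = math.exp(logp_new - logp_old)           # new/old probability ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped * advantage)

def bradley_terry_loss(score_chosen, score_rejected):
    """-log(sigmoid(margin)), computed as a numerically stable softplus."""
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The `min` in the PPO loss removes any incentive to push the ratio beyond the clip range in the direction the advantage favors. The Bradley-Terry loss shrinks toward zero as the chosen response's score exceeds the rejected one's by a growing margin.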
An environment is the substrate against which an agent generates rollouts and receives rewards. Custom environments, typically MCP-wrapped enterprise systems, register through a dedicated endpoint.
Random arithmetic with extracted numeric answers. Binary reward with optional format penalty. Useful for testing the RL stack itself.
Grade-school word problems with chain-of-thought. Answer extraction via regex with comma handling. The standard reasoning benchmark.
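One plausible way to do regex answer extraction with comma handling is sketched below. The pattern and the "take the last number in the response" convention are assumptions for illustration, not the environment's actual rules.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in the text as a float, or None if absent."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))  # "1,200" -> 1200.0

def is_correct(text, target, tol=1e-6):
    """Binary reward check: does the extracted answer match the target?"""
    value = extract_final_number(text)
    return value is not None and abs(value - target) <= tol
```

Stripping commas before conversion means an answer written as `1,200` is scored the same as `1200`.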
Function-signature-plus-tests in a strict sandbox. AST validation blocks dangerous imports and forks; 5-second timeout, 256 MB cap. Reward = fraction of passing tests.
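The AST check and the reward rule could look roughly like this. The blocklists are hypothetical examples, and the sketch deliberately stops short of executing anything: the timeout and memory cap belong to the sandbox itself and are not reproduced here.

```python
import ast

# Hypothetical blocklists; the real sandbox's lists are not documented here.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"fork", "system", "exec", "eval"}

def is_safe(source):
    """Reject code that fails to parse or uses a blocked import or call."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BLOCKED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                return False
        elif isinstance(node, ast.Call):
            func = node.func
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name in BLOCKED_CALLS:
                return False
    return True

def pass_fraction(test_results):
    """Reward = fraction of passing tests (0.0 when there are no tests)."""
    return sum(test_results) / len(test_results) if test_results else 0.0
```

Static checks like this are a first filter only; they complement process-level isolation rather than replace it.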
Open-ended evaluation with a configurable rubric. Four score-parsing patterns; default rubric covers correctness, helpfulness, clarity and completeness.
A grader converts a rollout into a reward signal.
Agentic systems live or die on tool-call accuracy. The Bud tool-call grader handles the formats production models actually emit (function-call tags, XML blocks, markdown JSON, raw JSON) with graded scoring: 0 for no call, partial credit (0.3 / 0.6) for the right function with imperfect arguments, full credit (1.0) for an exact match. Partial credit makes the gradient signal informative: the agent is rewarded for refining a nearly correct call instead of receiving only binary correct/incorrect feedback.
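The scoring tiers can be sketched on already-parsed calls. Two things here are assumptions, because the text does not specify them: what separates the 0.3 and 0.6 tiers (this sketch uses "same argument names, different values" for 0.6), and the score for a call to the wrong function (treated as 0 here). Parsing the various emitted formats is out of scope for the sketch.

```python
def grade_tool_call(predicted, expected):
    """Score a parsed tool call of the form {"name": str, "arguments": dict}."""
    if not predicted or "name" not in predicted:
        return 0.0  # no call emitted
    if predicted["name"] != expected["name"]:
        return 0.0  # assumption: wrong function earns no credit
    pred_args = predicted.get("arguments", {})
    exp_args = expected.get("arguments", {})
    if pred_args == exp_args:
        return 1.0  # exact match
    if set(pred_args) == set(exp_args):
        return 0.6  # assumption: right argument names, some wrong values
    return 0.3      # assumption: right function, missing or extra arguments

expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
```

Under these assumptions a call that names the right function but omits `unit` scores 0.3, one that supplies both arguments with the wrong city scores 0.6, and only the exact call scores 1.0.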
Most of the people who know what an agent should do better are not researchers: they are customer-service leads, compliance officers and clinical specialists. Simplified ART lets them improve agents directly. The capability ceiling is unchanged; what widens is the set of people who can reach it.
| Concept | What it does |
|---|---|
| Student | The model wrapper. Loads the base model, attaches a skill module (LoRA adapter), exposes chat() and improve(). The Student is what gets trained. |
| Coach | The orchestrator. Runs eval cycles, trains identified weaknesses, records progress, and auto-stops on plateau. |
| Curriculum | The training set, from JSONL, CSV, examples, HuggingFace datasets or inline. Path-traversal protection, 100k-example cap. |
| Recipe | Pre-built configs for five scenarios: Reasoning, Code Assistant, Customer Support, Tool Use and Safety Alignment (DPO). Each ships defaults, rubric and cost estimate. |
| Improvement Plan | User-friendly knobs that compile to full RL: practice modes, a drift-guard slider mapped to KL coefficient, effort presets, creativity slider, auto-stop. |
| Feedback Collector | Production signals to training data: thumbs become binary pairs, corrections become SFT examples, preferences become DPO pairs. |
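As an illustration of the Feedback Collector row, the mapping from production signals to training records might look like the sketch below. The event fields (`kind`, `thumbs_up`, `corrected`, and so on) and the output record shapes are hypothetical, and "binary pairs" is interpreted here as a response paired with a 0/1 label.

```python
def feedback_to_training_record(event):
    """Convert one feedback event (a dict) into a training record."""
    kind = event["kind"]
    if kind == "thumbs":
        # Thumbs up/down becomes a (response, binary label) record.
        return {
            "type": "binary",
            "prompt": event["prompt"],
            "response": event["response"],
            "label": 1 if event["thumbs_up"] else 0,
        }
    if kind == "correction":
        # A user-corrected answer becomes a supervised (SFT) example.
        return {"type": "sft", "prompt": event["prompt"], "completion": event["corrected"]}
    if kind == "preference":
        # An A/B choice becomes a chosen/rejected (DPO) pair.
        return {
            "type": "dpo",
            "prompt": event["prompt"],
            "chosen": event["chosen"],
            "rejected": event["rejected"],
        }
    raise ValueError(f"unknown feedback kind: {kind!r}")
```

Keeping the three record types distinct lets each one feed the training stage it suits: labels for reward signals, corrections for supervised steps, pairs for preference optimization.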