Bud Model Foundry Deep dive · 02

Agentic RL & Simplified ART

Three RL modes, four environments, eight loss functions, ten graders and five recipes — with a teaching-metaphor API that opens agentic training to subject-matter experts.

The next frontier of post-training

Three RL training modes — the patterns frontier systems use.

Different agentic-training scenarios have different optimal patterns for orchestrating rollouts and training updates.

Sync

On-policy

Pipelined per-step training with group-relative advantage centering. Best for GRPO-style training, basic agentic RL and math reasoning.
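Group-relative advantage centering can be sketched in a few lines. This is a minimal illustration, not the Foundry implementation: the function name and the dictionary layout are assumptions. It shows centering only, as described above; GRPO-style setups often also divide by the group's standard deviation.

```python
from statistics import mean


def group_relative_advantages(rewards_by_prompt):
    """Centre each rollout's reward on the mean reward of its own group.

    rewards_by_prompt maps a prompt id to the rewards of all rollouts
    sampled for that prompt (hypothetical layout, for illustration only).
    """
    advantages = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        baseline = mean(rewards)  # the group mean acts as the baseline
        advantages[prompt_id] = [r - baseline for r in rewards]
    return advantages


example = group_relative_advantages({
    "q1": [1.0, 0.0, 0.0, 1.0],
    "q2": [0.0, 0.0, 0.0, 1.0],
})
print(example["q2"])  # the single successful rollout gets the only positive advantage
```

Because the baseline is computed per prompt, a rollout is compared only with other attempts at the same task, so an easy prompt does not inflate the signal of a hard one.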

Async

Off-policy

Producer-consumer with a bounded staleness queue; stale samples discarded. Best for multi-agent RL, large-scale code RL and long-horizon tasks.
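The staleness rule can be pictured as a filter applied when the trainer pulls from the queue. The sketch below is an assumption-laden illustration: `Sample`, `StalenessQueue`, `max_size` and `max_staleness` are hypothetical names, and staleness is measured here as the gap between the current policy version and the version that generated the sample.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int  # policy version that produced the rollout
    reward: float


class StalenessQueue:
    """Rollout buffer that drops samples older than max_staleness versions."""

    def __init__(self, max_staleness, max_size=1024):
        self.max_staleness = max_staleness
        self.discarded = 0
        # When the buffer is full, appending evicts the oldest sample.
        self._items = deque(maxlen=max_size)

    def put(self, sample):
        self._items.append(sample)

    def take_fresh(self, current_version, n):
        """Pop up to n samples that are still fresh enough to train on."""
        fresh = []
        while self._items and len(fresh) < n:
            sample = self._items.popleft()
            if current_version - sample.policy_version <= self.max_staleness:
                fresh.append(sample)
            else:
                self.discarded += 1  # too far off-policy: drop it
        return fresh


queue = StalenessQueue(max_staleness=1)
for version in (1, 2, 5, 6):
    queue.put(Sample(policy_version=version, reward=0.0))
batch = queue.take_fresh(current_version=6, n=4)
```

In a real producer-consumer setup the queue would be shared across processes or machines; the single-process version here only shows the discard logic.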

Streaming

Minibatch

Parallel rollout workers stream into a queue; training begins when a minibatch fills, overlapping sampling and learning. Maximises GPU utilisation.

Unified registry

Eight loss functions for the RL stack.

Each is implemented with the technical details that matter for production training — not just textbook correctness.

Loss | Description
Cross-entropy

Standard supervised loss. Used for SFT-style stages and reward-model training.

Label-smoothing

Cross-entropy with smoothed targets. Improves calibration on classification-style tasks.

Focal loss

Down-weights easy examples to focus learning on hard cases. Useful when reward is heavily skewed.

Importance-sampling

Reweights samples by the new/old policy probability ratio. Foundation for off-policy learning.

PPO

Clipped policy gradient (ε=0.2 default). The classical online RLHF loss.

CISPO

Clipped importance-sampled policy optimization with stop-gradient on the clipped ratio — used by recent frontier agentic systems.

DRO

KL-penalized REINFORCE. Useful when reward is well-shaped and policy gradient suffices.

Bradley-Terry

Pairwise preference loss. The foundation underlying DPO and reward-model training.
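For readers who prefer formulas to descriptions, three of the losses above can be written per sample in plain Python. This is a sketch under stated assumptions: the function names are illustrative, the inputs are scalar log-probabilities and scores, and a production version would operate on tensors with automatic differentiation.

```python
import math


def importance_ratio(logp_new, logp_old):
    """Probability ratio of the new policy to the old policy for one sample."""
    return math.exp(logp_new - logp_old)


def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate loss: the negated minimum of the raw and clipped terms."""
    ratio = importance_ratio(logp_new, logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped * advantage)


def bradley_terry_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(score_chosen - score_rejected)).
    """
    return math.log1p(math.exp(-(score_chosen - score_rejected)))


# A ratio of 1.5 with a positive advantage is clipped to 1 + eps = 1.2.
clipped_gain = ppo_clip_loss(math.log(1.5), 0.0, advantage=1.0)
# With a negative advantage the unclipped term is the smaller one, so it is kept.
unclipped_penalty = ppo_clip_loss(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry in the two calls is the point of the clip: the policy cannot be rewarded for moving far from the old policy, but it is still fully penalised when a large move makes things worse.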

Built-in environments

Four environments — plus your own enterprise tools.

An environment is the substrate against which an agent generates rollouts and receives rewards. Custom environments register through a dedicated endpoint — typically MCP-wrapped enterprise systems.

Arithmetic

Random arithmetic with extracted numeric answers. Binary reward with optional format penalty. Useful for testing the RL stack itself.

Math reasoning (GSM8K)

Grade-school word problems with chain-of-thought. Answer extraction via regex with comma handling. The standard reasoning benchmark.
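Answer extraction with comma handling might look like the following. The pattern and the "take the last number in the completion" rule are assumptions made for illustration; the actual extraction rules are not specified here.

```python
import re

# A number with optional sign, thousands separators and decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(completion):
    """Return the last number in the completion as a float, or None."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))  # "1,300" -> 1300.0


def binary_reward(completion, target):
    """1.0 when the extracted answer equals the target, else 0.0."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == target else 0.0


reward = binary_reward("She pays 1,250 + 50 = $1,300.", 1300.0)
```

Stripping the separators before parsing means "1,300" and "1300" earn the same reward, which keeps the model from being penalised for formatting alone.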

Code execution

Function-signature-plus-tests in a strict sandbox. AST validation blocks dangerous imports and forks; 5-second timeout, 256 MB cap. Reward = fraction of passing tests.
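The static check and the reward rule can be sketched with Python's standard `ast` module. The blocklists below are hypothetical examples, not the sandbox's actual policy, and the timeout and memory cap are not shown because they require running the code in a separate, resource-limited process.

```python
import ast

# Hypothetical blocklists, for illustration only.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"fork", "system", "eval", "exec"}


def find_violations(source):
    """Return a list of blocked constructs found in the submitted source."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["syntax error"]
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                violations.append(f"from {node.module} import")
        elif isinstance(node, ast.Call):
            func = node.func
            # Covers both attribute calls (os.fork()) and bare names (eval()).
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name in BLOCKED_CALLS:
                violations.append(f"call {name}()")
    return violations


def pass_fraction(test_results):
    """Reward = fraction of passing tests; an empty test list scores 0.0."""
    return sum(test_results) / len(test_results) if test_results else 0.0


flagged = find_violations("import os\nos.fork()")
clean = find_violations("def add(a, b):\n    return a + b")
```

Static validation of this kind is a first filter only; it rejects obviously unsafe submissions cheaply before anything is executed.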

LLM-as-judge

Open-ended evaluation with a configurable rubric. Four score-parsing patterns; default rubric covers correctness, helpfulness, clarity and completeness.

Grader catalog

Ten graders — deterministic, learned and hybrid.

A grader converts a rollout into a reward signal.

exact_match, contains, json_valid, regex, format, length_bounds, code_execution, ai_judge (RULER), tool_call, auto

The tool-call grader — why it matters

Agentic systems live or die on tool-call accuracy. The Bud tool-call grader handles the formats production models actually emit (function-call tags, XML blocks, markdown JSON and raw JSON) with graded scoring: 0 for no call, partial credit (0.3 or 0.6) for the right function with imperfect arguments, and full credit (1.0) for an exact match. Partial credit makes the gradient signal informative: the agent is rewarded for getting closer, so it learns to refine its calls rather than receiving only correct/incorrect feedback.
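A graded tool-call score could be computed along these lines. Several details are assumptions for the sake of a runnable sketch: only raw JSON and markdown-fenced JSON are parsed, a call to the wrong function scores 0, and the split between 0.3 and 0.6 (wrong argument names versus right names with wrong values) is a guess at how the two partial tiers might be separated.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example readable
_FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_tool_call(text):
    """Extract a {"name": ..., "arguments": ...} object, or return None."""
    match = _FENCED_JSON.search(text)
    candidate = match.group(1) if match else text.strip()
    try:
        call = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


def grade_tool_call(text, expected):
    """Score a model output against the expected call on a 0 / 0.3 / 0.6 / 1.0 scale."""
    call = parse_tool_call(text)
    if call is None or call["name"] != expected["name"]:
        return 0.0  # no call, or (by assumption) the wrong function
    got, want = call.get("arguments", {}), expected["arguments"]
    if got == want:
        return 1.0  # exact match
    if isinstance(got, dict) and set(got) == set(want):
        return 0.6  # right argument names, at least one wrong value
    return 0.3  # right function, wrong or missing argument names


expected = {"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}
fenced = FENCE + 'json\n{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "F"}}\n' + FENCE
score = grade_tool_call(fenced, expected)
```

The intermediate tiers are what distinguish this from an exact-match grader: a call that names the right function but passes one wrong value still outranks a call that gets the argument names wrong.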

Simplified ART

A teaching metaphor that compiles to full RL.

Most people who know what an agent should do better are not researchers — they are customer-service leads, compliance officers, clinical specialists. Simplified ART lets them improve agents directly. The capability ceiling is unchanged; the access surface is dramatically widened.

Concept | What it does
Student

The model wrapper. Loads the base model, attaches a skill module (LoRA adapter), exposes chat() and improve(). The Student is what gets trained.

Coach

The orchestrator. Runs eval cycles, trains on identified weaknesses, records progress, and auto-stops on plateau.

Curriculum

The training set, loaded from JSONL, CSV, examples, Hugging Face datasets or inline data. Includes path-traversal protection and a 100k-example cap.

Recipe

Pre-built configs for five scenarios: Reasoning, Code Assistant, Customer Support, Tool Use and Safety Alignment (DPO). Each ships defaults, rubric and cost estimate.

Improvement Plan

User-friendly knobs that compile to full RL: practice modes, a drift-guard slider mapped to KL coefficient, effort presets, creativity slider, auto-stop.

Feedback Collector

Turns production signals into training data: thumbs become binary pairs, corrections become SFT examples, preferences become DPO pairs.
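The Feedback Collector's three mappings can be illustrated as a small conversion function. Everything about the record shapes below is hypothetical (the event fields, the `type` tags and the output keys); the sketch only shows how each kind of production signal might land in a different training format.

```python
def to_training_record(event):
    """Convert one production feedback event into a training record."""
    kind = event["kind"]
    if kind == "thumbs":
        # Thumbs up/down: the response paired with a binary label.
        return {
            "type": "binary",
            "prompt": event["prompt"],
            "response": event["response"],
            "label": 1 if event["thumbs_up"] else 0,
        }
    if kind == "correction":
        # A user-supplied correction becomes the supervised target.
        return {
            "type": "sft",
            "prompt": event["prompt"],
            "completion": event["corrected"],
        }
    if kind == "preference":
        # A side-by-side choice becomes a chosen/rejected pair for DPO.
        return {
            "type": "dpo",
            "prompt": event["prompt"],
            "chosen": event["chosen"],
            "rejected": event["rejected"],
        }
    raise ValueError(f"unknown feedback kind: {kind!r}")


record = to_training_record({
    "kind": "correction",
    "prompt": "What is our refund window?",
    "response": "14 days",
    "corrected": "30 days from delivery",
})
```

Keeping the three outputs in separate formats lets each feed the training stage it suits, rather than forcing every signal into a single reward number.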

Reasoning, Code Assistant, Customer Support, Tool Use, Safety Alignment
