Bud Model Foundry Deep dive · 01

Training Core & Bud Tinker

Six stages, seven fine-tuning methods, nine quantization formats and ten optimizers — plus the eight step-level primitives that make the training loop a first-class citizen.

Full-spectrum coverage

Six training stages — each a first-class workflow.

Stages can be chained with full lineage preserved: continued pre-training → SFT → preference optimization, all on the same platform.

Continued pre-training

Causal-LM loss on token-streaming corpora to inject domain vocabulary and knowledge before instruction tuning.

SFT

Supervised fine-tuning

Instruction-response pairs with completion-only loss masking. Chat templates per model family, auto-detected.
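Completion-only loss masking can be sketched in a few lines. This is an illustration rather than the platform's implementation: the helper names are hypothetical, the sentinel value -100 is an assumption (it is the default ignore index in PyTorch's cross-entropy loss), and the one-position shift used in next-token prediction is left out for brevity.

```python
IGNORE_INDEX = -100  # assumed sentinel; matches PyTorch's default ignore_index


def mask_prompt_labels(prompt_ids, completion_ids):
    """Build (input_ids, labels) so that prompt positions carry no loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over the unmasked (completion) positions."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


# Three prompt tokens followed by two completion tokens (token ids are made up).
input_ids, labels = mask_prompt_labels([101, 7, 8], [42, 43])

# Per-position log-probabilities of the correct token (made-up numbers).
loss = masked_nll([-2.0, -2.0, -2.0, -0.5, -1.5], labels)
```

Only the last two positions contribute, so the poorly predicted prompt tokens do not move the loss.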

Reward modelling

Train a reward model on preference data — a critic for PPO and a verifier in evaluation.

PPO

Online RLHF

Proximal Policy Optimization with a clipped surrogate objective, group-relative advantage centering and KL-penalty stabilization.
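The three ingredients named here can be sketched per token. This assumes the clipping is applied to the policy probability ratio, as in the standard PPO objective, and that the KL term is added to the loss with a simple log-ratio estimator; the function names, `clip_eps` and `kl_coef` are illustrative, not platform parameters.

```python
import math


def center_advantages(rewards):
    """Group-relative centering: subtract the mean reward of the sampled group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def ppo_token_loss(logp_new, logp_old, logp_ref, advantage,
                   clip_eps=0.2, kl_coef=0.05):
    """Clipped surrogate loss for one token plus a KL penalty to a reference policy."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the clipped and unclipped objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Simple per-token estimate of KL(new || ref); an assumed choice of estimator.
    kl_estimate = logp_new - logp_ref
    return -surrogate + kl_coef * kl_estimate


advantages = center_advantages([1.0, 0.0, 2.0, 1.0])
# The ratio exp(0.5) is about 1.65, so it is clipped to 1.2 for a positive advantage.
loss = ppo_token_loss(logp_new=-1.0, logp_old=-1.5, logp_ref=-1.0, advantage=1.0)
```

Clipping caps how much a single update can gain from pushing the policy away from the one that generated the rollouts.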

DPO

Offline preference

Direct Preference Optimization with sigmoid, hinge, IPO, KTO-pair, ORPO and SimPO variants — no separate reward model.
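Three of the listed variants differ only in the function applied to the same implicit-reward margin. The sketch below uses the published sigmoid, hinge and IPO forms; the function name and argument layout are hypothetical, and the inputs are summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1, variant="sigmoid"):
    """Preference loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    if variant == "sigmoid":
        # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
        x = -beta * margin
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    if variant == "hinge":
        return max(0.0, 1.0 - beta * margin)
    if variant == "ipo":
        # IPO regresses the margin toward 1 / (2 * beta) instead of maximizing it.
        return (margin - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unsupported variant: {variant!r}")


no_preference = dpo_loss(0.0, 0.0, 0.0, 0.0)            # margin 0
clear_preference = dpo_loss(5.0, -5.0, 0.0, 0.0)        # margin 10
```

No reward model appears anywhere: the reference log-probabilities play that role.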

KTO

Asymmetric preference

Kahneman–Tversky Optimization learns from unpaired good/bad labels and tolerates imbalanced positive/negative examples. Useful for safety alignment.

Fine-tuning methods

Seven methods, one uniform configuration surface.

Method auto-tuning cascades from full fine-tuning down to QLoRA based on available VRAM.

Full FT

Highest quality · highest VRAM

All parameters trainable. Use when the GPU budget allows.

LoRA

Standard PEFT

Low-rank adapters, rank 1–256 (default 32). Targets all-linear or attention-only.

QLoRA

4-bit + adapters

4-bit base model + LoRA adapters. Fits a 70B model on a single 80GB GPU.

DoRA

Direction-decomposed

Decomposes each weight into magnitude and direction components and adapts them separately. Often reaches better quality than LoRA at a similar rank.

LoRA+

Asymmetric LRs

Different learning rates for A and B matrices. Outperforms LoRA on certain workloads.

OFT

Orthogonal FT

Constrains updates to orthogonal transformations, preserving base-model behaviour while adapting.

Layer freeze

Top-N freezing

Freeze the lower layers and train only the top N. A fast, low-memory classical baseline.
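The VRAM-based cascade mentioned at the top of this section can be pictured as a first-fit search over rough memory estimates. Everything below is an assumption made for illustration: the bytes-per-parameter figures are common rules of thumb, activations and adapter state are ignored, and `pick_method` and the 80% headroom factor are hypothetical.

```python
GB = 10 ** 9

# Rough bytes per base-model parameter, ordered from most to least demanding.
BYTES_PER_PARAM = [
    ("full", 16.0),   # bf16 weights and grads, fp32 master copy, two fp32 AdamW moments
    ("lora", 2.0),    # frozen bf16 base model
    ("qlora", 0.5),   # frozen 4-bit base model
]


def pick_method(num_params, vram_bytes, headroom=0.8):
    """Return the first method whose estimated footprint fits, or None."""
    budget = vram_bytes * headroom
    for method, bytes_per_param in BYTES_PER_PARAM:
        if num_params * bytes_per_param <= budget:
            return method
    return None


small = pick_method(1e9, 80 * GB)     # 16 GB estimate fits: full fine-tuning
medium = pick_method(7e9, 80 * GB)    # 112 GB is too much, 14 GB fits: LoRA
large = pick_method(70e9, 80 * GB)    # only the 35 GB 4-bit base fits: QLoRA
```

The last case is the arithmetic behind the 70B-on-one-80GB-GPU claim: 70 billion parameters at half a byte each is about 35 GB before adapters and activations.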

Fit any model on any GPU

Nine quantization formats · ten optimizer families.

Quantization is what makes commodity hardware viable for serious training. Optimizers span standard, memory-efficient, advanced and research-derived choices.

Quantization formats

BNB 4/8-bit · GPTQ · AWQ · AQLM · Quanto · EETQ 8-bit · HQQ · MXFP4 · FP8

Optimizer families

AdamW · AdamW 8-bit / paged · Lion · Sophia · AdaLoMo · AdEMAMix · GaLore · Apollo · BAdam · Adafactor / SGD

Pause and resume with bit-exact reproducibility

Model, optimizer, scheduler, gradient-scaler, CPU + CUDA RNG state, gradient-accumulation count and the deferred-unscale flag are all preserved — so a paused job resumes bit-identically. No silent divergence. Critical for long runs and shared, capacity-constrained clusters.
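Why the RNG state belongs in the checkpoint is easy to show at toy scale. The sketch below uses Python's standard `random` module in place of the CPU and CUDA generators, and a single float in place of model and optimizer state; `train_steps`, `save_state` and `load_state` are hypothetical helpers.

```python
import random


def train_steps(weight, rng, steps):
    """Toy training loop whose updates depend on random draws."""
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)  # stands in for data order, dropout, etc.
        weight -= 0.01 * (weight - noise)
    return weight


def save_state(weight, rng, step):
    """Capture everything the next step depends on, including the RNG state."""
    return {"weight": weight, "rng_state": rng.getstate(), "step": step}


def load_state(ckpt):
    rng = random.Random()
    rng.setstate(ckpt["rng_state"])
    return ckpt["weight"], rng, ckpt["step"]


# Reference: ten uninterrupted steps.
uninterrupted = train_steps(1.0, random.Random(1234), 10)

# Same run, paused after four steps and resumed from the checkpoint.
rng = random.Random(1234)
ckpt = save_state(train_steps(1.0, rng, 4), rng, 4)
weight, restored_rng, step = load_state(ckpt)
resumed = train_steps(weight, restored_rng, 10 - step)

# Resuming with a fresh generator instead of the saved one silently diverges.
diverged = train_steps(ckpt["weight"], random.Random(999), 10 - step)
```

Here `resumed` equals `uninterrupted` exactly, while `diverged` does not, even though both start from the same saved weight.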

Beyond the obvious

The engineering details that decide production usability.

Mixed-precision in bf16 / fp16 / fp32 with automatic fallback when hardware lacks the requested format.

Gradient checkpointing and accumulation, both validated against the chosen optimizer.

Flash Attention 2 and 3 with automatic version selection.

Custom CUDA kernels via Liger Kernel for accelerated forward and backward passes.

FP8 training with three backends (auto, MS-AMP, torchao) — the right backend per hardware generation.

Ten LR scheduler variants, including cosine, linear, constant, cosine-with-restarts, polynomial, reduce-on-plateau, WSD (warmup-stable-decay) and inverse-sqrt.

Custom GradScaler with a loss scale of 1024 for stable mixed-precision training on long sequences.

Four-tier checkpoint persistence (GPU → CPU → disk → object storage) with keep-best-N, SHA-256 checksums and validation gates.
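The keep-best-N policy and checksum gate from the last item can be sketched for a single tier (local disk). The `CheckpointStore` class, its file naming and the use of evaluation loss as the ranking metric are assumptions for illustration; the GPU, CPU and object-storage tiers are not modelled.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class CheckpointStore:
    def __init__(self, directory, keep_best=2):
        self.directory = directory
        self.keep_best = keep_best
        self.entries = []  # (eval_loss, path, checksum), best first

    def save(self, step, payload, eval_loss):
        path = os.path.join(self.directory, f"step_{step}.ckpt")
        with open(path, "wb") as f:
            f.write(payload)
        self.entries.append((eval_loss, path, sha256_of(path)))
        self.entries.sort(key=lambda entry: entry[0])
        # Delete everything that fell out of the best-N window.
        for _, stale_path, _ in self.entries[self.keep_best:]:
            os.remove(stale_path)
        self.entries = self.entries[:self.keep_best]

    def load_best(self):
        _, path, checksum = self.entries[0]
        # Validation gate: refuse to load a file whose contents changed.
        if sha256_of(path) != checksum:
            raise IOError(f"checksum mismatch for {path}")
        with open(path, "rb") as f:
            return f.read()


with tempfile.TemporaryDirectory() as directory:
    store = CheckpointStore(directory, keep_best=2)
    store.save(100, b"weights-a", eval_loss=0.9)
    store.save(200, b"weights-b", eval_loss=0.7)
    store.save(300, b"weights-c", eval_loss=0.8)
    kept = sorted(os.listdir(directory))
    best = store.load_best()
```

After the third save, the step-100 file is pruned because it has the worst evaluation loss of the three.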

Bud Tinker · step-level control

The eight primitives of a training loop, as API endpoints.

Each is a REST endpoint, an SDK method, and an MCP-callable tool — pure PyTorch with full state preservation, so a single forward-backward call carries the same auth, audit and encryption as a production pipeline.

forward()

Forward pass. Returns logits and hidden states; loss when labels are provided.

backward()

Backward pass with gradient-accumulation control. Returns gradient norm and overflow flag.

step()

Apply optimizer + scheduler step with optional gradient clipping. Mixed-precision aware.

zero_grad()

Clear accumulated gradients. Required between independent training iterations.

generate()

Sampling-based generation returning text, tokens and per-token logprobs. The basis for RL rollouts.

logprobs()

Per-token log-probabilities for target sequences. Used for KL-penalty and importance-sampling.

save()

Full state checkpoint with RNG state preserved — the basis for bit-identical pause/resume.

load()

Restore full training state. Multiple sessions can pause and resume without corrupting one another.
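The order in which the first four primitives are typically called, with gradient accumulation and clipping, can be sketched against an in-process stand-in. `ToySession` is hypothetical: it is not the Bud Tinker SDK, it fits a single weight in plain Python, and its argument names are assumptions; only the method names come from the list above.

```python
class ToySession:
    """In-process stand-in for a step-level session; fits y = w * x by MSE."""

    def __init__(self, weight=0.0, lr=0.1):
        self.weight = weight
        self.lr = lr
        self.grad = 0.0
        self._cache = None

    def forward(self, x, label):
        pred = self.weight * x
        self._cache = (x, pred - label)
        return {"logits": pred, "loss": (pred - label) ** 2}

    def backward(self, accumulation_steps=1):
        x, err = self._cache
        # Accumulate the averaged gradient of the squared error.
        self.grad += 2.0 * err * x / accumulation_steps
        return {"grad_norm": abs(self.grad)}

    def step(self, max_grad_norm=None):
        grad = self.grad
        if max_grad_norm is not None and abs(grad) > max_grad_norm:
            grad = max_grad_norm if grad > 0 else -max_grad_norm
        self.weight -= self.lr * grad

    def zero_grad(self):
        self.grad = 0.0


session = ToySession()
batch = [(1.0, 2.0), (2.0, 4.0)]  # targets follow y = 2x
losses = []
for _ in range(50):
    total = 0.0
    for x, y in batch:                       # two micro-batches per optimizer step
        total += session.forward(x, y)["loss"]
        session.backward(accumulation_steps=len(batch))
    session.step(max_grad_norm=1.0)          # clip, then update
    session.zero_grad()                      # clear before the next iteration
    losses.append(total / len(batch))
```

Skipping `zero_grad()` would carry gradients from one iteration into the next, which is why the description above calls it required between independent iterations.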

Distributed Tinker & the renderer registry

Step-level ops work over DDP, FSDP hybrid-shard, and DeepSpeed ZeRO-1 / ZeRO-2. ZeRO-3 is rejected at submit time — per-step gradient access is incompatible with its parameter sharding.

The renderer registry exposes a per-template extension flag, so KV-cache-friendly tokenization works for reasoning models out of the box — the precondition for cost-feasible multi-turn agentic RL.

Bud DiLoCo

Distributed training over commodity Ethernet — configurable, validated up front.

Inner AdamW per node, then an outer Nesterov SGD synchronises a pseudo-gradient. The validator rejects incompatible configurations at submit time, and the bandwidth-savings simulator predicts savings before a job runs.

Distributed strategy	Status	Notes
DDP	Compatible	Inner-loop DDP within each island; outer-loop DiLoCo across islands. Most common.
FSDP hybrid_shard	Compatible	Shards parameters within each island; DiLoCo syncs across islands. Supports very large models.
FSDP full_shard	Not compatible	Full sharding across all ranks conflicts with the per-island inner loop.
DeepSpeed ZeRO-1 / ZeRO-2	Compatible	Optimizer-state (and gradient) sharding within each island is supported.
DeepSpeed ZeRO-3	Not compatible	Parameter sharding across all ranks is incompatible. Validator rejects.
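Submit-time validation against the table above amounts to a lookup that fails fast. The strategy keys, the exception class and the function name below are hypothetical; only the compatible and not-compatible outcomes are taken from the table.

```python
# Compatibility outcomes copied from the table; the key spellings are assumed.
DILOCO_COMPATIBLE = {
    "ddp": True,
    "fsdp_hybrid_shard": True,
    "fsdp_full_shard": False,
    "deepspeed_zero1": True,
    "deepspeed_zero2": True,
    "deepspeed_zero3": False,
}


class IncompatibleStrategyError(ValueError):
    """Raised before any GPU time is spent on an unsupported combination."""


def validate_diloco_strategy(strategy):
    key = strategy.lower()
    if key not in DILOCO_COMPATIBLE:
        raise IncompatibleStrategyError(f"unknown strategy: {strategy!r}")
    if not DILOCO_COMPATIBLE[key]:
        raise IncompatibleStrategyError(
            f"{strategy!r} shards parameters across all ranks and cannot be "
            "combined with a per-island inner loop"
        )
    return key


accepted = validate_diloco_strategy("DDP")
try:
    validate_diloco_strategy("deepspeed_zero3")
    rejected = False
except IncompatibleStrategyError:
    rejected = True
```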

Key parameters

diloco_inner_steps (default 100) · diloco_outer_lr (default 0.7) · diloco_outer_momentum (Nesterov, default 0.9) · diloco_num_islands · diloco_pseudo_gradient_dtype (fp16/bf16/fp32) — with sensible defaults that make it usable without deep tuning.
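How these parameters interact can be sketched on a toy problem. Each island below minimizes its own quadratic with plain SGD, swapped in for AdamW to keep the example short, and the outer step applies Nesterov momentum to the pseudo-gradient (the global parameters minus the island average). The function names and the toy objective are assumptions; the defaults 100, 0.7 and 0.9 are the ones listed above.

```python
def inner_loop(params, target, steps, lr=0.05):
    """Local training on one island: plain SGD on sum((p - target)^2)."""
    local = list(params)
    for _ in range(steps):
        local = [p - lr * 2.0 * (p - t) for p, t in zip(local, target)]
    return local


def outer_step(global_params, island_params, momentum_buf,
               outer_lr=0.7, outer_momentum=0.9):
    """Nesterov SGD on the pseudo-gradient averaged over islands."""
    n = len(island_params)
    pseudo_grad = [
        g - sum(island[i] for island in island_params) / n
        for i, g in enumerate(global_params)
    ]
    new_buf = [outer_momentum * b + d for b, d in zip(momentum_buf, pseudo_grad)]
    update = [d + outer_momentum * b for d, b in zip(pseudo_grad, new_buf)]
    new_global = [g - outer_lr * u for g, u in zip(global_params, update)]
    return new_global, new_buf


island_targets = [[1.0, -1.0], [3.0, 1.0]]   # each island sees different data
global_params = [0.0, 0.0]
momentum_buf = [0.0, 0.0]

for _ in range(20):                           # 20 outer rounds
    islands = [inner_loop(global_params, t, steps=100) for t in island_targets]
    global_params, momentum_buf = outer_step(global_params, islands, momentum_buf)
```

The global parameters settle near the average of the island optima, [2.0, 0.0]. With 100 inner steps per round, islands exchange parameters once per 100 local steps rather than after every step, which is where the bandwidth saving over commodity Ethernet comes from.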
