Six stages, seven fine-tuning methods, nine quantization formats and ten optimizers — plus the eight step-level primitives that make the training loop a first-class citizen.
Stages can be chained with full lineage preserved: continued pre-training → SFT → preference optimization, all on the same platform.
Continued pre-training: Causal-LM loss on token-streaming corpora to inject domain vocabulary and knowledge before instruction tuning.
SFT: Instruction-response pairs with completion-only loss masking. Chat templates are auto-detected per model family.
Train a reward model on preference data — a critic for PPO and a verifier in evaluation.
Proximal Policy Optimization with clipped gradients, group-relative advantage centering and KL-penalty stabilization.
Direct Preference Optimization with sigmoid, hinge, IPO, KTO-pair, ORPO and SimPO variants — no separate reward model.
Kahneman–Tversky Optimization for imbalanced positive/negative examples. Useful for safety alignment.
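To make the sigmoid variant of Direct Preference Optimization concrete, here is a minimal sketch of the standard per-pair loss. It assumes the four inputs are summed sequence log-probabilities from the policy and a frozen reference model; the function name and the `beta=0.1` default are illustrative, not the platform's API.

```python
import math

def dpo_sigmoid_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one preference pair (illustrative helper).

    Inputs are sequence log-probabilities under the policy (pi_*) and the
    frozen reference model (ref_*).
    """
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # this form avoids overflow for large |margin|.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# A policy identical to the reference gives log(2); a policy that prefers the
# chosen answer more strongly than the reference gives a smaller loss.
neutral = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_sigmoid_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model appears anywhere in the computation: the reference log-probabilities play that role implicitly.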
Method auto-tuning cascades from full fine-tuning down to QLoRA based on available VRAM.
Full fine-tuning: All parameters trainable. Use when the GPU budget allows.
LoRA: Low-rank adapters, rank 1–256 (default 32). Targets all-linear or attention-only.
QLoRA: 4-bit base model + LoRA adapters. Fits a 70B model on a single 80 GB GPU.
DoRA: Decomposes weight updates into magnitude and direction. Often better quality than LoRA at similar rank.
LoRA+: Different learning rates for A and B matrices. Outperforms LoRA on certain workloads.
OFT: Constrains updates to orthogonal transformations, preserving base-model behaviour while adapting.
Layer freezing: Train only the top N layers. Fast, low-memory, classical baseline.
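One way to picture the auto-tuning cascade is a first-fit check against a memory budget. The sketch below is a hypothetical heuristic, not the platform's actual selection logic: the bytes-per-parameter figures are rough assumptions that cover weights, gradients and optimizer state only, and they ignore activations and adapter parameters.

```python
# Rough bytes-per-parameter estimates (assumptions, not measured values):
#   full  : bf16 weights (2) + bf16 grads (2) + fp32 master weights and Adam moments (12)
#   lora  : frozen bf16 base weights (2)
#   qlora : frozen 4-bit base weights (0.5)
BYTES_PER_PARAM = [("full", 16.0), ("lora", 2.0), ("qlora", 0.5)]

def pick_method(num_params, vram_gb, headroom=0.8):
    """Return the first method in the cascade that fits, or None if none does.

    headroom reserves a fraction of VRAM for activations and buffers
    (the 0.8 default is an arbitrary illustrative choice).
    """
    budget_bytes = vram_gb * 1e9 * headroom
    for method, bytes_per_param in BYTES_PER_PARAM:
        if num_params * bytes_per_param <= budget_bytes:
            return method
    return None

choice_70b = pick_method(70e9, 80)   # 35 GB of 4-bit weights fits; 140 GB of bf16 does not
choice_7b = pick_method(7e9, 24)
choice_1b = pick_method(1e9, 80)
```

Under these assumptions a 70B model on one 80 GB GPU lands on QLoRA, which matches the claim in the list above.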
Quantization is what makes commodity hardware viable for serious training. Optimizers span standard, memory-efficient, advanced and research-derived choices.
Model, optimizer, scheduler, gradient-scaler, CPU + CUDA RNG state, gradient-accumulation count and the deferred-unscale flag are all preserved — so a paused job resumes bit-identically. No silent divergence. Critical for long runs and shared, capacity-constrained clusters.
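Why RNG state belongs in the checkpoint can be shown with the standard library alone. This is a minimal sketch using Python's `random` module as a stand-in for the training job's random sources; a real PyTorch job would additionally capture `torch.get_rng_state()` and `torch.cuda.get_rng_state_all()`. The dictionary keys are illustrative.

```python
import pickle
import random

def noisy_steps(rng, n):
    """Stand-in for n training steps that consume randomness (dropout, shuffling)."""
    return [rng.random() for _ in range(n)]

rng = random.Random(1234)
first_half = noisy_steps(rng, 3)

# Pause: serialize the generator state together with the step counter.
checkpoint = pickle.dumps({"rng_state": rng.getstate(), "step": 3})

# What an uninterrupted run would have produced next.
uninterrupted = noisy_steps(rng, 3)

# Resume in a fresh generator and restore the saved state.
restored = pickle.loads(checkpoint)
resumed_rng = random.Random()
resumed_rng.setstate(restored["rng_state"])
resumed = noisy_steps(resumed_rng, 3)
```

The resumed sequence is identical to the uninterrupted one. Dropping `rng_state` from the checkpoint would let the two runs diverge from the first resumed step.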
Mixed-precision in bf16 / fp16 / fp32 with automatic fallback when hardware lacks the requested format.
Gradient checkpointing and accumulation, both validated against the chosen optimizer.
Flash Attention 2 and 3 with automatic version selection.
Custom CUDA kernels via Liger Kernel for accelerated forward and backward passes.
FP8 training with three backends (auto, MS-AMP, torchao) — the right backend per hardware generation.
Ten LR scheduler variants, including cosine, linear, constant, cosine-with-restarts, polynomial, reduce-on-plateau, WSD and inverse-sqrt.
Custom GradScaler at scale 1024 for stable mixed-precision on long sequences.
Four-tier checkpoint persistence (GPU → CPU → disk → object storage) with keep-best-N, SHA-256 checksums and validation gates.
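The keep-best-N and checksum ideas from the last item can be sketched in a few lines. The helper names and the `eval_loss` field are hypothetical; the sketch assumes lower evaluation loss is better and that a checkpoint is just a byte string.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest recorded alongside a checkpoint when it is written."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected_hex: str) -> bool:
    """Validation gate: refuse to load bytes whose digest does not match."""
    return sha256_hex(data) == expected_hex

def keep_best(checkpoints, n):
    """Retain the n checkpoints with the lowest eval loss."""
    return sorted(checkpoints, key=lambda c: c["eval_loss"])[:n]

payload = b"fake checkpoint bytes"
digest = sha256_hex(payload)

candidates = [
    {"name": "step-100", "eval_loss": 1.90},
    {"name": "step-200", "eval_loss": 1.42},
    {"name": "step-300", "eval_loss": 1.47},
]
kept = keep_best(candidates, 2)
```

A corrupted copy fails `verify` before it can be promoted to a slower tier or restored into a job.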
Each step-level primitive is a REST endpoint, an SDK method, and an MCP-callable tool. All are pure PyTorch with full state preservation, and a single forward-backward call carries the same auth, audit and encryption as a production pipeline.
forward(): Forward pass. Returns logits and hidden states; loss when labels are provided.
backward(): Backward pass with gradient-accumulation control. Returns gradient norm and overflow flag.
step(): Apply optimizer + scheduler step with optional gradient clipping. Mixed-precision aware.
zero_grad(): Clear accumulated gradients. Required between independent training iterations.
generate(): Sampling-based generation returning text, tokens and per-token logprobs. The basis for RL rollouts.
logprobs(): Per-token log-probabilities for target sequences. Used for KL-penalty and importance-sampling.
save(): Full state checkpoint with RNG state preserved, the basis for bit-identical pause/resume.
load(): Restore full training state. Multiple sessions pause and resume without corruption.
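The call order of the first four primitives, including gradient accumulation, can be illustrated with a toy stand-in. `ToySession` below is entirely hypothetical: it is a one-weight model in plain Python, not the platform's SDK, and its argument names (`label`, `accumulation_steps`, `max_grad_norm`) are assumptions chosen for the sketch.

```python
class ToySession:
    """Hypothetical stand-in: model is pred = weight * x, loss = 0.5 * (pred - label)^2."""

    def __init__(self, weight=0.0, lr=0.1):
        self.weight = weight
        self.lr = lr
        self.grad = 0.0
        self._cache = None

    def forward(self, x, label=None):
        pred = self.weight * x
        self._cache = (x, pred, label)
        loss = None if label is None else 0.5 * (pred - label) ** 2
        return {"logits": pred, "loss": loss}

    def backward(self, accumulation_steps=1):
        # Accumulate the scaled gradient of the most recent forward pass.
        x, pred, label = self._cache
        self.grad += (pred - label) * x / accumulation_steps
        return {"grad_norm": abs(self.grad)}

    def step(self, max_grad_norm=None):
        # Optional clipping, then a plain SGD update.
        g = self.grad
        if max_grad_norm is not None and abs(g) > max_grad_norm:
            g = max_grad_norm if g > 0 else -max_grad_norm
        self.weight -= self.lr * g

    def zero_grad(self):
        self.grad = 0.0

session = ToySession(weight=0.0, lr=0.5)
micro_batches = [(1.0, 2.0), (2.0, 4.0)]
for x, y in micro_batches:
    session.forward(x, label=y)
    session.backward(accumulation_steps=len(micro_batches))
session.step(max_grad_norm=10.0)   # one optimizer step for the whole accumulated batch
session.zero_grad()                # clear before the next independent iteration
```

The point of the sketch is the ordering: several forward/backward pairs feed one `step()`, and `zero_grad()` separates iterations.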
Step-level ops work over DDP, FSDP hybrid-shard, and DeepSpeed ZeRO-1 / ZeRO-2. ZeRO-3 is rejected at submit time — per-step gradient access is incompatible with its parameter sharding.
The renderer registry exposes a per-template extension flag, so KV-cache-friendly tokenization works for reasoning models out of the box — the precondition for cost-feasible multi-turn agentic RL.
DiLoCo: inner AdamW per node, then an outer Nesterov SGD step synchronises a pseudo-gradient. The validator rejects incompatible configurations at submit time, and the bandwidth-savings simulator predicts savings before a job runs.
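The outer synchronisation described above can be sketched on plain lists of floats. This is a minimal illustration under stated assumptions: the pseudo-gradient is taken as the global parameters minus the mean of the islands' locally trained parameters, and the Nesterov update follows the convention used by PyTorch's SGD. The function name is hypothetical; the `outer_lr` and `outer_momentum` defaults mirror the values listed for `diloco_outer_lr` and `diloco_outer_momentum`.

```python
def diloco_outer_step(global_params, island_params, momentum_buf,
                      outer_lr=0.7, outer_momentum=0.9):
    """One outer step: average the islands, form a pseudo-gradient, apply Nesterov SGD.

    global_params : list of floats shared by all islands before the inner loop
    island_params : one list of floats per island, after its inner steps
    momentum_buf  : outer momentum buffer, same length as global_params
    """
    n = len(island_params)
    new_params, new_buf = [], []
    for i, theta in enumerate(global_params):
        island_mean = sum(p[i] for p in island_params) / n
        pseudo_grad = theta - island_mean
        buf = outer_momentum * momentum_buf[i] + pseudo_grad
        update = pseudo_grad + outer_momentum * buf   # Nesterov look-ahead
        new_params.append(theta - outer_lr * update)
        new_buf.append(buf)
    return new_params, new_buf

start = [1.0, 2.0]
islands = [[0.0, 2.0], [1.0, 2.0]]          # two islands after their inner loops
params, buf = diloco_outer_step(start, islands, [0.0, 0.0])

# With outer_lr=1 and no momentum the step reduces to plain parameter averaging.
averaged, _ = diloco_outer_step(start, islands, [0.0, 0.0],
                                outer_lr=1.0, outer_momentum=0.0)
```

Only the pseudo-gradient crosses island boundaries, once per outer step, which is where the bandwidth saving comes from.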
| Distributed strategy | Status | Notes |
|---|---|---|
| DDP | Compatible | Inner-loop DDP within each island; outer-loop DiLoCo across islands. Most common. |
| FSDP hybrid_shard | Compatible | Shards parameters within each island; DiLoCo syncs across islands. Supports very large models. |
| FSDP full_shard | Not compatible | Full sharding across all ranks conflicts with the per-island inner loop. |
| DeepSpeed ZeRO-1 / ZeRO-2 | Compatible | Optimizer-state (and gradient) sharding within each island is supported. |
| DeepSpeed ZeRO-3 | Not compatible | Parameter sharding across all ranks is incompatible. Validator rejects. |
diloco_inner_steps (default 100) · diloco_outer_lr (default 0.7) · diloco_outer_momentum (Nesterov, default 0.9) · diloco_num_islands · diloco_pseudo_gradient_dtype (fp16/bf16/fp32) — with sensible defaults that make it usable without deep tuning.
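The parameters and the compatibility table can be combined into a submit-time check along these lines. The sketch is hypothetical: the `DilocoConfig` class, the `strategy` field and its string keys are invented for illustration, and the `bf16` default for `diloco_pseudo_gradient_dtype` is an assumption, since no default is stated for it.

```python
from dataclasses import dataclass

# Strategy keys are illustrative names for the rows of the table above.
COMPATIBLE_STRATEGIES = {"ddp", "fsdp_hybrid_shard", "zero1", "zero2"}
INCOMPATIBLE_STRATEGIES = {"fsdp_full_shard", "zero3"}
PSEUDO_GRADIENT_DTYPES = {"fp16", "bf16", "fp32"}

@dataclass
class DilocoConfig:
    diloco_num_islands: int
    diloco_inner_steps: int = 100
    diloco_outer_lr: float = 0.7
    diloco_outer_momentum: float = 0.9
    diloco_pseudo_gradient_dtype: str = "bf16"   # assumed default
    strategy: str = "ddp"                        # hypothetical field

def validate(cfg):
    """Return a list of human-readable errors; an empty list means the config passes."""
    errors = []
    if cfg.strategy in INCOMPATIBLE_STRATEGIES:
        errors.append(f"strategy '{cfg.strategy}' is not compatible with DiLoCo")
    elif cfg.strategy not in COMPATIBLE_STRATEGIES:
        errors.append(f"unknown strategy '{cfg.strategy}'")
    if cfg.diloco_num_islands < 2:
        errors.append("diloco_num_islands must be at least 2")
    if cfg.diloco_inner_steps < 1:
        errors.append("diloco_inner_steps must be positive")
    if cfg.diloco_pseudo_gradient_dtype not in PSEUDO_GRADIENT_DTYPES:
        errors.append("diloco_pseudo_gradient_dtype must be fp16, bf16 or fp32")
    return errors

ok_errors = validate(DilocoConfig(diloco_num_islands=4))
zero3_errors = validate(DilocoConfig(diloco_num_islands=4, strategy="zero3"))
```

Collecting all errors instead of raising on the first one lets a rejected submission report everything that needs fixing in a single pass.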