Bud Model Foundry

The sovereign training platform for the agentic enterprise.

Build, fine-tune, post-train and agentic-train open models on your own infrastructure — on the GPUs you already own, with research-grade control and production-grade operations.

At a glance

Bud Model Foundry, in numbers.

The headline figures behind the platform — capability surface, performance commitments, and operational breadth.

118+

Supported open-weight models

Up to 500×

Inter-node bandwidth reduction via DiLoCo

350+

Platform REST endpoints

4

GPU vendors: NVIDIA, AMD, Qualcomm, Intel

6

Training stages: PT, SFT, RM, PPO, DPO, KTO

9

Quantization formats: BNB, GPTQ, AWQ, AQLM, FP8 and more

10

Graders, including LLM-as-judge and tool-call

260+

Data operators across text, image, audio, video, code

Design commitments

Five non-negotiables, woven through every layer.

Each commitment maps to a specific failure of the existing market — and each one is a feature of the platform from the foundation up, not a checkbox added in a release note.

Sovereignty by deployment

One-command on-premise install. No outbound dependency. AES-256-GCM encryption at rest. Air-gapped operation as a first-class deployment pattern.

Multi-vendor by design

NVIDIA, AMD, Qualcomm and Intel GPUs. PCIe form factors as first-class hardware. Mixed-vendor fleets supported within a single training job.

Agentic-first by purpose

Three RL training modes, four built-in environments, ten graders, five recipes, and a teaching-metaphor API for non-researcher operators.

End-to-end by scope

Data prep, training, RL, OpenAI-compatible serving, model registry with lineage, drift detection, feedback collection — in one platform, one auth surface, one audit log.

Production-grade from day one

API key authentication, OAuth/OIDC, RBAC with model-level policies, atomic quotas, rate limiting, structured audit, Prometheus metrics. Engineered as a platform.
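
The production-grade commitment above names rate limiting and atomic quotas without describing how they are built. As a minimal sketch of one common approach, and not the platform's actual implementation, a token bucket guarded by a lock makes the check-and-spend step atomic within a single process. The class name, its parameters, and the `now` override are all hypothetical, chosen only for illustration.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """Hypothetical limiter: refills `rate` tokens per second, up to `capacity`."""

    rate: float
    capacity: float
    tokens: float = field(init=False)
    updated: float = field(init=False)
    _lock: threading.Lock = field(init=False, repr=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity      # start with a full bucket
        self.updated = time.monotonic()  # monotonic clock ignores wall-clock jumps
        self._lock = threading.Lock()

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Spend `cost` tokens and return True if the budget allows it."""
        with self._lock:  # refill, check, and spend happen as one atomic step
            now = time.monotonic() if now is None else now
            elapsed = max(0.0, now - self.updated)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False
```

A bucket created with `TokenBucket(rate=1.0, capacity=2.0)` admits two immediate requests, rejects a third, and admits another once enough time has passed to refill a token. A multi-node deployment would need the same check-and-spend logic in a shared store rather than in process memory.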

What you can do with it

End-to-end training workflows, all inside your perimeter.

The concrete things your team can run on day one — not as separate tools stitched together, but as first-class workflows on a single platform.

i.

Fine-tune open models on domain data

Take a 7B–70B model and apply Full FT, LoRA, QLoRA, DoRA, LoRA+, OFT or top-N freeze through one configuration surface.

ii.

Continue pre-training on proprietary corpora

Inject domain vocabulary and knowledge into a base model with causal-language-modelling loss before instruction tuning.

iii.

Train custom reward models

Train a separate reward model on your preference data, then run online RLHF or DPO to align a base model to your standards.

iv.

Train agents end-to-end with reinforcement learning

Run RL against your own tools, APIs, and environments — with verifiable rewards, LLM-as-judge graders, or hybrid combinations.

v.

Run improvement sessions on deployed models

Take a deployed model, evaluate against a curriculum, identify weaknesses, and post-train to address them — through the Simplified ART API.

vi.

Curate large training corpora

Process and filter datasets with 260+ data operators, using distributed Ray-based execution with full reproducibility.

vii.

Serve fine-tuned models

OpenAI-compatible endpoints with multi-tenant adapter routing — hundreds of custom adapters from a single base model.

viii.

Track everything in a model registry

Full lineage from dataset to checkpoint, with statistical drift detection across five algorithms once the model is in production.
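
The fine-tuning workflow above lists LoRA alongside full fine-tuning. To make the difference concrete, here is a small worked calculation based on the standard LoRA formulation, in which a frozen weight matrix of shape d_out × d_in is adapted by two low-rank factors. The hidden size of 4096 and the rank of 16 are illustrative assumptions, not figures from the platform.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


hidden = 4096  # assumed projection size, for illustration only
rank = 16      # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)        # 16,777,216
lora = lora_params(hidden, hidden, rank)  # 131,072
fraction = lora / full                    # 0.0078125, i.e. 1/128 of the matrix
```

Under these assumptions each adapted matrix trains under one percent of the parameters that full fine-tuning would update, which is why adapter methods fit on smaller GPUs.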

Capability map

Twelve capability pillars, one platform.

Each pillar addresses a specific operational need. Together they constitute a single platform with one authentication surface, one audit log, and one observability layer.

Bud Tinker

Step-level training control. Eight primitives exposed as REST endpoints with full state preservation.

Bud RL Engine

Three training modes, eight loss functions, four environments. The substrate for agentic training.

Simplified ART

Teaching-metaphor API. Student, Coach, Curriculum, Grader — agentic training for non-researchers.

Bud DiLoCo

Bandwidth-efficient distributed training. 100×–500× less inter-node bandwidth.

Training Core

Six stages, seven methods, nine quantization formats, ten optimizer families.

Data Pipeline

260+ operators across modalities. Distributed Ray execution. Reproducible versioning.

Inference Engine

OpenAI-compatible serving with multi-tenant LoRA. Hundreds of adapters per base model.

Model Registry

PostgreSQL-backed registry with semantic versioning, lineage, aliases, and audit.

Bud Simulator

Pre-flight memory, time, cost, and hardware-fit prediction across 20+ GPU types.

Production Ops

Auth, RBAC, encryption, quotas, rate limits, audit, eight-priority graceful shutdown.

Developer Experience

Five interfaces — Python SDK, REST, dashboard, server TUI, MCP server.

Drift & Feedback

Five drift algorithms. Production signals converted to SFT and DPO training data.
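
The drift pillar is described elsewhere on this page as covering PSI, KL, JS, KS and chi-square. As a rough sketch of what three of those measures compute over binned distributions, the functions below use the textbook definitions. They are not the platform's code, and the function names and the smoothing constant are assumptions made for this example.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Turn bin counts into probabilities, smoothing so no bin is exactly zero."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def psi(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Population Stability Index: sum of (actual - expected) * ln(actual / expected)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


reference = normalize([50, 30, 20])  # binned feature values at training time
current = normalize([20, 30, 50])    # the same bins observed in production
drift = {
    "psi": psi(reference, current),
    "kl": kl_divergence(reference, current),
    "js": js_divergence(reference, current),
}
```

All three scores are zero when the two distributions match and grow as production traffic moves away from the reference. Alert thresholds are a policy choice and are not specified in the source.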

Five interfaces

One platform, five ways to drive it.

Same auth, same audit log, same governance — whichever interface you choose. Pick the one tuned to your persona.

Python SDK

Researchers · ML engineers

Sync and async clients with feature parity. Fluent builders for training, LoRA, DiLoCo, QLoRA configs.

REST API

Platform integrators

350+ endpoints with OpenAPI spec. Idempotency keys, webhooks, WebSocket subscriptions for live metrics.

Web Dashboard

Operators · managers · SMEs

35-page Next.js GUI with progressive disclosure. Visual data-pipeline DAG editor and Tinker Lab.

Server TUI

Site-reliability engineers

Textual-based terminal UI in any SSH session. Service health, GPU gauges, log tailing, air-gapped friendly.

MCP Server

Autonomous agents

Training capabilities exposed as MCP tools. Agents drive their own improvement loops with full audit governance.
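
The REST interface above mentions idempotency keys. The sketch below shows how a client might attach one so that a retried submission is not executed twice. It only assembles the request and sends nothing. The endpoint path, the payload fields, and the `Idempotency-Key` header name are assumptions for illustration and are not taken from the platform's OpenAPI spec.

```python
import hashlib
import json
from typing import Any, Dict


def build_job_request(base_url: str, api_key: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a POST request whose idempotency key is derived from the payload."""
    # Canonical JSON (sorted keys, no extra whitespace) so equal payloads hash equally.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha256(body.encode("utf-8")).hexdigest()[:32]
    return {
        "method": "POST",
        # Hypothetical path: the real route is defined by the platform's API spec.
        "url": f"{base_url.rstrip('/')}/v1/training/jobs",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Idempotency-Key": key,  # assumed header name
        },
        "body": body,
    }


request = build_job_request(
    "https://foundry.example.internal",  # placeholder host
    "example-api-key",                   # placeholder credential
    {"base_model": "example-7b", "method": "lora"},
)
```

Deriving the key from the payload keeps retries deterministic, but it also means two deliberately identical submissions would collapse into one. A client that needs to submit the same job twice would generate a fresh random key for each logical operation instead.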

Architecture & performance

Engineered for graceful degradation and operational independence.

Seven layers, each independently scalable. The core training and RL capabilities run on pure PyTorch and stay available even when optional high-level components are not. Each layer is monitored, secured and upgraded on its own schedule.

3×

Inference

Tokens-per-second throughput versus baseline on supported hardware.

<100ms

Time to first token

High-throughput serving with paged attention and continuous batching.

100–500×

DiLoCo

Inter-node bandwidth reduction. Up to 4,800× with int4 + adapter sync.

<5s

Drift detection

Per-batch latency across PSI, KL, JS, KS and chi-square.
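
The DiLoCo figures above follow from a simple idea in the published DiLoCo method: workers train locally for many inner steps and exchange updates once per round, instead of exchanging gradients on every step. The function below is a back-of-the-envelope sketch of that ratio. Its parameter names are hypothetical, and it does not attempt to reproduce how the 4,800× figure is composed, since the source does not break that number down.

```python
def bandwidth_reduction(
    inner_steps: int,
    baseline_bits: int = 16,
    sync_bits: int = 16,
    synced_fraction: float = 1.0,
) -> float:
    """Estimated ratio of every-step sync traffic to periodic sync traffic.

    inner_steps: local steps between synchronisations
    baseline_bits: bits per value when syncing every step
    sync_bits: bits per value in the periodic sync (lower if quantized)
    synced_fraction: share of parameters exchanged (below 1.0 for adapter-only sync)
    """
    if inner_steps < 1 or sync_bits < 1 or not 0.0 < synced_fraction <= 1.0:
        raise ValueError("inner_steps and sync_bits must be >= 1; synced_fraction in (0, 1]")
    return inner_steps * (baseline_bits / sync_bits) / synced_fraction


plain = bandwidth_reduction(inner_steps=500)                   # 500.0
quantized = bandwidth_reduction(inner_steps=500, sync_bits=4)  # 2000.0
```

With 100 to 500 inner steps and no other change, the estimate lands in the 100–500× range quoted above. Quantizing the exchanged values or syncing only adapter weights multiplies the saving further.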

Layer 7

Consumption

SDK · REST · Dashboard · TUI · MCP · OpenAI clients

Layer 6

Gateway

FastAPI middleware: request-ID, idempotency, rate-limit, auth, RBAC, CORS

Layer 5

Execution

Celery workers · in-process pipelines · background schedulers

Layer 4

Core engines

Bud Tinker · Training Pipelines · RL Engine · Simplified ART · DiLoCo

Layer 3

Platform subsystems

Data Pipeline · Inference Engine · Model Registry · Drift · Feedback

Layer 2

Cross-cutting services

Auth · encryption · audit · cost tracking · notifications · idempotency

Layer 1

Persistence

PostgreSQL · Redis · MinIO/S3 · external IdP

How it compares

The market splits into five archetypes.
Bud Model Foundry sits in the fifth.

An honest, capability-by-capability comparison against the four major training-platform archetypes. The full matrix and head-to-head positioning live on the comparison page.

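Requirement · Hosted · Hyperscaler · DIY OSS · Bud Foundry

Sovereignty / data residency
Fails · Partial · Pass

Predictable cost at scale
Hosted: Per-token · Hyperscaler: GPU-hour + egress · DIY OSS: CapEx · Bud Foundry: License

Multi-vendor GPU support
Hosted: Single · Hyperscaler: Provider catalog · DIY OSS: DIY · Bud Foundry: 4 vendors

Agentic RL stack built-in
Hosted: Limited · Hyperscaler: Partial · DIY OSS: DIY (6–12 mo) · Bud Foundry: In-box

Time to first production job
Hosted: Days · Hyperscaler: Weeks · DIY OSS: 6–12 months · Bud Foundry: Days

Air-gapped deployment
Possible · First-class

Lifecycle scope
Hosted: Training only · Hyperscaler: Provider stack · DIY OSS: Possible · Bud Foundry: Full lifecycle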

Who it's for

Six audiences, six different ways to win.

The platform speaks differently to each audience. Pick the one that matches your organisation and read the use case in your register.

BFSI

Banking, Financial Services & Insurance

Compliance copilots, fraud-detection reasoning, customer service agents, loan-origination assistants — on-premise with full audit.

Healthcare

Healthcare & Life Sciences

Clinical reasoning agents, radiology assistants, drug-discovery models, federated training across hospital consortia.

Defence & Government

Sovereign-AI & air-gapped programmes

Intelligence-analysis agents, citizen-service multilingual agents, cyber-defence reasoning — in air-gapped environments.

Cloud & Telco

Tier-2 / Tier-3 Cloud Providers

Sovereign AI Platform-as-a-Service, vertical-specialised AI services, multi-tenant fine-tuning, cost-leadership AI.

Existing GPU CapEx

Enterprises with existing GPU investments

Production training on PCIe clusters with auto-configuration. Multi-node distributed training over commodity Ethernet via DiLoCo.

Research & agentic teams

Research-led & agentic-AI product teams

Step-level training control, custom loss functions, custom RL environment registration — the research stack with production controls.

Deployment options

Deployable into any environment you operate.

No hosted dependency. No required outbound connection. No telemetry leaving your perimeter. Pick the pattern that matches the context.

Single-node Docker Compose

Pilot · development · single-team production

Eight services in containers (API, worker, frontend, PostgreSQL, Redis, MinIO, identity provider, RSA key bootstrap). Deploy in 30 minutes via the bud-install command-line tool.

Kubernetes via Helm

Multi-team production · sovereign cloud

Production-grade with horizontal scaling. Helm chart with HPA, PDB, network policies, persistent volumes, four conditional Bitnami subcharts. Standard K8s liveness and readiness probes.

Air-gapped on-premise

Maximum sovereignty · defence · classified

Same Helm chart with offline image staging. All artefacts pre-positioned, container registry mirrored, no outbound dependency. The default deployment pattern for sovereign-AI mandates.

Engagement models

Three ways to engage with the platform.

Procure the way your organisation prefers. Each model meets a different operating posture.

Model 01

Self-managed software license

Annual or multi-year software license. Bud delivers the software, documentation, and support. Your team handles deployment and operations end-to-end.

Model 02

Managed deployment

Software license plus a managed-services engagement. The Bud team handles deployment, configuration, upgrades and day-2 operations alongside your team.

Model 03

Strategic partnership

Multi-year partnership combining Bud Model Foundry, AI Foundry and other Ecosystem components, with co-engineered solutions for your specific use cases.

Build the AI you actually need.

Run on the GPUs you actually own. Train inside the perimeter your governance team requires. Deploy with production-grade authentication, audit, encryption and operations on day one.
