A seven-layer architecture, production-grade serving and lifecycle, a security model built for regulated workloads, and deployment from a single-node pilot to a fully air-gapped estate.
The inference engine ships in the same platform as training — same auth, same audit log, same registry.
Drop-in /v1/chat/completions by changing the base URL. Continuous batching, paged attention, tensor + pipeline parallelism, six parameter presets and token streaming.
Hundreds of adapters from a single base model with native per-request routing and no cold-start cost — the fixed cost of a 70B base amortised across hundreds of tenants.
PostgreSQL-backed, semantic versioning, full parent_version_id lineage, aliases (@latest, @production, @staging), training metrics and per-model audit trail.
Five algorithms — PSI, KL, Jensen-Shannon, Kolmogorov-Smirnov, chi-square — with severity bucketing, fixed/sliding/tumbling/ADWIN windows, and worst-case aggregation feeding the alert system.
Graceful degradation: core training and RL run on pure PyTorch and stay available even when optional high-level components are not.
| Metric | Target | Mechanism | Status |
|---|---|---|---|
| Inference throughput | 3× baseline | Paged attention + continuous batching | Validated |
| Time to first token | < 100 ms | High-throughput serving engine | Validated |
| DiLoCo bandwidth reduction | 100–500× | Inner AdamW + outer Nesterov SGD | Validated |
| Drift detection latency | < 5 s | PSI · KL · JS · KS · chi-square | Validated |
| Registry version creation | < 1 s | PostgreSQL-backed registry | Validated |
| Per-job memory growth | 0 MB avg | Two-tier GPU memory cleanup | Validated |
The installer auto-detects OS, GPU vendor, driver and runtime, then selects the correct PyTorch wheel automatically.
Security is woven through every layer. All operations are auditable, all data encryptable, all access governable.
No hosted dependency, no required outbound connection, no telemetry leaving your perimeter.
Eight services in containers via bud-install. Pilot, development and single-team production — deploy in 30 minutes.
Production-grade with HPA, PDB, network policies and persistent volumes. Horizontal scaling for multi-team estates.
Same Helm chart with offline image staging. All artefacts pre-positioned, registry mirrored, no outbound dependency.
WebSocket metric streaming, a public Prometheus /metrics endpoint, structured logging with auto-redaction, K8s liveness/readiness probes, and per-job / per-team cost attribution with line-item invoices.
Zombie-job detection, GPU reservation tracking, maintenance mode, eight-priority graceful shutdown, background schedulers for key rotation and recovery, and the server TUI for terminal-based control.