Architecture Deep-Dive

Disaggregated Embedding Architecture

How we achieve 3.6-12x faster inference by disaggregating tokenization from embedding, inspired by Snowflake's Arctic Inference and prefill-decode disaggregation in LLM serving.

12x Max Speedup
90% CPU Bottleneck Eliminated
1000+ Node Scaling

Strategic Benefits

90% Cost Reduction

Eliminate GPU idle time by decoupling tokenization from inference. Same throughput, fraction of the cost.

Linear Scaling

Scale tokenization and inference independently. Add CPU nodes for more requests, GPU nodes for more throughput.

Future-Proof

As GPUs get faster (H100, H200), CPU bottlenecks grow. Disaggregation becomes more valuable over time.

The Problem

Traditional embedding inference wastes 90% of time on CPU operations while GPUs sit idle.

Embedding Inference Timeline: Request ~5ms → Tokenize (BLOCKING) ~50ms → Batch ~10ms → GPU ~5ms
1. Sequential Tokenization

The CPU tokenizes one request at a time while the GPU sits idle waiting for work.

2. Slow Serialization

Protobuf encoding of embedding vectors adds significant latency to each response.

3. GPU Underutilization

A single model instance doesn't saturate GPU compute capacity, especially for small models.

Single-Node Architecture

Disaggregated tokenization with lock-free queues and multi-instance GPU execution.

[Diagram: a tokenization cluster of BudTikTok workers (one per CPU core) feeds a lock-free MPMC token buffer queue of pending batches, which in turn feeds a GPU inference cluster running 2 model instances per GPU]

Tokenization Cluster

  • N workers running in parallel (one per CPU core)
  • Each worker uses BudTikTok with Rayon for SIMD-optimized tokenization
  • 280-24,000x faster than HuggingFace tokenizers
  • Continuously feeds tokens to the buffer queue
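The worker loop above can be sketched as follows. This is a simplified stand-in: `tokenize` is a stub (BudTikTok's actual API is not shown here), and Python threads stand in for the per-core worker processes a real deployment would use.

```python
import queue
import threading

# Stand-in tokenizer: a real worker would call BudTikTok here.
def tokenize(text):
    return [ord(c) % 128 for c in text]

def run_worker(requests, token_queue):
    """One tokenization worker: pull raw text, push token IDs until shutdown."""
    while True:
        text = requests.get()
        if text is None:          # shutdown sentinel
            break
        token_queue.put(tokenize(text))

requests, token_queue = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=run_worker, args=(requests, token_queue))
           for _ in range(4)]     # one worker per CPU core in practice
for w in workers:
    w.start()
for text in ["hello", "disaggregated embedding"]:
    requests.put(text)
for _ in workers:                 # one sentinel per worker
    requests.put(None)
for w in workers:
    w.join()
```

Because the request queue is FIFO, all texts are tokenized before any worker sees its shutdown sentinel.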

Token Buffer Queue

  • Multiple-producer, multiple-consumer (MPMC) queue architecture
  • Token IDs stored as Arrow arrays for zero-copy transfers
  • Token-budget batching: Groups sequences by total token count
  • Longest-first sorting for optimal GPU utilization
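Token-budget batching with longest-first sorting can be sketched as below. This is a simplified, single-threaded version (the real queue is lock-free and concurrent), and `make_batches` is a hypothetical helper name.

```python
def make_batches(sequences, token_budget):
    """Greedy token-budget batching: longest-first, so each batch holds
    sequences of similar length and GPU padding waste stays low."""
    batches, current, used = [], [], 0
    for seq in sorted(sequences, key=len, reverse=True):
        if current and used + len(seq) > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(seq)
        used += len(seq)
    if current:
        batches.append(current)
    return batches

seqs = [[1] * 30, [2] * 10, [3] * 25, [4] * 5]
batches = make_batches(seqs, token_budget=40)
# two batches by sequence length: [30] and [25, 10, 5]
```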

GPU Inference Cluster

  • 2+ model instances per GPU with separate CUDA streams
  • Better utilization for small embedding models (bge-small, all-MiniLM)
  • Round-robin batch assignment across instances
  • Up to 16x throughput (as reported for Snowflake Arctic Inference)
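The round-robin assignment can be sketched with a simple ring dispatcher; the class and instance names here are placeholders (a real instance would wrap a loaded model plus its CUDA stream).

```python
from itertools import cycle

class RoundRobinDispatcher:
    """Hand each incoming batch to the next model instance in ring order."""
    def __init__(self, instances):
        self._ring = cycle(instances)

    def dispatch(self, batch):
        return next(self._ring), batch

# Placeholder instance names standing in for model+stream pairs.
dispatcher = RoundRobinDispatcher(["gpu0/inst0", "gpu0/inst1"])
assignments = [dispatcher.dispatch(b)[0] for b in range(4)]
# alternates: inst0, inst1, inst0, inst1
```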

Multi-Node Scaling

Scale to 1000+ nodes with independent tokenization and GPU pools connected by a global scheduler.

[Diagram: a global scheduler ("Conductor": load-aware routing, token ID caching, auto-scaling, SLO admission control) routes requests to CPU-only tokenization nodes, which feed GPU nodes (8x H100/A100 each, plus head-node CPU tokenization) over a high-speed InfiniBand/RDMA network]

Four Key Optimizations

Each technique contributes multiplicatively to the overall speedup.

2-4x

Tokenization Disaggregation

Snowflake Pattern

Separate tokenization into a parallel pipeline stage. The GPU never waits for tokenization; the token queue always has work ready.

# GPU always busy
while True:
    batch = queue.get_batch()
    embeddings = model(batch)
1.5-2x

Multi-Instance GPU

Arctic Inference

Run 2+ model instances per GPU with separate CUDA streams. Round-robin batch assignment maximizes GPU utilization.

# 2 instances per GPU, each on its own CUDA stream
streams = [torch.cuda.Stream() for _ in range(num_instances)]
models = [load_model().cuda() for _ in range(num_instances)]
1.2-1.5x

Vectorized Serialization

Zero-copy SIMD

Replace Protobuf with raw little-endian bytes. NumPy uses SIMD for the conversion, enabling 3-5x faster response serialization.

# SIMD-optimized
payload = embeddings.astype('<f4').tobytes()
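A round-trip sketch of this serialization path, using a hypothetical batch of 8 embeddings with 384 dimensions (bge-small's output size):

```python
import numpy as np

# Hypothetical batch: 8 embeddings of 384 dims each.
embeddings = np.random.rand(8, 384).astype(np.float32)

# Serialize: cast to little-endian float32 and dump raw bytes (no Protobuf).
payload = embeddings.astype('<f4').tobytes()

# Client side: zero-copy view over the received buffer.
restored = np.frombuffer(payload, dtype='<f4').reshape(8, 384)
assert np.array_equal(embeddings, restored)
```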
3.6-12x

Combined Speedup

Multiplicative Effect

All optimizations work together multiplicatively. Real-world gains depend on workload characteristics and hardware configuration.

(2-4x) × (1.5-2x) × (1.2-1.5x) = 3.6-12x
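The combined range is just the product of the per-technique bounds:

```python
# Per-optimization speedup bounds (low, high), multiplied together.
factors = [(2.0, 4.0),   # tokenization disaggregation
           (1.5, 2.0),   # multi-instance GPU
           (1.2, 1.5)]   # vectorized serialization

low, high = 1.0, 1.0
for lo, hi in factors:
    low, high = low * lo, high * hi
# low rounds to 3.6, high to 12.0
```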

Why Disaggregation Matters More for Faster GPUs

As GPUs get faster, CPU tokenization becomes a larger percentage of total time.

For H100+ GPUs, disaggregation is essential. Tokenization becomes the dominant bottleneck.

GPU           | Pipelining Benefit | Tokenization % of Time
RTX 3080/4080 | ~5%                | 4.5%
A100          | ~19%               | 15.9%
H100          | 47%                | 32.1%
H200          | 94%                | 48.6%
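These benefit figures follow from treating tokenization as fully overlappable, Amdahl-style: if tokenization is a fraction f of end-to-end time, hiding it behind GPU work gives a speedup of 1/(1-f). The check below reproduces the percentages above to within a point.

```python
def pipelining_benefit(f):
    """Speedup gained by fully overlapping a tokenization fraction f."""
    return 1.0 / (1.0 - f) - 1.0

benefits = {gpu: pipelining_benefit(f)
            for gpu, f in [("RTX 3080/4080", 0.045), ("A100", 0.159),
                           ("H100", 0.321), ("H200", 0.486)]}
# e.g. H100: ~0.47, i.e. a 47% pipelining benefit
```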

Scaling Projections

Near-linear scaling efficiency to 1000+ nodes.

Nodes | Theoretical Max | Efficiency | Effective Throughput
1     | 22,903 t/s      | 100%       | 22,903 t/s
10    | 229,030 t/s     | 85%        | 194,676 t/s
100   | 2.29M t/s       | 75%        | 1.72M t/s
1000  | 22.9M t/s       | 65%        | 14.9M t/s
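The Effective Throughput column is simply nodes × per-node rate × efficiency; `effective_throughput` is a hypothetical helper that reproduces it (to rounding).

```python
PER_NODE = 22_903  # single-node throughput in tokens/s, from the table

def effective_throughput(nodes, efficiency):
    """Effective cluster throughput = nodes x per-node rate x efficiency."""
    return nodes * PER_NODE * efficiency

for nodes, eff in [(1, 1.00), (10, 0.85), (100, 0.75), (1000, 0.65)]:
    print(f"{nodes:>5} nodes: {effective_throughput(nodes, eff):,.0f} t/s")
```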

Efficiency Loss Sources

  • Network overhead: Token transfer between nodes
  • Scheduling overhead: Load balancing computation
  • Queue contention: At very high scale

Query Flow

Follow a request as it flows through the disaggregated pipeline.

1. Request Arrives: Text input is received by the global scheduler
2. Load-Aware Routing: The scheduler routes to the least-loaded tokenization worker
3. BudTikTok Tokenization: Parallel SIMD tokenization (280-24,000x faster)
4. Token Buffer Queue: Token IDs are added to the lock-free MPMC queue
5. Token-Budget Batching: The batch assembler groups sequences by total token count
6. GPU Inference: A multi-instance model processes the batch in parallel
7. Vectorized Response: Little-endian byte serialization (SIMD-optimized)
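The steps above can be condensed into one single-threaded sketch. Every component here is a labeled stand-in: `tokenize` substitutes for BudTikTok, `batch_by_budget` for the batch assembler, and `embed` for multi-instance GPU inference.

```python
import numpy as np

def tokenize(text):                        # step 3: stand-in for BudTikTok
    return [ord(c) % 128 for c in text]

def batch_by_budget(seqs, budget):         # step 5: token-budget batching
    batches, cur, used = [], [], 0
    for s in sorted(seqs, key=len, reverse=True):
        if cur and used + len(s) > budget:
            batches.append(cur)
            cur, used = [], 0
        cur.append(s)
        used += len(s)
    if cur:
        batches.append(cur)
    return batches

def embed(batch):                          # step 6: stand-in for GPU inference
    return np.array([[float(sum(s)), float(len(s))] for s in batch],
                    dtype=np.float32)

texts = ["embed me", "a much longer request body", "hi"]   # step 1
token_seqs = [tokenize(t) for t in texts]                  # steps 2-4
responses = []
for batch in batch_by_budget(token_seqs, budget=32):
    embeddings = embed(batch)
    responses.append(embeddings.astype('<f4').tobytes())   # step 7
```

With a 32-token budget the three requests split into two batches, each serialized to one raw-bytes response.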

Ready to Optimize Your Embedding Infrastructure?

See how Latent Bud's disaggregated architecture can transform your inference workloads.