Disaggregated Embedding Architecture
How we achieve 3.6-12x faster inference through tokenization-embedding disaggregation, inspired by Snowflake's Arctic Inference and LLM prefill-decode patterns.
Strategic Benefits
90% Cost Reduction
Eliminate GPU idle time by decoupling tokenization from inference. Same throughput, fraction of the cost.
Linear Scaling
Scale tokenization and inference independently. Add CPU nodes for more requests, GPU nodes for more throughput.
Future-Proof
As GPUs get faster (H100, H200), CPU bottlenecks grow. Disaggregation becomes more valuable over time.
The Problem
Traditional embedding inference can spend roughly 90% of request time on CPU-side work while the GPU sits idle.
Sequential Tokenization
CPU tokenizes one request at a time while GPU sits idle waiting for work.
Slow Serialization
Protobuf encoding of embedding vectors adds significant latency to each response.
GPU Underutilization
Single model instance doesn't saturate GPU compute capacity, especially for small models.
Single-Node Architecture
Disaggregated tokenization with lock-free queues and multi-instance GPU execution.
Tokenization Cluster
- N workers running in parallel (one per CPU core)
- Each worker uses BudTikTok with Rayon for SIMD-optimized tokenization
- 280-24,000x faster than HuggingFace tokenizers
- Continuously feeds tokens to the buffer queue
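The worker pool above can be sketched with Python's standard library as a stand-in; `ThreadPoolExecutor` and the naive `tokenize` below are illustrative placeholders, not the actual BudTikTok/Rayon implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def tokenize(text):
    # Stand-in for BudTikTok's SIMD tokenizer: naive whitespace split
    return text.lower().split()

texts = ["Hello world", "Disaggregated embedding inference"]

# One worker per CPU core; a real deployment would use pinned OS
# processes (or Rayon threads) rather than Python threads
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    token_lists = list(pool.map(tokenize, texts))
```

In the real pipeline each worker would push its output into the token buffer queue instead of collecting results locally.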
Token Buffer Queue
- Multiple Producer, Multiple Consumer queue architecture
- Token IDs stored as Arrow arrays for zero-copy transfers
- Token-budget batching: Groups sequences by total token count
- Longest-first sorting for optimal GPU utilization
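Token-budget batching with longest-first sorting can be sketched as follows; the function name and default budget are illustrative assumptions:

```python
def assemble_batches(sequences, token_budget=8192):
    """Group token sequences so each batch stays under a total-token budget.

    Sorting longest-first keeps batches dense: a long sequence anchors a
    batch and shorter ones fill the remaining budget.
    """
    ordered = sorted(sequences, key=len, reverse=True)
    batches, current, used = [], [], 0
    for seq in ordered:
        # Flush the current batch if this sequence would exceed the budget
        if current and used + len(seq) > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(seq)
        used += len(seq)
    if current:
        batches.append(current)
    return batches
```

Budgeting by total token count (rather than sequence count) keeps GPU memory usage predictable regardless of how long individual requests are.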
GPU Inference Cluster
- 2+ model instances per GPU with separate CUDA streams
- Better utilization for small embedding models (bge-small, all-MiniLM)
- Round-robin batch assignment across instances
- Up to 16x throughput, as reported for Snowflake's Arctic Inference
Multi-Node Scaling
Scale to 1000+ nodes with independent tokenization and GPU pools connected by a global scheduler.
Four Key Optimizations
Each technique contributes multiplicatively to the overall speedup.
Tokenization Disaggregation
Snowflake Pattern
Separate tokenization into a parallel pipeline stage so the GPU never waits: the token queue always has work ready.
# GPU stays busy: consume pre-tokenized batches as they arrive
while True:
    batch = queue.get_batch()   # blocks until tokens are ready
    embeddings = model(batch)   # inference never waits on tokenization
Multi-Instance GPU
Arctic Inference
Run 2+ model instances per GPU with separate CUDA streams. Round-robin batch assignment maximizes GPU utilization.
# 2+ model instances per GPU, each bound to its own CUDA stream
instances = []
for i in range(num_instances):
    stream = torch.cuda.Stream()
    model = load_model().cuda()
    instances.append((stream, model))
Vectorized Serialization
Zero-copy SIMD
Replace Protobuf with raw little-endian bytes. NumPy uses SIMD for the conversion, enabling 3-5x faster response serialization.
# SIMD-optimized: cast to little-endian float32, emit raw bytes
payload = embeddings.astype('<f4').tobytes()
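On the consumer side the same bytes map straight back to an array without copying; a minimal round-trip sketch, assuming NumPy and an illustrative 384-dimension model such as bge-small:

```python
import numpy as np

embeddings = np.random.rand(4, 384).astype(np.float32)

# Serialize: cast to little-endian float32, emit raw bytes (no protobuf framing)
payload = embeddings.astype('<f4').tobytes()

# Deserialize: zero-copy view over the buffer, then restore the shape
restored = np.frombuffer(payload, dtype='<f4').reshape(-1, 384)

assert np.array_equal(embeddings, restored)
```

The only metadata the wire format needs is the embedding dimension; everything else is raw vector data.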
Combined Speedup
Multiplicative Effect
All optimizations work together multiplicatively. Real-world gains depend on workload characteristics and hardware configuration.
Why Disaggregation Matters More for Faster GPUs
As GPUs get faster, CPU tokenization becomes a larger percentage of total time.
For H100+ GPUs, disaggregation is essential. Tokenization becomes the dominant bottleneck.
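A back-of-the-envelope model makes the trend concrete; the per-batch timings below are illustrative assumptions, not measurements:

```python
# Assumed per-batch timings in milliseconds (for illustration only)
cpu_tokenize_ms = 10.0
gpu_times = {"RTX 4080": 40.0, "A100": 20.0, "H100": 8.0, "H200": 5.0}

for gpu, gpu_ms in gpu_times.items():
    total = cpu_tokenize_ms + gpu_ms      # sequential pipeline time
    cpu_share = cpu_tokenize_ms / total   # fraction spent tokenizing
    speedup = total / gpu_ms              # gain if tokenization fully overlaps
    print(f"{gpu}: CPU share {cpu_share:.0%}, disaggregation speedup {speedup:.2f}x")
```

With fixed CPU time, a faster GPU shrinks only the denominator, so the tokenization share (and the payoff from overlapping it) grows with every hardware generation.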
[Chart: tokenization's share of total inference time across RTX 3080/4080, A100, H100, and H200]
Scaling Projections
Near-linear scaling efficiency to 1000+ nodes.
| Nodes | Theoretical Max | Efficiency | Effective Throughput |
|---|---|---|---|
| 1 | 22,903 t/s | 100% | 22,903 t/s |
| 10 | 229,030 t/s | 85% | 194,676 t/s |
| 100 | 2.29M t/s | 75% | 1.72M t/s |
| 1000 | 22.9M t/s | 65% | 14.9M t/s |
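The effective-throughput column is simply per-node throughput x node count x efficiency; a quick check using the figures from the table above:

```python
per_node = 22_903  # t/s, single-node figure from the table
rows = [(1, 1.00), (10, 0.85), (100, 0.75), (1000, 0.65)]

for nodes, efficiency in rows:
    effective = per_node * nodes * efficiency
    print(f"{nodes:>5} nodes -> {effective:,.0f} t/s")
```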
Efficiency Loss Sources
- Network overhead: Token transfer between nodes
- Scheduling overhead: Load balancing computation
- Queue contention: At very high scale
Query Flow
Watch a request flow through the disaggregated pipeline.
Request Arrives
Text input received by the global scheduler
Load-Aware Routing
Scheduler routes to least-loaded tokenization worker
BudTikTok Tokenization
Parallel SIMD tokenization (280-24,000x faster)
Token Buffer Queue
Token IDs added to lock-free MPMC queue
Token-Budget Batching
Batch assembler groups by total token count
GPU Inference
Multi-instance model processes batch in parallel
Vectorized Response
Little-endian bytes serialization (SIMD-optimized)
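The flow above can be sketched as a two-stage producer/consumer pipeline; Python's `threading` and `queue.Queue` stand in for the lock-free MPMC queue, and `fake_tokenize`/`gpu_worker` are stand-ins for BudTikTok and the model instances:

```python
import queue
import threading

token_queue = queue.Queue(maxsize=64)  # stand-in for the lock-free MPMC buffer

def fake_tokenize(text):
    # Stand-in for BudTikTok: map each word to a dummy token id
    return [hash(w) % 30_000 for w in text.split()]

def tokenization_worker(requests):
    for text in requests:
        token_queue.put(fake_tokenize(text))   # tokenize, then enqueue
    token_queue.put(None)                      # sentinel: no more work

def gpu_worker(results):
    while True:
        tokens = token_queue.get()             # pull tokens off the queue
        if tokens is None:
            break
        results.append(len(tokens))           # stand-in for model inference

requests = ["embed this sentence", "and this one too"]
results = []
producer = threading.Thread(target=tokenization_worker, args=(requests,))
consumer = threading.Thread(target=gpu_worker, args=(results,))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Because the stages share only the queue, each side scales independently: add producer threads for more request throughput, consumer threads for more inference throughput.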
Ready to Optimize Your Embedding Infrastructure?
See how Latent Bud's disaggregated architecture can transform your inference workloads.