Architecture Deep-Dive

Disaggregated Embedding Architecture

How we achieve 3.6-12x faster inference by disaggregating tokenization from embedding, inspired by Snowflake's Arctic Inference and prefill-decode disaggregation in LLM serving.

12x Max Speedup
90% CPU Bottleneck Eliminated
1000+ Node Scaling

Strategic Benefits

90% Cost Reduction

Eliminate GPU idle time by decoupling tokenization from inference. Same throughput, fraction of the cost.

Linear Scaling

Scale tokenization and inference independently. Add CPU nodes for more requests, GPU nodes for more throughput.

Future-Proof

As GPUs get faster (H100, H200), CPU bottlenecks grow. Disaggregation becomes more valuable over time.

The Problem

Traditional embedding inference wastes 90% of time on CPU operations while GPUs sit idle.

Embedding Inference Timeline: Request ~5ms → Tokenize (BLOCKING) ~50ms → Batch ~10ms → GPU ~5ms
1. Sequential Tokenization

The CPU tokenizes one request at a time while the GPU sits idle waiting for work.

2. Slow Serialization

Protobuf encoding of embedding vectors adds significant latency to each response.

3. GPU Underutilization

A single model instance doesn't saturate GPU compute capacity, especially for small models.

Single-Node Architecture

Disaggregated tokenization with lock-free queues and multi-instance GPU execution.

[Diagram: a tokenization cluster of BudTikTok workers (one per CPU core) feeds a lock-free MPMC token buffer queue of pending batches, which in turn feeds a GPU inference cluster running 2 model instances per GPU]

Tokenization Cluster

  • N workers running in parallel (one per CPU core)
  • Each worker uses BudTikTok with Rayon for SIMD-optimized tokenization
  • 280-24,000x faster than HuggingFace tokenizers
  • Continuously feeds tokens to the buffer queue
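The worker loop above can be sketched as follows. This is a simplified stand-in: `tokenize` is a stub (BudTikTok's actual API is not shown here), and Python threads stand in for the per-core worker processes a real deployment would use.

```python
import queue
import threading

# Stand-in tokenizer: a real worker would call BudTikTok here.
def tokenize(text):
    return [ord(c) % 128 for c in text]

def run_worker(requests, token_queue):
    """One tokenization worker: pull raw text, push token IDs until shutdown."""
    while True:
        text = requests.get()
        if text is None:          # shutdown sentinel
            break
        token_queue.put(tokenize(text))

requests, token_queue = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=run_worker, args=(requests, token_queue))
           for _ in range(4)]     # one worker per CPU core in practice
for w in workers:
    w.start()
for text in ["hello", "disaggregated embedding"]:
    requests.put(text)
for _ in workers:                 # one sentinel per worker
    requests.put(None)
for w in workers:
    w.join()
```

Because the request queue is FIFO, all texts are tokenized before any worker sees its shutdown sentinel.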

Token Buffer Queue

  • Multiple-producer, multiple-consumer (MPMC) queue architecture
  • Token IDs stored as Arrow arrays for zero-copy transfers
  • Token-budget batching: Groups sequences by total token count
  • Longest-first sorting for optimal GPU utilization
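Token-budget batching with longest-first sorting can be sketched as below. This is a simplified, single-threaded version (the real queue is lock-free and concurrent), and `make_batches` is a hypothetical helper name.

```python
def make_batches(sequences, token_budget):
    """Greedy token-budget batching: longest-first, so each batch holds
    sequences of similar length and GPU padding waste stays low."""
    batches, current, used = [], [], 0
    for seq in sorted(sequences, key=len, reverse=True):
        if current and used + len(seq) > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(seq)
        used += len(seq)
    if current:
        batches.append(current)
    return batches

seqs = [[1] * 30, [2] * 10, [3] * 25, [4] * 5]
batches = make_batches(seqs, token_budget=40)
# two batches by sequence length: [30] and [25, 10, 5]
```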

GPU Inference Cluster

  • 2+ model instances per GPU with separate CUDA streams
  • Better utilization for small embedding models (bge-small, all-MiniLM)
  • Round-robin batch assignment across instances
  • Up to 16x throughput (as reported for Snowflake Arctic Inference)
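The round-robin assignment can be sketched with a simple ring dispatcher; the class and instance names here are placeholders (a real instance would wrap a loaded model plus its CUDA stream).

```python
from itertools import cycle

class RoundRobinDispatcher:
    """Hand each incoming batch to the next model instance in ring order."""
    def __init__(self, instances):
        self._ring = cycle(instances)

    def dispatch(self, batch):
        return next(self._ring), batch

# Placeholder instance names standing in for model+stream pairs.
dispatcher = RoundRobinDispatcher(["gpu0/inst0", "gpu0/inst1"])
assignments = [dispatcher.dispatch(b)[0] for b in range(4)]
# alternates: inst0, inst1, inst0, inst1
```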

Multi-Node Scaling

Scale to 1000+ nodes with independent tokenization and GPU pools connected by a global scheduler.

[Diagram: a global scheduler ("Conductor": load-aware routing, token ID caching, auto-scaling, SLO admission control) routes requests to CPU-only tokenization nodes, which feed GPU nodes (8x H100/A100 each, plus head-node CPU tokenization) over a high-speed InfiniBand/RDMA network]

Four Key Optimizations

Each technique contributes multiplicatively to the overall speedup.

2-4x

Tokenization Disaggregation

Snowflake Pattern

Separate tokenization into a parallel pipeline stage. The GPU never waits for tokenization; the token queue always has work ready.

# GPU always busy
while True:
    batch = queue.get_batch()
    embeddings = model(batch)
1.5-2x

Multi-Instance GPU

Arctic Inference

Run 2+ model instances per GPU with separate CUDA streams. Round-robin batch assignment maximizes GPU utilization.

# 2 instances per GPU, each on its own CUDA stream
streams = [torch.cuda.Stream() for _ in range(num_instances)]
models = [load_model().cuda() for _ in range(num_instances)]
1.2-1.5x

Vectorized Serialization

Zero-copy SIMD

Replace Protobuf with raw little-endian bytes. NumPy uses SIMD for the conversion, enabling 3-5x faster response serialization.

# SIMD-optimized
payload = embeddings.astype('<f4').tobytes()
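A round-trip sketch of this serialization path, using a hypothetical batch of 8 embeddings with 384 dimensions (bge-small's output size):

```python
import numpy as np

# Hypothetical batch: 8 embeddings of 384 dims each.
embeddings = np.random.rand(8, 384).astype(np.float32)

# Serialize: cast to little-endian float32 and dump raw bytes (no Protobuf).
payload = embeddings.astype('<f4').tobytes()

# Client side: zero-copy view over the received buffer.
restored = np.frombuffer(payload, dtype='<f4').reshape(8, 384)
assert np.array_equal(embeddings, restored)
```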
3.6-12x

Combined Speedup

Multiplicative Effect

All optimizations work together multiplicatively. Real-world gains depend on workload characteristics and hardware configuration.

(2-4x) × (1.5-2x) × (1.2-1.5x) = 3.6-12x
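The combined range is just the product of the per-technique bounds:

```python
# Per-optimization speedup bounds (low, high), multiplied together.
factors = [(2.0, 4.0),   # tokenization disaggregation
           (1.5, 2.0),   # multi-instance GPU
           (1.2, 1.5)]   # vectorized serialization

low, high = 1.0, 1.0
for lo, hi in factors:
    low, high = low * lo, high * hi
# low rounds to 3.6, high to 12.0
```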

Why Disaggregation Matters More for Faster GPUs

As GPUs get faster, CPU tokenization becomes a larger percentage of total time.

For H100+ GPUs, disaggregation is essential. Tokenization becomes the dominant bottleneck.

GPU           | Pipelining Benefit | Tokenization % of Time
RTX 3080/4080 | ~5%                | 4.5%
A100          | ~19%               | 15.9%
H100          | 47%                | 32.1%
H200          | 94%                | 48.6%
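These benefit figures follow from treating tokenization as fully overlappable, Amdahl-style: if tokenization is a fraction f of end-to-end time, hiding it behind GPU work gives a speedup of 1/(1-f). The check below reproduces the percentages above to within a point.

```python
def pipelining_benefit(f):
    """Speedup gained by fully overlapping a tokenization fraction f."""
    return 1.0 / (1.0 - f) - 1.0

benefits = {gpu: pipelining_benefit(f)
            for gpu, f in [("RTX 3080/4080", 0.045), ("A100", 0.159),
                           ("H100", 0.321), ("H200", 0.486)]}
# e.g. H100: ~0.47, i.e. a 47% pipelining benefit
```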

Scaling Projections

Near-linear scaling efficiency to 1000+ nodes.

Nodes | Theoretical Max | Efficiency | Effective Throughput
1     | 22,903 t/s      | 100%       | 22,903 t/s
10    | 229,030 t/s     | 85%        | 194,676 t/s
100   | 2.29M t/s       | 75%        | 1.72M t/s
1000  | 22.9M t/s       | 65%        | 14.9M t/s
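The Effective Throughput column is simply nodes × per-node rate × efficiency; `effective_throughput` is a hypothetical helper that reproduces it (to rounding).

```python
PER_NODE = 22_903  # single-node throughput in tokens/s, from the table

def effective_throughput(nodes, efficiency):
    """Effective cluster throughput = nodes x per-node rate x efficiency."""
    return nodes * PER_NODE * efficiency

for nodes, eff in [(1, 1.00), (10, 0.85), (100, 0.75), (1000, 0.65)]:
    print(f"{nodes:>5} nodes: {effective_throughput(nodes, eff):,.0f} t/s")
```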

Efficiency Loss Sources

  • Network overhead: Token transfer between nodes
  • Scheduling overhead: Load balancing computation
  • Queue contention: At very high scale

Query Flow

Follow a request as it flows through the disaggregated pipeline.

1. Request Arrives: Text input is received by the global scheduler
2. Load-Aware Routing: The scheduler routes to the least-loaded tokenization worker
3. BudTikTok Tokenization: Parallel SIMD tokenization (280-24,000x faster)
4. Token Buffer Queue: Token IDs are added to the lock-free MPMC queue
5. Token-Budget Batching: The batch assembler groups sequences by total token count
6. GPU Inference: A multi-instance model processes the batch in parallel
7. Vectorized Response: Little-endian byte serialization (SIMD-optimized)
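The steps above can be condensed into one single-threaded sketch. Every component here is a labeled stand-in: `tokenize` substitutes for BudTikTok, `batch_by_budget` for the batch assembler, and `embed` for multi-instance GPU inference.

```python
import numpy as np

def tokenize(text):                        # step 3: stand-in for BudTikTok
    return [ord(c) % 128 for c in text]

def batch_by_budget(seqs, budget):         # step 5: token-budget batching
    batches, cur, used = [], [], 0
    for s in sorted(seqs, key=len, reverse=True):
        if cur and used + len(s) > budget:
            batches.append(cur)
            cur, used = [], 0
        cur.append(s)
        used += len(s)
    if cur:
        batches.append(cur)
    return batches

def embed(batch):                          # step 6: stand-in for GPU inference
    return np.array([[float(sum(s)), float(len(s))] for s in batch],
                    dtype=np.float32)

texts = ["embed me", "a much longer request body", "hi"]   # step 1
token_seqs = [tokenize(t) for t in texts]                  # steps 2-4
responses = []
for batch in batch_by_budget(token_seqs, budget=32):
    embeddings = embed(batch)
    responses.append(embeddings.astype('<f4').tobytes())   # step 7
```

With a 32-token budget the three requests split into two batches, each serialized to one raw-bytes response.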

Ready to Optimize Your Embedding Infrastructure?

See how Latent Bud's disaggregated architecture can transform your inference workloads.