I Built BlazeText — It’s 10X Faster Than HuggingFace’s Tokenizer

Jul 29, 2025 | By Adarsh MS

A few weeks ago, while implementing a guardrail engine, I found myself staring at a performance graph that didn’t make any sense. Guardrail actions like input sanitization, policy enforcement, hallucination checks, bias mitigation, and audit logging each add complexity and latency. Left unchecked, those extra hops can nudge your p95 from tolerable to untenable.

At Bud, we recognized this trade‑off early on and chose to build our own guardrail library, Bud Sentinel, from scratch. I’ve led that library’s development from day one, and my sole mission has been clear: drive down the guardrail‑induced latency to the absolute minimum so that our overall endpoint performance remains fast and fluid.

To hit that goal, I profiled each stage of the guardrail pipeline to pinpoint the highest-cost checks and performed rigorous validation against our accuracy benchmarks. Then we optimized everything: moved the entire codebase from Python to Rust, further improved model performance using Burn, and even reengineered our RegEx and fuzzy pattern matching from the ground up.

And yet, there it was—a glaring 24ms response time for a 32K token context.

Our initial architecture for Bud Sentinel followed industry best practices and incorporated several innovative components: SIMD-optimized pattern matching, a high-speed static embedding that balanced accuracy with low memory usage and latency, semantic classifiers built on top of this embedding, and efficient parallel processing pipelines. Yet, we were seeing a latency of 24ms for a 32K context length—something I found unacceptable.

So, I set out to find what was causing the latency. Initially, I suspected the embedding lookups and their computations. However, detailed profiling uncovered an unexpected culprit.

The Overlooked Problem: Tokenizers Are Still Slow!

Tokenization, a seemingly simple task, was consuming a staggering 90% of that latency. Can you believe it?!

Tokenization is supposed to be the “solved” part of NLP pipelines. It’s just converting text to numbers, right? But the profiler doesn’t lie. Our HuggingFace tokenizer—widely considered one of the best—was taking over 20 milliseconds just to process the text.
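For context, the kind of measurement that exposed this is easy to reproduce. Below is a minimal sketch that times a single encode call with Hugging Face's Rust `tokenizers` crate; the `tokenizer.json` path and the synthetic input are placeholders, not our actual guardrail workload.

```rust
// deps: tokenizers (Hugging Face's Rust tokenizer crate)
use std::time::Instant;
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Any WordPiece tokenizer.json works here; the path is a placeholder.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Synthetic long input standing in for a large guardrail context.
    let text = "The quick brown fox jumps over the lazy dog. ".repeat(2_000);

    let start = Instant::now();
    let encoding = tokenizer.encode(text.as_str(), false)?;
    let elapsed = start.elapsed();

    let n_tokens = encoding.get_ids().len();
    println!(
        "{} tokens in {:.2?} ({:.0} tokens/sec)",
        n_tokens,
        elapsed,
        n_tokens as f64 / elapsed.as_secs_f64()
    );
    Ok(())
}
```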

Digging deeper, I found several issues:

  • Memory allocations were happening all over the place
  • The WordPiece algorithm was doing redundant work
  • Unicode operations were particularly slow
  • Normalization routines were more complex than necessary

What Needed to Change

Once I identified the issues—such as underoptimized implementations of the tokenization algorithm, memory overheads, and more—I did some research to figure out how to improve things. Here are the changes I made:

Optimized Implementation of the WordPiece Algorithm

The existing implementation of the WordPiece algorithm was optimized for production and included built-in parallelism, but it wasn’t designed for the ultra-low-overhead requirements we were targeting. So I refactored the WordPiece algorithm, implemented caching mechanisms, and resolved the various optimization issues I had discovered. As a result, the tokenization process became significantly leaner, with substantial reductions in overhead and no loss of accuracy.

The optimized implementation of the WordPiece tokenization algorithm delivered substantial performance improvements. These optimizations span the entire tokenization pipeline, including normalization, pre-tokenization, model tokenization, and post-processing.

And I was quite surprised to see the results!!

This custom tokenizer is 6.6X faster in overall throughput, processing 26,901 tokens per second compared to 4,077 tokens per second with the standard implementation. Moreover, across a range of text lengths, it consistently outperforms standard tokenizers, with speedups ranging from 7.3X to 9.8X. For instance, at 16,000 tokens, latency dropped from 14.1 ms to 1.4 ms.

The chart above shows BlazeText WordPiece processing about 26,900 tokens/sec versus 4,100 tokens/sec for Hugging Face Tokenizers, roughly 6.6× higher throughput.

The chart above shows that tokenization latency grows steadily with sequence length for all methods, rising from tens of microseconds at 512 tokens to tens of milliseconds at 64K tokens. Across every sequence length, the “Fast” variants process inputs about 8–10X faster than their standard counterparts, and BlazeText WordPiece Fast delivers the lowest latency of the four methods.

These gains came from a combination of techniques: better parallelization and batch processing, optimized lookups with fast paths for common cases, low-overhead caches that eliminate redundant computation, and more efficient string handling.
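To make the core idea concrete, here is a stripped-down sketch of greedy longest-match WordPiece with a word-level cache, the kind of "skip redundant work" optimization described above. The toy vocabulary and struct names are illustrative only; this is not the BlazeText implementation.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal greedy longest-match WordPiece over a toy vocabulary,
/// with a per-word cache so repeated words skip the matching loop entirely.
struct WordPiece {
    vocab: HashSet<String>,
    cache: HashMap<String, Vec<String>>, // word -> sub-tokens
    unk: String,
}

impl WordPiece {
    fn tokenize_word(&mut self, word: &str) -> Vec<String> {
        if let Some(hit) = self.cache.get(word) {
            return hit.clone(); // cache hit: no re-matching
        }
        let chars: Vec<char> = word.chars().collect();
        let mut pieces = Vec::new();
        let mut start = 0;
        while start < chars.len() {
            // Greedy longest match: shrink the candidate until it is in the vocab.
            let mut end = chars.len();
            let mut found = None;
            while end > start {
                let mut piece: String = chars[start..end].iter().collect();
                if start > 0 {
                    piece = format!("##{piece}"); // continuation prefix
                }
                if self.vocab.contains(&piece) {
                    found = Some(piece);
                    break;
                }
                end -= 1;
            }
            match found {
                Some(p) => { pieces.push(p); start = end; }
                None => { pieces = vec![self.unk.clone()]; break; }
            }
        }
        self.cache.insert(word.to_string(), pieces.clone());
        pieces
    }
}

fn main() {
    let vocab: HashSet<String> =
        ["token", "##ization", "##izer", "fast"].iter().map(|s| s.to_string()).collect();
    let mut wp = WordPiece { vocab, cache: HashMap::new(), unk: "[UNK]".into() };
    println!("{:?}", wp.tokenize_word("tokenization")); // ["token", "##ization"]
    println!("{:?}", wp.tokenize_word("tokenizer"));    // ["token", "##izer"]
}
```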

And of course, there was still room for improvement.

Optimized Unicode Character Classification

I found that, in existing implementations, every character was being classified through the same slow path, whether it was a simple ASCII letter or a complex Unicode symbol. This one-size-fits-all approach was costly.

To accelerate the tokenization process, I implemented an optimized Unicode character classification system. This approach uses fast paths for ASCII characters through bitmap lookups, optimized range checks for large Unicode blocks, binary searches for non-ASCII characters, and thread-local LRU caching for frequently accessed characters. These enhancements result in 1.8X to 32.4X speed improvements over existing implementations. ASCII operations are approximately 12X faster, and punctuation detection is about 32X faster. For example, checking if ‘a’ is alphabetic now takes a single array lookup instead of a complex Unicode table search.
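As an illustration of the ASCII fast path, the sketch below precomputes a 128-entry lookup table so classifying an ASCII character is a single indexed load, falling back to the general Unicode predicates in std for everything else. The flag names and table layout are my own simplification, not BlazeText's internals.

```rust
// Bit flags packed into one byte per ASCII character.
const ALPHA: u8 = 1 << 0;
const PUNCT: u8 = 1 << 1;
const SPACE: u8 = 1 << 2;

/// Build a 128-entry table once; each entry answers several class queries.
fn build_ascii_table() -> [u8; 128] {
    let mut table = [0u8; 128];
    for i in 0..128usize {
        let c = i as u8 as char;
        if c.is_ascii_alphabetic() { table[i] |= ALPHA; }
        if c.is_ascii_punctuation() { table[i] |= PUNCT; }
        if c.is_ascii_whitespace() { table[i] |= SPACE; }
    }
    table
}

fn is_alphabetic_fast(c: char, table: &[u8; 128]) -> bool {
    if (c as u32) < 128 {
        table[c as usize] & ALPHA != 0 // single array lookup
    } else {
        c.is_alphabetic()              // general Unicode path
    }
}

fn main() {
    let table = build_ascii_table();
    assert!(is_alphabetic_fast('a', &table));
    assert!(!is_alphabetic_fast('.', &table));
    assert!(is_alphabetic_fast('é', &table)); // non-ASCII falls back to std
    println!("ok");
}
```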

I built two implementations: a default optimized version designed for primarily-ASCII text, which is on average 10X faster, and a bitflags-based implementation optimized for Unicode-heavy text, which performs composite category checks significantly faster, with an average speedup of 12.7X.
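The bitflags-style variant can be sketched along these lines: pack the categories a character belongs to into one small integer, so a composite check such as "letter or digit" becomes a single bitwise test. This sketch assumes the community `bitflags` crate and invented flag names; it outlines the approach rather than BlazeText's actual types.

```rust
// deps: bitflags = "2"
use bitflags::bitflags;

bitflags! {
    /// Packed character-category flags, so composite checks collapse
    /// into one bitwise test on a precomputed value.
    #[derive(Clone, Copy)]
    struct CharClass: u8 {
        const LETTER = 1 << 0;
        const DIGIT  = 1 << 1;
        const SPACE  = 1 << 2;
        const PUNCT  = 1 << 3;
    }
}

/// In a real tokenizer this would come from a lookup table or cache;
/// here we compute it directly for clarity.
fn classify(c: char) -> CharClass {
    let mut flags = CharClass::empty();
    if c.is_alphabetic()        { flags |= CharClass::LETTER; }
    if c.is_numeric()           { flags |= CharClass::DIGIT; }
    if c.is_whitespace()        { flags |= CharClass::SPACE; }
    if c.is_ascii_punctuation() { flags |= CharClass::PUNCT; }
    flags
}

fn main() {
    // Composite check: "is this word-like?" is one intersects() call
    // instead of several independent predicate calls.
    let flags = classify('7');
    let word_like = CharClass::LETTER | CharClass::DIGIT;
    println!("{}", flags.intersects(word_like)); // true
}
```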

Optimized Unicode Normalization

The normalization process was thorough but inefficient, treating all text the same way regardless of its actual complexity. Unicode normalization ensures that text has a consistent representation, which is essential for accurate tokenization. So I implemented all four Unicode normalization forms—NFC, NFD, NFKC, and NFKD—with a strong emphasis on both speed and accuracy. The implementation incorporates ASCII fast paths, custom buffer management, optimized lookup tables, and zero-copy iterators.
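The ASCII fast path for normalization rests on a simple fact: pure-ASCII text is already in all four normalization forms, so it can be returned untouched without allocating. A minimal sketch, using the community `unicode-normalization` crate as the general-path fallback (not BlazeText's own implementation):

```rust
// deps: unicode-normalization (used here as the general-path fallback)
use std::borrow::Cow;
use unicode_normalization::{is_nfc_quick, IsNormalized, UnicodeNormalization};

/// NFC with a fast path: pure-ASCII (or quick-check-clean) input is returned
/// borrowed with zero copies; only the rest pays for decompose + recompose.
fn nfc_fast(s: &str) -> Cow<'_, str> {
    if s.is_ascii() || matches!(is_nfc_quick(s.chars()), IsNormalized::Yes) {
        Cow::Borrowed(s)              // fast path: nothing to do
    } else {
        Cow::Owned(s.nfc().collect()) // general Unicode path
    }
}

fn main() {
    // ASCII never allocates.
    assert!(matches!(nfc_fast("plain ascii text"), Cow::Borrowed(_)));

    // 'e' + combining acute accent gets recomposed into a single 'é'.
    let decomposed = "e\u{0301}";
    assert_eq!(nfc_fast(decomposed), "é");
    println!("ok");
}
```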

The optimized normalization library is up to 3.4X faster than standard reference implementations. Specifically, for ASCII processing, it is 3.2X faster for short text and 3.4X faster for long text. For larger ASCII inputs (1 KB and above), we achieve a 5.3X improvement in throughput.

Putting It All Together: BlazeText

The table below summarizes the key results from all the optimizations. Most importantly, I was able to reduce our guardrail library’s latency from 24ms to under 5ms for 32K-token contexts—finally hitting the target. (Of course, the pursuit of even lower latency continues…) I’ve packaged all these optimizations into a single library and named it BlazeText.

The BlazeText library bundles the refactored WordPiece algorithm, the caching mechanisms, and the fixes for the optimization issues described above. With it, tokenization becomes significantly leaner, with substantial reductions in overhead and no loss of accuracy.

| Category | Metric | Improvement |
| --- | --- | --- |
| Tokenization Performance | Throughput | Increased from 4,077 to 26,901 tokens/sec |
| Tokenization Performance | Latency at 16,000 tokens | Dropped from 14.1 ms to 1.4 ms |
| Tokenization Performance | Speedup across text lengths | 7× to 9× faster |
| Character Classification | ASCII operations | 12× faster |
| Character Classification | Punctuation detection | 32× faster |
| Character Classification | Overall Unicode handling | 2× to 32× faster (operation-dependent) |
| Normalization Speed | Typical text | 3× faster |
| Normalization Speed | Larger documents | 5× higher throughput |

The charts above show that BlazeText WordPiece Fast cuts tokenization latency by roughly 7–9X compared to Hugging Face’s Fast tokenizer and similarly outpaces HF Tokenizers across all sequence lengths (~120 µs vs. 200 µs at 512 tokens, and ~2.8 ms vs. 26 ms at 32K tokens). Peak speedup comes at 16K tokens (≈9.8× faster), with consistent ~7.3–9.8× gains across the board.

General Applications

While I built BlazeText to solve a specific problem, the optimizations apply broadly to any system where tokenization matters:

Real-time Applications
Chat systems, live translation, and interactive AI assistants all benefit from faster tokenization. When you’re trying to maintain a conversation, every millisecond counts.

High-throughput Processing
If you’re processing millions of documents, a 10x speedup in tokenization can significantly reduce infrastructure costs and processing time.

Edge Deployment
Faster tokenization means less CPU usage, which translates to better battery life and performance on mobile or embedded devices.

Streaming Systems
For applications processing continuous text streams, efficient tokenization prevents bottlenecks and reduces latency.

Lessons Learned

This project taught me a few things:

  1. Profile before optimizing. I would never have guessed tokenization was our bottleneck without measurement.
  2. Question assumptions. Just because something is widely used doesn’t mean it’s optimized for your use case.
  3. Understand your data. Most text is ASCII. Optimizing for the common case while handling the general case correctly yields huge wins.
  4. Small improvements compound. No single optimization gave us 10X. It was the combination of many 2X and 3X improvements.

The code still scales linearly with input length, but the constant-factor improvements make a real difference. Sometimes the biggest performance gains come from the places you least expect. In our case, it wasn’t the fancy ML models or complex algorithms that needed work. It was the humble tokenizer, quietly consuming 90% of our time budget.

Adarsh MS

I am a seasoned engineer with a strong foundation in Machine Learning and broad expertise spanning software engineering, databases, microservices, RPA, and data engineering. Since starting as an intern in embedded system automations, I have grown to lead advanced projects including multimedia search engines, NLP models, Generative AI research, and cross-functional clinical trial initiatives, while also contributing to RPA frameworks and MLOps. Currently, I’m focused on advancing Machine Learning research and open to meaningful collaborations.
