Introducing

Bud Cache

Accuracy-first LLM response caching powered by Resource Aware Attention

Reuse answers your model has already generated — safely, instantly and on ordinary CPUs — cutting inference cost and latency without ever serving a wrong answer.

The problem

Why caching matters and where it breaks

Why?

Lower bills, faster answers

Cut inference costs on repeat questions

Near-instant response times

More GPU capacity for new work

Why today's caches fall short

They give you safety or savings, never both

Exact-match caches almost never hit

Semantic caches serve wrong answers

One threshold can't do safety and savings

What is Bud Cache

The enterprise-grade, accuracy-first response cache for Large Language Models

When a request arrives, Bud Cache decides — in under a millisecond — whether it has already produced a trustworthy answer to a genuinely equivalent request. If so, it returns that answer instantly and the expensive model call is avoided. If not, the request passes through and the new answer is learned for next time.
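
The hit-or-miss flow described above can be sketched as a thin wrapper around the model call. This is a minimal illustration rather than Bud Cache's implementation: `ResponseCache`, `normalise`, `answer` and `call_model` are hypothetical names, and the equivalence check is reduced to a normalised-string lookup purely to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


def normalise(text: str) -> str:
    # Placeholder equivalence key: lowercase and collapse whitespace.
    # Assumption for illustration only; a real equivalence check is richer.
    return " ".join(text.lower().split())


@dataclass
class ResponseCache:
    # Hypothetical in-memory store mapping a request key to a stored answer.
    store: dict[str, str] = field(default_factory=dict)

    def lookup(self, request: str) -> Optional[str]:
        return self.store.get(normalise(request))

    def learn(self, request: str, answer: str) -> None:
        self.store[normalise(request)] = answer


def answer(request: str, cache: ResponseCache, call_model: Callable[[str], str]) -> str:
    cached = cache.lookup(request)
    if cached is not None:
        return cached            # hit: the model call is avoided
    fresh = call_model(request)  # miss: the request passes through
    cache.learn(request, fresh)  # the new answer is remembered for next time
    return fresh
```

Keeping the model call behind a plain callable means the same wrapper can sit in front of any backend.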

92%
Decision accuracy
1.5%
Wrong-answer rate
79%
Useful hit rate
0.2ms
Median hit latency

Measured on the sealed CacheBench held-out set (400 pairs); latency on a single commodity CPU core. On the same benchmark, GPTCache 0.1.44 scores 52.6% accuracy and a 68.2% wrong-answer rate.

How it works

Bud Cache reuses what’s equivalent — not what merely looks similar

Sitting in front of your model, Bud Cache reads the parts of a request that actually decide the answer — entities, quantities, constraints, negations and intent — and reuses an answer only when two requests are genuinely equivalent.
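
To make the difference between "looks similar" and "is equivalent" concrete, here is a toy sketch. It is not Bud Cache's method: it covers only two of the listed signals (quantities and negation), the word list and the `min_overlap` value are assumptions, and every function name is hypothetical. Entity, constraint and intent handling are left out.

```python
import re

# Assumed, deliberately tiny negation list for illustration.
NEGATIONS = {"not", "no", "never", "without", "except"}


def tokens(request: str) -> list[str]:
    # Words and numbers, lowercased.
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", request.lower())


def deciding_features(request: str) -> tuple[frozenset[str], bool]:
    toks = tokens(request)
    quantities = frozenset(t for t in toks if t[0].isdigit())
    negated = any(t in NEGATIONS for t in toks)
    return quantities, negated


def surface_overlap(a: str, b: str) -> float:
    # Jaccard overlap of token sets: a crude stand-in for "looks similar".
    ta, tb = set(tokens(a)), set(tokens(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def safe_to_reuse(a: str, b: str, min_overlap: float = 0.6) -> bool:
    # Any mismatch in an answer-deciding feature vetoes reuse,
    # however similar the two requests look on the surface.
    if deciding_features(a) != deciding_features(b):
        return False
    return surface_overlap(a, b) >= min_overlap
```

With these definitions, "Convert 5 miles to km" and "Convert 50 miles to km" share four of six distinct tokens yet are rejected because the quantities differ, while "How do I reset my password" and "How can I reset my password" pass.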

Request flow (diagram): Request → Bud Cache equivalence check (entities, quantities, constraints, negation, intent; <1 ms, CPU-only).
Hit: response served from cache (≈0.2 ms, p99 ≈1 ms).
Miss: model call (≈1,500 ms); the model's response is returned to the caller and cached for next time.

A hit is ~7,500× faster than a model call — and ~285× faster than a transformer semantic-cache lookup (~57 ms).
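
The two ratios follow directly from the latencies quoted on this page. The short sketch below reproduces them and adds a blended-latency estimate; `hit_share` (the fraction of traffic served from cache) and the assumption that a miss pays the lookup plus the model call are illustrative, not measured values.

```python
HIT_MS = 0.2               # median hit latency quoted above
MODEL_CALL_MS = 1500.0     # model call latency quoted above
SEMANTIC_LOOKUP_MS = 57.0  # transformer semantic-cache lookup quoted above


def speedup(slow_ms: float, fast_ms: float) -> float:
    return slow_ms / fast_ms


def blended_latency_ms(hit_share: float) -> float:
    """Mean latency when `hit_share` of traffic is served from cache.

    Assumption: a miss pays the lookup (taken as HIT_MS) plus the model call.
    """
    if not 0.0 <= hit_share <= 1.0:
        raise ValueError("hit_share must be between 0 and 1")
    miss_ms = HIT_MS + MODEL_CALL_MS
    return hit_share * HIT_MS + (1.0 - hit_share) * miss_ms


print(round(speedup(MODEL_CALL_MS, HIT_MS)))       # 7500
print(round(speedup(SEMANTIC_LOOKUP_MS, HIT_MS)))  # 285
```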

Biased toward caution by design: a missed reuse costs one model call — a wrong reuse costs trust.

Key Results

Outperforming the best cache systems on the market today.

01

Near-zero wrong-answer rate

Reuse answers without the risk of serving a confidently incorrect one — the safety property that makes caching deployable in customer-facing products.

02

High useful coverage

Reuses across genuine paraphrases and re-phrasings, not just identical strings — so savings are real, not theoretical.

03

Sub-millisecond, CPU-only

No GPU required for the cache tier and no meaningful latency added. Cheap to run and easy to place anywhere in the stack.

04

A statistical safety guarantee

The wrong-answer rate is held under a target you set — with a mathematical (PAC) bound, not just a hopeful threshold.
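
One standard way to turn a measured false-hit count into a statement that holds with high probability is a Hoeffding upper confidence bound. The page does not say which bound Bud Cache uses, so the sketch below is an illustration of the general idea only; the function names and the default `delta` are assumptions.

```python
import math


def hoeffding_upper_bound(false_hits: int, trials: int, delta: float) -> float:
    """Upper bound on the true wrong-answer rate, valid with probability >= 1 - delta.

    Hoeffding's inequality: p <= p_hat + sqrt(ln(1/delta) / (2 * n)).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be strictly between 0 and 1")
    observed = false_hits / trials
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    return min(1.0, observed + margin)


def meets_target(false_hits: int, trials: int, target: float, delta: float = 0.05) -> bool:
    # The target is met only if the bound, not just the raw rate, is under it.
    return hoeffding_upper_bound(false_hits, trials, delta) <= target
```

The margin shrinks with the square root of the sample size, which is why a bound of this kind needs a reasonably large calibration set before it can certify a small target.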

05

Self-tuning, zero-config

Calibrates itself to each workload and keeps the safety target on track automatically — no manual threshold tuning, no ML team required to operate it.

06

One-knob strictness

A single Lenient / Balanced / Strict setting trades coverage for caution to match the risk appetite of the use case; sensible defaults out of the box.

Head-to-head, measured

Bud Cache vs. GPTCache

Metric | GPTCache 0.1.44 | Bud Cache
Accuracy (higher is better) | 52.6% | 92.0%
Wrong-answer rate (lower is better) | 68.2% | 1.5%
Adversarial wrong-answer rate (lower is better) | 96.3% | 2.4%
Coverage / recall (higher is better) | 93.3% | 79.4%
Latency per request | ~57 ms (transformer embedder) | 0.2 ms (commodity CPU, no GPU), ~285× faster
Chart: Wrong answers traded for coverage. X-axis: coverage (share of true matches reused), 0% to 100%. Y-axis: wrong-answer rate, 0% to 100%. Series: GPTCache at every threshold setting; Bud Cache (measured) at 79% coverage and 1.5% wrong; a 5% risk ceiling line.

Every threshold setting available to a single-similarity cache trades wrong answers for coverage. Bud Cache operates below that curve entirely — high useful coverage with a wrong-answer rate held under the risk ceiling.
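
For readers who want to reproduce numbers of this kind on their own traffic, the three headline metrics can be computed from a labelled set of request pairs. The sketch reads "wrong-answer rate" as the share of non-equivalent pairs that were reused anyway and "coverage" as the share of true matches reused, which matches how the terms are used on this page; the class name and the counts in the usage note are illustrative, not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    true_hits: int    # equivalent pair, answer reused
    missed_hits: int  # equivalent pair, answer not reused
    false_hits: int   # non-equivalent pair, answer reused (a wrong answer)
    true_misses: int  # non-equivalent pair, passed through to the model

    @property
    def accuracy(self) -> float:
        total = self.true_hits + self.missed_hits + self.false_hits + self.true_misses
        return (self.true_hits + self.true_misses) / total

    @property
    def coverage(self) -> float:
        return self.true_hits / (self.true_hits + self.missed_hits)

    @property
    def wrong_answer_rate(self) -> float:
        return self.false_hits / (self.false_hits + self.true_misses)
```

For example, `Outcomes(true_hits=80, missed_hits=20, false_hits=2, true_misses=98)` gives 89% accuracy, 80% coverage and a 2% wrong-answer rate.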

No cache vs exact-match cache vs naive semantic cache vs Bud Cache

Capability | No cache | Exact-match KV | GPTCache / naive semantic | Bud Cache
Reuses paraphrases | No | No | Yes | Yes
Useful hit rate | None | Very low | High only if unsafe | High (79%)
Wrong-answer risk | None | None | High (68%) | Near zero (1.5%)
Statistical safety guarantee | - | None | None | Yes (bounded false-hit)
Multi-tenant isolation | - | Manual | Not built-in | Strict, built-in
Auth & rate-limiting | - | Yes (mature) | No (library) | Secure by default
GDPR erasure (survives restore) | - | Delete only | No | Yes (tombstoned)
Data / log retention age-out | - | TTL only | No | Configurable
Encryption at rest | - | Yes (enterprise) | No | 0600 + volume/KMS seam
Freshness / time-sensitivity | - | No | No | Built in
Standard cache-control (RFC 9111) | - | No | No | Yes
Observability (metrics / health) | - | Yes | Limited | Prometheus + health
Self-tuning | - | n/a | Manual threshold | Automatic
CPU-only, sub-millisecond | - | Yes | Varies | Yes (0.2 ms)

Reuse is only the first requirement. Here is how Bud Cache compares with the alternatives an enterprise actually weighs — from no cache at all, to a generic exact-match key-value store, to a naive semantic cache.

Business impact

Lower cost, faster responses, a safety profile you can stand behind

Cut inference cost

Cost falls roughly one-for-one with the cache hit rate, with no change to answer quality.
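
The "one-for-one" relationship is simple arithmetic: every request served from cache is a model call that is not paid for. A minimal sketch, with hypothetical parameter names and ignoring the (small) cost of running the cache tier itself:

```python
def inference_cost(requests: int, cost_per_call: float, hit_share: float) -> float:
    """Model spend when `hit_share` of requests are served from cache.

    Assumption: cache hits cost nothing and every miss costs `cost_per_call`.
    """
    if not 0.0 <= hit_share <= 1.0:
        raise ValueError("hit_share must be between 0 and 1")
    return requests * (1.0 - hit_share) * cost_per_call


def saving_share(hit_share: float) -> float:
    # Fraction of the uncached bill that is avoided: equal to the hit share.
    return 1.0 - inference_cost(1, 1.0, hit_share)
```

Under these assumptions a 40% hit share removes 40% of the model bill.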

Answer in under a millisecond

A served hit returns in ~0.2 ms versus roughly 1,500 ms for a fresh model call.

Safe to deploy in production

A near-zero wrong-answer rate, backed by a statistical guarantee, makes it safe for customer-facing and regulated use.

Runs on CPUs, not GPUs

It cuts GPU demand without adding any of its own — better economics and a smaller energy footprint.

Enterprise & security

Production infrastructure — secure by default

Beyond accuracy and speed, Bud Cache ships with the security, multi-tenancy, data-governance and operability controls an enterprise needs — and turns them on by default.

01

Security & access control

  • Tenant-scoped API keys; separate admin key, compared in constant time
  • Network-safe defaults — binds to loopback until access control is configured
  • Per-caller rate limiting and request-size limits
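
Two of the controls listed above have well-known building blocks in Python's standard library and in common practice. The sketch below shows a constant-time key comparison with `hmac.compare_digest` and a per-caller token bucket; the function and class names, and the idea of passing the clock in as `now`, are illustrative choices rather than Bud Cache's actual interfaces.

```python
import hmac
from dataclasses import dataclass


def admin_key_matches(presented: str, expected: str) -> bool:
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking key prefixes through response timing.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


@dataclass
class TokenBucket:
    capacity: float           # maximum burst size, in requests
    refill_per_second: float  # sustained request rate
    tokens: float             # tokens currently available
    last_refill: float        # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Top up for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per caller (for example, keyed by API key) gives per-caller limiting; passing the time in explicitly keeps the limiter easy to test.
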
02

Multi-tenancy & isolation

  • Strict per-tenant isolation — one tenant’s answers can never be reused for another
  • Verified with zero isolation violations in testing
  • Keys scoped per-tenant or organisation-wide
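
A common way to make cross-tenant reuse structurally impossible is to make the tenant identifier part of every storage key, so a lookup can only ever see its own tenant's entries. The following is a minimal sketch of that idea with hypothetical names; it does not describe how Bud Cache stores data internally.

```python
from typing import Optional


class TenantScopedStore:
    """Toy store in which every entry is keyed by (tenant, request key)."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], str] = {}

    def put(self, tenant_id: str, request_key: str, answer: str) -> None:
        self._entries[(tenant_id, request_key)] = answer

    def get(self, tenant_id: str, request_key: str) -> Optional[str]:
        # There is no code path that reads without a tenant id, so one
        # tenant's answer cannot be returned for another tenant's request.
        return self._entries.get((tenant_id, request_key))
```
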
03

GDPR-ready governance

  • Right to erasure (Art. 17) — survives backup restoration
  • Automatic age-out of data and logs (Art. 5(1)(e))
  • Per-request never-store for sensitive traffic (Art. 5(1)(c))
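
Erasure that survives a backup restore is usually achieved with tombstones: a record of what was erased, and when, that is kept separately from the data and consulted on every restore. The sketch below illustrates the pattern under that assumption; the class and field names are hypothetical and timestamps are passed in explicitly for clarity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    answer: str
    stored_at: float  # seconds since some epoch


@dataclass
class ErasableStore:
    entries: dict[str, Entry] = field(default_factory=dict)
    tombstones: dict[str, float] = field(default_factory=dict)  # key -> erasure time

    def put(self, key: str, answer: str, now: float) -> None:
        self.entries[key] = Entry(answer, now)

    def erase(self, key: str, now: float) -> None:
        self.entries.pop(key, None)
        self.tombstones[key] = now  # remembered even after the data is gone

    def snapshot(self) -> dict[str, Entry]:
        return dict(self.entries)

    def restore(self, backup: dict[str, Entry]) -> None:
        # Drop anything stored at or before its erasure time, so an older
        # backup cannot bring erased data back.
        self.entries = {
            key: entry
            for key, entry in backup.items()
            if entry.stored_at > self.tombstones.get(key, float("-inf"))
        }
```
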
04

Freshness & correctness

  • Configurable TTL per request, per domain and globally
  • Time-sensitive questions auto-given short lifetimes, never reused across a day boundary
  • Highly non-deterministic generations are not cached
  • Honours standard HTTP cache directives (no-store / no-cache / max-age / min-fresh) per request — RFC 9111 compliant, so existing caching policy carries straight over
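
The request directives named above have precise meanings in RFC 9111, and a per-request decision can be sketched in a few lines. This is an illustrative reading of the standard, not Bud Cache's code: the function names are hypothetical, `no-cache` is simplified to a forced miss (the RFC allows reuse after revalidation), and only the four directives listed are handled.

```python
from typing import Optional


def parse_cache_control(header: str) -> dict[str, Optional[int]]:
    """Parse a Cache-Control header into {directive: integer value or None}."""
    directives: dict[str, Optional[int]] = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        name = name.strip().lower()
        if not name:
            continue
        value = value.strip().strip('"')
        directives[name] = int(value) if value.isdigit() else None
    return directives


def may_serve(directives: dict[str, Optional[int]], age_s: int, ttl_s: int) -> bool:
    """Decide whether a stored answer of age `age_s` with lifetime `ttl_s` may be reused."""
    if "no-cache" in directives:
        return False  # simplification: treat "must revalidate" as a miss
    if age_s >= ttl_s:
        return False  # stale
    max_age = directives.get("max-age")
    if max_age is not None and age_s > max_age:
        return False  # older than the caller will accept
    min_fresh = directives.get("min-fresh")
    if min_fresh is not None and ttl_s - age_s < min_fresh:
        return False  # would not stay fresh for long enough
    return True


def may_store(directives: dict[str, Optional[int]]) -> bool:
    # no-store forbids storing; it does not by itself forbid serving a stored answer.
    return "no-store" not in directives
```
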
05

Data protection at rest

  • Least-privilege, owner-only file permissions (0600)
  • Runs on encrypted volume or cloud KMS-backed disk
  • Application-level encryption seam for envelope encryption
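
Owner-only permissions are straightforward to enforce at file-creation time on POSIX systems. The sketch below uses only Python's standard `os` module; the function name is hypothetical, and `os.fchmod` is Unix-only, so this is an assumption about the deployment platform rather than a description of Bud Cache's snapshot code.

```python
import os


def write_owner_only(path: str, data: bytes) -> None:
    """Write `data` to `path` so that only the owning user can read or write it."""
    # Mode 0o600 applies when the file is created; the process umask can
    # only remove permission bits, never add them.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        # Tighten the mode explicitly in case the file already existed
        # with looser permissions.
        os.fchmod(handle.fileno(), 0o600)
        handle.write(data)
```
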
06

Operability & reliability

  • Prometheus metrics, health endpoint, structured tracing
  • Zero-downtime restarts with snapshot and warm-start
  • Proven under shadow-replay, soak and chaos testing

Where it fits

Use cases

Platforms

LLM API platforms & gateways

Absorb redundant traffic across many customers, cutting GPU spend per request while guaranteeing answers are never mismatched across tenants.

Support

Customer support & FAQ assistants

The same questions recur endlessly in countless phrasings — the ideal high-redundancy, high-value workload for safe reuse.

Internal

Enterprise copilots & knowledge bots

Employees ask overlapping questions about policies, code and docs; reuse is high and correctness matters.

Retrieval

Retrieval-augmented (RAG) Q&A

Repeated questions over a stable corpus reuse cleanly; freshness rules keep answers current when the corpus changes.

Agents

Agentic / tool-using systems

Multi-step agents repeat sub-queries and planning steps; caching them shortens chains and cuts cost per task.

Consumer

High-traffic consumer chat

Popular prompts and trending questions are served instantly from cache, freeing capacity for the long tail.

FAQ

Common questions

Time-sensitive questions (prices, “today”, “latest”) are automatically given short lifetimes and are never reused across a day boundary. TTL is configurable per request, per domain and globally, and highly non-deterministic generations aren’t cached at all.

An exact-match cache only reuses byte-for-byte identical requests, so it almost never hits on natural language. Bud Cache reuses genuine paraphrases while holding the wrong-answer rate near zero — and ships with multi-tenant isolation, GDPR governance and observability that a generic key-value cache doesn’t provide for LLM answers.

The cache tier runs entirely on commodity CPUs, with no GPU required: a served hit returns in ~0.2 ms on a single core. It reduces GPU demand without adding any of its own.

Volatile questions are automatically detected and given short lifetimes, never reused across a day boundary. Callers can also mark specific requests as never-store or must-revalidate.

Admin-gated erasure removes a single entry or an entire tenant’s data on demand, and erasure survives backup restoration: restoring an older backup never resurrects erased data. Stored data and logs also age out automatically on a configurable retention window.

GPTCache reuses on a single similarity threshold, so 68% of the cases it shouldn’t reuse get served wrongly. Bud Cache judges true equivalence rather than surface resemblance, reaching 92% accuracy at a 1.5% wrong-answer rate — and runs ~285× faster because it doesn’t run a transformer on every lookup.

Caching for LLMs that finally works

Ready to cut inference cost without the risk?

Talk to our team about putting an accuracy-first cache in front of your LLM — multi-tenant, secure by default, and deployable on commodity CPUs.