Introducing

Bud Cache

Accuracy-first LLM response caching powered by Resource Aware Attention

Reuse answers your model has already generated, safely, instantly and on ordinary CPUs, cutting inference cost and latency while holding the wrong-answer rate near zero (1.5% on CacheBench).

Why Caching Matters and Where It Breaks

Why caching matters

Lower bills, faster answers, free GPUs

Stop paying twice for the same answer

Inference is the biggest recurring bill, and it scales with traffic. Much of that traffic is the same questions asked again.

Answer in milliseconds, not seconds

A fresh model call takes hundreds of milliseconds, sometimes seconds. A served cache hit comes back near-instantly.

Free up GPUs for genuinely new work

Redundant calls tie up scarce, expensive GPUs. Serving repeats from cache frees that capacity for genuinely new requests.

Why today's caches fall short

They give you safety or savings, never both

Exact-match caches almost never hit

They reuse only on byte-for-byte identical requests. The same question phrased differently looks brand-new, so the hit rate stays near zero.
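To make the failure mode concrete, here is a minimal sketch of an exact-match cache, assuming the key is simply a hash of the raw request text. The prompts, the answer and the `lookup` helper are invented for illustration.

```python
import hashlib

def exact_key(prompt: str) -> str:
    """Key used by a byte-for-byte cache: a hash of the raw request bytes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# One previously answered request (illustrative content).
cache = {exact_key("What is your refund policy?"): "Refunds are accepted within 14 days."}

def lookup(prompt: str):
    """Return the stored answer, or None when the bytes differ in any way."""
    return cache.get(exact_key(prompt))

same = lookup("What is your refund policy?")      # identical bytes: hit
paraphrase = lookup("How do refunds work?")       # same question, new wording: miss
recased = lookup("what is your refund policy?")   # even a case change misses
```

Only the first lookup returns an answer. The paraphrase and the re-cased request both fall through to the model.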

Semantic caches serve wrong answers

They decide reuse from a single similarity score. Look-alikes that differ only in a number or a negation clear the threshold and are served as confidently wrong answers.

One threshold can't do safety and savings

One knob forces a trade-off: loosen it and wrong answers creep in; tighten it and the savings disappear.
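The trade-off can be seen on a toy example. The scores and labels below are invented for illustration and are not benchmark data. The sketch counts correct and wrong reuses as a single threshold moves.

```python
# Hypothetical (similarity score, truly equivalent?) pairs, invented for illustration.
PAIRS = [
    (0.97, True), (0.95, True), (0.94, False),   # 0.94: e.g. "14 days" vs "30 days"
    (0.91, True), (0.90, False), (0.86, True),
    (0.83, False), (0.78, True), (0.70, False),
]

def sweep(threshold: float) -> tuple[int, int]:
    """Return (correct reuses, wrong reuses) when reusing at score >= threshold."""
    good = sum(1 for score, equivalent in PAIRS if score >= threshold and equivalent)
    bad = sum(1 for score, equivalent in PAIRS if score >= threshold and not equivalent)
    return good, bad

for t in (0.75, 0.85, 0.92, 0.96):
    good, bad = sweep(t)
    print(f"threshold={t:.2f}  correct reuses={good}  wrong reuses={bad}")
```

On this toy data, the only threshold with zero wrong reuses (0.96) also keeps just one of the five correct reuses.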

What is Bud Cache

An accuracy-first response cache

When a request arrives, Bud Cache decides — in under a millisecond — whether it has already produced a trustworthy answer to a genuinely equivalent request. If so, it returns that answer instantly and the expensive model call is avoided. If not, the request passes through and the new answer is learned for next time.

  • 92% decision accuracy
  • 1.5% wrong-answer rate
  • 79% useful hit rate
  • 0.2 ms median hit latency

Measured on the sealed CacheBench held-out set (400 pairs); latency on a single commodity CPU core. On the same benchmark, GPTCache 0.1.44 scores 52.6% accuracy and a 68.2% wrong-answer rate.

True equivalence, not surface resemblance

Instead of collapsing each request into one similarity number, Bud Cache focuses on the parts of a request that actually determine the answer — the specific entities, quantities, constraints, negations and intent — and judges true equivalence.

It is deliberately biased toward not reusing when in doubt: a missed reuse merely costs one model call, whereas a wrong reuse costs trust.
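The asymmetry can be written as a simple expected-cost rule. This is a generic decision-theory sketch, not Bud Cache's actual rule. `call_cost` and `wrong_cost` are hypothetical inputs in whatever unit you use to price a model call and a wrong answer.

```python
def min_confidence(call_cost: float, wrong_cost: float) -> float:
    """Smallest probability of true equivalence at which reuse is worth it.

    Reusing risks (1 - p) * wrong_cost; not reusing always pays call_cost.
    Reuse is cheaper in expectation when p > 1 - call_cost / wrong_cost.
    """
    if call_cost < 0 or wrong_cost <= 0:
        raise ValueError("call_cost must be >= 0 and wrong_cost must be > 0")
    return max(0.0, 1.0 - call_cost / wrong_cost)

def should_reuse(p_equivalent: float, call_cost: float, wrong_cost: float) -> bool:
    """Reuse only when the estimated equivalence probability clears the bar."""
    return p_equivalent > min_confidence(call_cost, wrong_cost)

# If a wrong answer is judged 100x as costly as one model call,
# reuse needs more than 99% confidence.
bar = min_confidence(call_cost=1.0, wrong_cost=100.0)
```

The more expensive a wrong answer is relative to a model call, the closer the bar moves to certainty, which is the bias toward not reusing described above.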

Powered by Resource Aware Attention

Bud’s proprietary equivalence engine

“Resource aware” because it delivers this precise judgement within a tightly bounded compute and memory budget — so it runs in real time on ordinary CPUs, rather than requiring its own GPUs to guard the GPUs it is protecting.

Typical semantic cache, request “Refund within 14 days?”: similarity score 0.94

The whole request is crushed into one similarity number. “14 days” vs “30 days” barely moves it, so a wrong answer clears the threshold.

Verdict: reuse. On a look-alike, that is the wrong call.

Bud Cache with Resource Aware Attention, request “Refund within 14 days?”: entities, quantities, constraints, negations and intent are each compared.

The “14” vs “30” quantity mismatch is decisive. Verdict: do not reuse, pass to the model.
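Resource Aware Attention is proprietary and its internals are not described here. The sketch below is a deliberately crude stand-in that illustrates only the veto idea: if the numbers or negation words in two requests differ, do not reuse. The names `NEGATIONS`, `decisive_slots` and `slots_agree` are hypothetical.

```python
import re

# A tiny, illustrative negation list; contractions such as "can't" are not handled.
NEGATIONS = {"not", "no", "never", "without", "except"}

def decisive_slots(text: str) -> tuple[frozenset, frozenset]:
    """Extract the numbers and negation words that can flip an answer."""
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())
    numbers = frozenset(t for t in tokens if t[0].isdigit())
    negations = frozenset(t for t in tokens if t in NEGATIONS)
    return numbers, negations

def slots_agree(new_request: str, cached_request: str) -> bool:
    """False means veto reuse; True only means this check found no mismatch."""
    return decisive_slots(new_request) == decisive_slots(cached_request)

quantity_mismatch = slots_agree("Refund within 14 days?", "Refund within 30 days?")
paraphrase = slots_agree("Can I get a refund within 14 days?", "Refund within 14 days?")
negation_mismatch = slots_agree("Is shipping free?", "Is shipping not free?")
```

Agreement on these two slots is necessary but not sufficient. Entities, constraints and intent would still need to be compared before a reuse decision.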

The decision, in under a millisecond

How a request flows through Bud Cache

Every request is judged for true equivalence before any expensive model call is made.

Request arrives

An incoming request reaches Bud Cache, sitting in front of the model in the request path.

Resource Aware Attention

Judge true equivalence

In under a millisecond, decide whether a trustworthy answer to a genuinely equivalent request already exists.

Confident match: hit (~0.2 ms)

Reuse the stored answer

The model’s own previous answer to an equivalent request is returned instantly, so quality is unchanged.

No match: miss (never a wrong answer)

Pass through, then learn

The request goes to the model as normal; its new answer is learned, so equivalent future requests are served from cache.
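The flow above can be sketched as a thin wrapper in front of the model. This is an illustrative outline, not Bud Cache's API. `ResponseCache`, `fake_model` and `toy_equivalent` are hypothetical names, and the equivalence test is a placeholder that only normalises case and whitespace.

```python
from typing import Callable, Optional

class ResponseCache:
    """Hit: return the stored answer. Miss: call the model, then learn."""

    def __init__(self, is_equivalent: Callable[[str, str], bool]):
        self._is_equivalent = is_equivalent
        self._entries: list[tuple[str, str]] = []   # (request, answer) pairs

    def lookup(self, request: str) -> Optional[str]:
        # Linear scan for clarity; a real system would use an index.
        for cached_request, answer in self._entries:
            if self._is_equivalent(request, cached_request):
                return answer
        return None

    def answer(self, request: str, call_model: Callable[[str], str]) -> str:
        cached = self.lookup(request)
        if cached is not None:
            return cached                            # hit: no model call
        fresh = call_model(request)                  # miss: pass through
        self._entries.append((request, fresh))       # learn for next time
        return fresh

calls = []

def fake_model(prompt: str) -> str:
    """Stand-in for the expensive model call; records each invocation."""
    calls.append(prompt)
    return f"answer to: {prompt}"

def toy_equivalent(a: str, b: str) -> bool:
    """Placeholder equivalence judge: same text after trimming and lowercasing."""
    return a.strip().lower() == b.strip().lower()

cache = ResponseCache(toy_equivalent)
first = cache.answer("Refund within 14 days?", fake_model)    # miss
second = cache.answer("refund within 14 days? ", fake_model)  # hit
third = cache.answer("Refund within 30 days?", fake_model)    # miss
```

Three requests result in two model calls. The second request is served from the answer learned on the first.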

Built to be safe enough to deploy in production.

The capabilities that turn caching for LLMs from a risky optimisation into standard, trustworthy infrastructure.

01

Near-zero wrong-answer rate

Reuse answers with near-zero risk of serving a confidently incorrect one (1.5% wrong-answer rate on CacheBench). This is the safety property that makes caching deployable in customer-facing products.

02

High useful coverage

Reuses across genuine paraphrases and re-phrasings, not just identical strings — so savings are real, not theoretical.

03

Sub-millisecond, CPU-only

No GPU required for the cache tier and no meaningful latency added. Cheap to run and easy to place anywhere in the stack.

04

A statistical safety guarantee

The wrong-answer rate is held under a target you set — with a mathematical (PAC) bound, not just a hopeful threshold.
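The text does not say which bound is used. As one textbook illustration of what such a guarantee looks like, the sketch below applies a one-sided Hoeffding bound to a labelled sample of reuse decisions. It assumes the sample is independent and identically distributed, and the counts in the example are hypothetical.

```python
import math

def hoeffding_upper_bound(wrong: int, trials: int, delta: float) -> float:
    """With probability >= 1 - delta, the true wrong-answer rate is at most this.

    One-sided Hoeffding bound: observed rate + sqrt(ln(1/delta) / (2 * trials)).
    """
    if trials <= 0 or not 0 <= wrong <= trials or not 0.0 < delta < 1.0:
        raise ValueError("need trials > 0, 0 <= wrong <= trials, 0 < delta < 1")
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    return min(1.0, wrong / trials + slack)

def meets_target(wrong: int, trials: int, target: float, delta: float = 0.05) -> bool:
    """True when the bound, not just the observed rate, is under the target."""
    return hoeffding_upper_bound(wrong, trials, delta) <= target

# Hypothetical calibration sample: 4 wrong reuses out of 2,000 audited hits.
bound = hoeffding_upper_bound(wrong=4, trials=2000, delta=0.05)   # about 0.029
```

The observed rate here is 0.2%, but the guaranteed ceiling is about 2.9%. This is the difference between a bound and a hopeful threshold, and a larger sample tightens the bound.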

05

Self-tuning, zero-config

Calibrates itself to each workload and keeps the safety target on track automatically — no manual threshold tuning, no ML team required to operate it.

06

One-knob strictness

A single Lenient / Balanced / Strict setting trades coverage for caution to match the risk appetite of the use case; sensible defaults out of the box.

Business impact

Lower cost, faster responses, a safety profile you can stand behind.

Every safe cache hit is one model inference that never runs — so the savings are direct and proportional to the hit rate.

Lower cost

Directly and proportionally

Cost falls roughly one-for-one with the cache hit rate — with no change to answer quality, because reused answers are the model’s own previous answers to equivalent requests.
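As a back-of-the-envelope check, the sketch below computes model spend as a function of hit rate. The traffic, price and hit rate are hypothetical, and the cost of running the cache tier itself is ignored.

```python
def model_spend(requests: int, cost_per_call: float, hit_rate: float) -> float:
    """Model spend when a fraction hit_rate of requests never reach the model."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requests * (1.0 - hit_rate) * cost_per_call

# Hypothetical workload: 1,000,000 requests at 0.002 per call.
baseline = model_spend(1_000_000, 0.002, hit_rate=0.0)     # about 2000
with_cache = model_spend(1_000_000, 0.002, hit_rate=0.30)  # about 1400
saving = 1.0 - with_cache / baseline                       # about 0.30
```

A 30% hit rate removes 30% of the model spend, which is the one-for-one relationship described above.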

Faster

Orders of magnitude on a hit

A served hit returns in ~0.2 ms, versus hundreds to thousands of milliseconds for a fresh model call. That is also about 285× faster than a transformer-based semantic cache (~57 ms per request).

~7,500×
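The speed-up ratios follow from simple division. The sketch below reproduces the 285× figure from the quoted latencies and estimates mean latency at a given hit rate. The hit rate and model latency are hypothetical, and it assumes a miss pays the lookup time as well as the model call.

```python
HIT_MS = 0.2   # median hit latency quoted above

def speedup(slow_ms: float, fast_ms: float = HIT_MS) -> float:
    """How many times faster fast_ms is than slow_ms."""
    return slow_ms / fast_ms

def mean_latency_ms(hit_rate: float, model_ms: float, lookup_ms: float = HIT_MS) -> float:
    """Average latency: hits pay the lookup only, misses pay lookup plus model."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * lookup_ms + (1.0 - hit_rate) * (lookup_ms + model_ms)

vs_semantic_cache = speedup(57.0)      # ~57 ms transformer embedder: about 285x
vs_model_call = speedup(1500.0)        # an assumed 1,500 ms model call: about 7,500x
blended = mean_latency_ms(hit_rate=0.30, model_ms=800.0)   # about 560.2 ms
```

The per-hit speed-up is large, but the average a user sees depends on the hit rate, because misses still wait for the model.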
Safer

Actually deployable

Because the wrong-answer rate is held near zero with a statistical guarantee, Bud Cache can be switched on in customer-facing and regulated environments where a naive cache could not.

Greener & cheaper to operate

Commodity CPUs, not GPUs

The cache tier reduces GPU demand without adding GPU demand of its own — improving both the economics and the energy footprint of the overall system.

Production infrastructure — secure by default.

Beyond accuracy and speed, Bud Cache ships with the security, multi-tenancy, data-governance and operability controls an enterprise needs — and turns them on by default.

01

Security & access control

  • Tenant-scoped API keys; separate admin key, compared in constant time
  • Network-safe defaults — binds to loopback until access control is configured
  • Per-caller rate limiting and request-size limits
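For readers unfamiliar with constant-time comparison: checking a secret with `==` can leak, through timing, how many leading characters matched. The sketch below uses `hmac.compare_digest` from Python's standard library. The function name and keys are hypothetical and do not reflect Bud Cache's implementation.

```python
import hmac

def admin_key_matches(presented: str, expected: str) -> bool:
    """Compare keys without short-circuiting on the first differing byte."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))

EXPECTED_ADMIN_KEY = "example-admin-key"   # illustrative only; load real keys from a secret store

ok = admin_key_matches("example-admin-key", EXPECTED_ADMIN_KEY)
rejected = admin_key_matches("example-admin-kez", EXPECTED_ADMIN_KEY)
```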
02

Multi-tenancy & isolation

  • Strict per-tenant isolation — one tenant’s answers can never be reused for another
  • Verified with zero isolation violations in testing
  • Keys scoped per-tenant or organisation-wide
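One simple way to get this kind of isolation is to partition storage by tenant, so that a lookup can only ever see its own tenant's entries. The sketch below is an illustrative data-structure choice with hypothetical names, not a description of Bud Cache's internals.

```python
from collections import defaultdict
from typing import Optional

class TenantScopedStore:
    """Each tenant gets its own partition; lookups never cross partitions."""

    def __init__(self) -> None:
        self._by_tenant: dict[str, dict[str, str]] = defaultdict(dict)

    def put(self, tenant_id: str, request: str, answer: str) -> None:
        self._by_tenant[tenant_id][request] = answer

    def get(self, tenant_id: str, request: str) -> Optional[str]:
        # .get() on the outer dict avoids creating empty partitions on reads.
        return self._by_tenant.get(tenant_id, {}).get(request)

store = TenantScopedStore()
store.put("tenant-a", "What is our refund window?", "14 days")

own = store.get("tenant-a", "What is our refund window?")     # "14 days"
other = store.get("tenant-b", "What is our refund window?")   # None
```

Because the tenant ID is part of the lookup path rather than a filter applied afterwards, an identical question from another tenant is always a miss.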
03

GDPR-ready governance

  • Right to erasure (Art. 17): erased data stays erased even after a backup is restored
  • Automatic age-out of data and logs (Art. 5(1)(e))
  • Per-request never-store for sensitive traffic (Art. 5(1)(c))
04

Freshness & correctness

  • Configurable TTL per request, per domain and globally
  • Time-sensitive questions are automatically given short lifetimes and are never reused across a day boundary
  • Highly non-deterministic generations are not cached
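The freshness rules can be sketched as two small functions. The precedence order (request, then domain, then global), the TTL values and the use of UTC for the day boundary are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GLOBAL_TTL = timedelta(hours=24)                # hypothetical global default
DOMAIN_TTL = {"pricing": timedelta(hours=1)}    # hypothetical per-domain override

def effective_ttl(request_ttl: Optional[timedelta], domain: Optional[str]) -> timedelta:
    """Most specific setting wins: per request, then per domain, then global."""
    if request_ttl is not None:
        return request_ttl
    return DOMAIN_TTL.get(domain, GLOBAL_TTL)

def is_fresh(stored_at: datetime, now: datetime, ttl: timedelta,
             time_sensitive: bool = False) -> bool:
    """An entry is reusable only within its TTL and, if time-sensitive, the same day."""
    if now - stored_at > ttl:
        return False
    if time_sensitive and stored_at.date() != now.date():
        return False
    return True

stored = datetime(2024, 1, 1, 23, 50, tzinfo=timezone.utc)
later = datetime(2024, 1, 2, 0, 5, tzinfo=timezone.utc)   # 15 minutes on, next day

ordinary = is_fresh(stored, later, timedelta(hours=1))                       # True
dated = is_fresh(stored, later, timedelta(hours=1), time_sensitive=True)     # False
```

The second check is what stops a question like "what is today's date?" from being answered with yesterday's entry even though the TTL has not expired.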
05

Data protection at rest

  • Least-privilege, owner-only file permissions (0600)
  • Runs on an encrypted volume or a cloud KMS-backed disk
  • Application-level encryption seam for envelope encryption
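On POSIX systems, owner-only permissions can be applied when the file is created, so the file is never briefly readable by others. A minimal sketch, with a hypothetical file name and helper:

```python
import os
import stat
import tempfile

def write_owner_only(path: str, data: bytes) -> None:
    """Create or overwrite a file readable and writable by its owner only (0600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        os.fchmod(fd, 0o600)   # also tightens a file that already existed
        handle.write(data)

with tempfile.TemporaryDirectory() as tmp:
    snapshot = os.path.join(tmp, "cache.snapshot")   # hypothetical file name
    write_owner_only(snapshot, b"example")
    mode = stat.S_IMODE(os.stat(snapshot).st_mode)
```

Here `mode` comes back as `0o600`. File permissions complement, rather than replace, the volume-level encryption listed above.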
06

Operability & reliability

  • Prometheus metrics, health endpoint, structured tracing
  • Zero-downtime restarts with snapshot and warm-start
  • Proven under shadow-replay, soak and chaos testing

Where Bud Cache fits.

Any workload where the same questions recur in countless phrasings is a high-redundancy, high-value candidate for safe reuse.

Platforms

LLM API platforms & gateways

Absorb redundant traffic across many customers, cutting GPU spend per request while guaranteeing answers are never mismatched across tenants.

Support

Customer support & FAQ assistants

Support queues see the same handful of questions arrive again and again in different words, which makes them a natural fit for safe reuse.

Internal

Enterprise copilots & knowledge bots

Employees ask overlapping questions about policies, code and docs; reuse is high and correctness matters.

Retrieval

Retrieval-augmented (RAG) Q&A

Repeated questions over a stable corpus reuse cleanly; freshness rules keep answers current when the corpus changes.

Agents

Agentic / tool-using systems

Multi-step agents repeat sub-queries and planning steps; caching them shortens chains and cuts cost per task.

Consumer

High-traffic consumer chat

Popular prompts and trending questions are served instantly from cache, freeing capacity for the long tail.

Head-to-head, measured

Bud Cache vs. GPTCache 0.1.44, under the same CacheBench protocol.

| Metric (CacheBench) | GPTCache 0.1.44 | Bud Cache | Advantage |
| --- | --- | --- | --- |
| Accuracy (higher is better) | 52.6% | 92.0% | +39 points |
| Wrong-answer rate, false-hit (lower is better) | 68.2% | 1.5% | ~45× lower |
| Adversarial wrong-answer rate (lower is better) | 96.3% | 2.4% | ~40× lower |
| Coverage, recall (higher is better) | 93.3% | 79.4% | see note |
| Latency per request | ~57 ms | 0.2 ms | ~285× faster |
| Cache-tier hardware | transformer embedder | commodity CPU | no GPU needed |

GPTCache’s higher raw coverage is misleading: it reuses so aggressively that 68% of the cases it should not reuse are served wrongly. Bud Cache reuses slightly less but is almost never wrong — the trade a production system actually wants. Any cache that reuses on a single similarity threshold shares the same profile.
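To show how the three headline metrics relate, the sketch below computes them from a confusion matrix. The definitions are the usual ones implied by the labels "false-hit" and "recall", and are assumed rather than taken from the benchmark's documentation. The counts are illustrative, chosen so the ratios land near Bud Cache's reported figures; they are not the actual CacheBench counts.

```python
def cache_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """tp: correct reuse, fp: wrong reuse, tn: correct pass-through, fn: missed reuse."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,            # all correct decisions
        "wrong_answer_rate": fp / (fp + tn),      # share of should-not-reuse cases reused
        "coverage": tp / (tp + fn),               # share of should-reuse cases reused
    }

# Illustrative counts only (400 decisions in total).
example = cache_metrics(tp=108, fp=4, tn=260, fn=28)
```

With these counts, accuracy is 92%, the wrong-answer rate about 1.5% and coverage about 79%. The example also shows why coverage alone is misleading: a cache that reuses everything would score 100% coverage alongside a 100% wrong-answer rate.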

Caching for LLMs that finally works

Ready to cut inference cost without the risk?

Talk to our team about putting an accuracy-first cache in front of your LLM — multi-tenant, secure by default, and deployable on commodity CPUs.