Accuracy-first LLM response caching powered by Resource Aware Attention
Reuse answers your model has already generated — safely, instantly and on ordinary CPUs — cutting inference cost and latency without ever serving a wrong answer.
Inference is the biggest recurring bill, and it scales with traffic. Much of that traffic is the same questions asked again.
A fresh model call takes hundreds of milliseconds, sometimes seconds. A served cache hit comes back near-instantly.
Redundant calls tie up scarce, expensive GPUs. Serving repeats from cache frees that capacity for genuinely new requests.
They reuse only on byte-for-byte identical requests. The same question phrased differently looks brand-new, so the hit rate stays near zero.
They reuse on one similarity score. Look-alikes that differ in a number or a negation slip through as confidently wrong answers.
One knob forces a trade-off: loosen it and wrong answers creep in; tighten it and the savings disappear.
When a request arrives, Bud Cache decides — in under a millisecond — whether it has already produced a trustworthy answer to a genuinely equivalent request. If so, it returns that answer instantly and the expensive model call is avoided. If not, the request passes through and the new answer is learned for next time.
Measured on the sealed CacheBench held-out set (400 pairs); latency on a single commodity CPU core. On the same benchmark, GPTCache 0.1.44 scores 52.6% accuracy and a 68.2% wrong-answer rate.
Instead of collapsing each request into one similarity number, Bud Cache focuses on the parts of a request that actually determine the answer — the specific entities, quantities, constraints, negations and intent — and judges true equivalence.
It is deliberately biased toward not reusing when in doubt: a missed reuse merely costs one model call, whereas a wrong reuse costs trust.
“Resource aware” because it delivers this precise judgement within a tightly bounded compute and memory budget — so it runs in real time on ordinary CPUs, rather than requiring its own GPUs to guard the GPUs it is protecting.
The whole request is crushed into one similarity number. “14 days” vs “30 days” barely moves it — so a wrong answer clears the threshold.
Verdict: reuse — on a look-alike, that is the wrong call.
The “14” vs “30” quantity mismatch is decisive — verdict: do not reuse, pass to the model.
Every request is judged for true equivalence before any expensive model call is made.
An incoming request reaches Bud Cache, sitting in front of the model in the request path.
In under a millisecond, decide whether a trustworthy answer to a genuinely equivalent request already exists.
The model’s own previous answer to an equivalent request is returned instantly — quality is unchanged.
The request goes to the model as normal; its new answer is learned, so equivalent future requests are served from cache.
The capabilities that turn caching for LLMs from a risky optimisation into standard, trustworthy infrastructure.
Reuse answers without the risk of serving a confidently incorrect one — the safety property that makes caching deployable in customer-facing products.
Reuses across genuine paraphrases and re-phrasings, not just identical strings — so savings are real, not theoretical.
No GPU required for the cache tier and no meaningful latency added. Cheap to run and easy to place anywhere in the stack.
The wrong-answer rate is held under a target you set — with a mathematical (PAC) bound, not just a hopeful threshold.
Calibrates itself to each workload and keeps the safety target on track automatically — no manual threshold tuning, no ML team required to operate it.
A single Lenient / Balanced / Strict setting trades coverage for caution to match the risk appetite of the use case; sensible defaults out of the box.
Every safe cache hit is one model inference that never runs — so the savings are direct and proportional to the hit rate.
Cost falls roughly one-for-one with the cache hit rate — with no change to answer quality, because reused answers are the model’s own previous answers to equivalent requests.
A served hit returns in ~0.2 ms versus the hundreds-to-thousands of milliseconds of a fresh model call — and about 285× faster than a transformer-based semantic cache.
Because the wrong-answer rate is held near zero with a statistical guarantee, Bud Cache can be switched on in customer-facing and regulated environments where a naive cache could not.
The cache tier reduces GPU demand without adding GPU demand of its own — improving both the economics and the energy footprint of the overall system.
Beyond accuracy and speed, Bud Cache ships with the security, multi-tenancy, data-governance and operability controls an enterprise needs — and turns them on by default.
Any workload where the same questions recur in countless phrasings is a high-redundancy, high-value candidate for safe reuse.
Absorb redundant traffic across many customers, cutting GPU spend per request while guaranteeing answers are never mismatched across tenants.
The same questions recur endlessly in countless phrasings — the ideal high-redundancy, high-value workload for safe reuse.
Employees ask overlapping questions about policies, code and docs; reuse is high and correctness matters.
Repeated questions over a stable corpus reuse cleanly; freshness rules keep answers current when the corpus changes.
Multi-step agents repeat sub-queries and planning steps; caching them shortens chains and cuts cost per task.
Popular prompts and trending questions are served instantly from cache, freeing capacity for the long tail.
| Metric (CacheBench) | GPTCache 0.1.44 | Bud Cache | Advantage |
|---|---|---|---|
| Accuracy | 52.6% | 92.0% | +39 points |
| Wrong-answer rate (false-hit) | 68.2% | 1.5% | ~45× lower |
| Adversarial wrong-answer rate | 96.3% | 2.4% | ~40× lower |
| Coverage (recall) | 93.3% | 79.4% | see note |
| Latency per request | ~57 ms | 0.2 ms | ~285× faster |
| Cache-tier hardware | transformer embedder | commodity CPU | no GPU needed |
GPTCache’s higher raw coverage is misleading: it reuses so aggressively that 68% of the cases it should not reuse are served wrongly. Bud Cache reuses slightly less but is almost never wrong — the trade a production system actually wants. Any cache that reuses on a single similarity threshold shares the same profile.
Talk to our team about putting an accuracy-first cache in front of your LLM — multi-tenant, secure by default, and deployable on commodity CPUs.