Caching is necessary for low-latency Agentic RAG. It is also dangerous if it ignores context.
Two questions can look similar and still require different evidence. A cached result is useful only when the system can verify that the retrieval state behind it is compatible with the current request.
Cache layers worth separating
| Cache layer | What it reduces | Main risk |
|---|---|---|
| Fingerprint cache | Repeated equivalent work | Stale reuse |
| Semantic cache | Similar query handling | Context mismatch |
| Embedding cache | Repeated embedding calls | Input drift |
| Decomposition cache | Repeated planning | Wrong sub-query reuse |
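
As an illustration of the fingerprint layer and its stale-reuse risk, here is a minimal sketch. It assumes the fingerprint covers the normalized query, an index version string, and the `top_k` setting. The names `fingerprint`, `FingerprintCache`, and `index_version` are hypothetical and not taken from any specific implementation.

```python
import hashlib
import json
from typing import Dict, List, Optional


def fingerprint(query: str, index_version: str, top_k: int) -> str:
    """Hash the inputs assumed to determine a retrieval result."""
    normalized = " ".join(query.lower().split())  # collapse case and whitespace
    payload = json.dumps(
        {"q": normalized, "index": index_version, "k": top_k},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FingerprintCache:
    """In-memory cache of retrieved document IDs, keyed by fingerprint."""

    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = {}

    def get(self, query: str, index_version: str, top_k: int) -> Optional[List[str]]:
        return self._store.get(fingerprint(query, index_version, top_k))

    def put(self, query: str, index_version: str, top_k: int, doc_ids: List[str]) -> None:
        self._store[fingerprint(query, index_version, top_k)] = list(doc_ids)


cache = FingerprintCache()
cache.put("What is the refund policy?", "idx-v1", 5, ["doc-12", "doc-40"])

# Equivalent work against the same index version is reused.
same_index = cache.get("what is  the refund policy?", "idx-v1", 5)
# A new index version changes the fingerprint, so the old entry is not reused.
new_index = cache.get("What is the refund policy?", "idx-v2", 5)
```

Because the index version is part of the key, re-indexing produces misses rather than stale hits. The trade-off is a cold cache after every index update.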
Latency rule
Every millisecond saved must preserve the grounding path. Fast ungrounded answers are not a win.
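
One way to read this rule in code is to store the evidence alongside the cached answer and refuse any hit whose evidence can no longer be located. The sketch below assumes that grounding can be checked as set membership of evidence IDs in the live index. `CachedAnswer` and `serve_from_cache` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class CachedAnswer:
    text: str
    evidence_ids: Tuple[str, ...]  # documents the answer was grounded on


def serve_from_cache(
    entry: Optional[CachedAnswer], live_doc_ids: Set[str]
) -> Optional[CachedAnswer]:
    """Return the cached answer only if its grounding path is still intact."""
    if entry is None or not entry.evidence_ids:
        return None  # nothing cached, or an answer with no evidence attached
    if not set(entry.evidence_ids) <= live_doc_ids:
        return None  # some evidence is gone: fall back to fresh retrieval
    return entry


live = {"doc-12", "doc-40", "doc-77"}
grounded = CachedAnswer("Refunds are issued within 14 days.", ("doc-12", "doc-40"))
orphaned = CachedAnswer("Refunds are issued within 30 days.", ("doc-99",))

hit = serve_from_cache(grounded, live)
miss = serve_from_cache(orphaned, live)
```

A `None` result means the caller takes the slower retrieval path. That cost is intentional: the hit saves time only when the evidence is still available.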
Context isolation
Context isolation is the key constraint. The cache must not treat two prompts as equivalent when the surrounding conversation changes the referent, source, or evidence requirement.
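
A minimal sketch of this constraint is to build the cache key from the resolved retrieval context instead of the raw prompt. It assumes an upstream step has already resolved the referent and determined the source scope and evidence requirement. `RetrievalContext` and its fields are hypothetical names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievalContext:
    resolved_query: str        # prompt after referents are resolved
    source_scope: str          # which corpus or source the answer must come from
    evidence_requirement: str  # e.g. "policy-document" or "any"


def context_key(ctx: RetrievalContext) -> str:
    """Derive a cache key from the full context, not from the prompt text alone."""
    payload = json.dumps(asdict(ctx), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same raw prompt, "How much does it cost?", asked in two conversations.
about_plan_a = RetrievalContext("How much does Plan A cost?", "pricing-docs", "any")
about_plan_b = RetrievalContext("How much does Plan B cost?", "pricing-docs", "any")

key_a = context_key(about_plan_a)
key_b = context_key(about_plan_b)
```

The two keys differ even though the user typed identical text, so an answer about one plan cannot be served for the other. A stricter evidence requirement or a different source scope changes the key in the same way.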
What not to claim
We do not publish cache hit-rate claims or latency numbers without measurement. The public point is the implementation boundary: caching exists to reduce repeated work, not to bypass evidence selection.


