We built the evaluation cases around the failure modes that matter for Agentic RAG. A useful test goes beyond "can it answer a single question": it has to show what happens when retrieval and reasoning interact.
Case types
- Follow-up questions where prior context changes the retrieval target.
- Complex questions that should trigger decomposition.
- Direct questions that should avoid unnecessary decomposition.
- Weak-evidence cases where the system should avoid confident synthesis.
- Source-targeting cases where the router should avoid irrelevant corpora.
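
One way to make these case types checkable is to record the expected behavior on each case and compare it against a trace of what the system did. The sketch below is illustrative only: `EvalCase`, `Trace`, `check_case`, and every field name are hypothetical and not taken from the harness itself, and it assumes the system can report its retrieval queries, whether it decomposed the question, whether it abstained, and which sources it queried.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class CaseType(Enum):
    FOLLOW_UP = "follow_up"
    COMPLEX = "complex"
    DIRECT = "direct"
    WEAK_EVIDENCE = "weak_evidence"
    SOURCE_TARGETING = "source_targeting"


@dataclass
class EvalCase:
    """Hypothetical case record: only the expectations relevant to the type are set."""
    case_type: CaseType
    question: str
    history: List[str] = field(default_factory=list)          # prior turns
    expected_query_terms: Set[str] = field(default_factory=set)
    expect_decomposition: Optional[bool] = None                # None = not checked
    expect_abstain: bool = False
    forbidden_sources: Set[str] = field(default_factory=set)


@dataclass
class Trace:
    """Hypothetical summary of what the system did for one case."""
    retrieval_queries: List[str]
    decomposed: bool
    abstained: bool
    sources_queried: Set[str]


def check_case(case: EvalCase, trace: Trace) -> List[str]:
    """Return human-readable failure reasons; an empty list means the case passed."""
    failures = []
    queries = " ".join(trace.retrieval_queries).lower()
    missing = sorted(t for t in case.expected_query_terms if t.lower() not in queries)
    if missing:
        failures.append(f"retrieval ignored prior context: missing {missing}")
    if case.expect_decomposition is not None and trace.decomposed != case.expect_decomposition:
        failures.append(
            f"decomposition: expected {case.expect_decomposition}, got {trace.decomposed}"
        )
    if case.expect_abstain and not trace.abstained:
        failures.append("synthesized an answer despite weak evidence")
    leaked = sorted(case.forbidden_sources & trace.sources_queried)
    if leaked:
        failures.append(f"router queried irrelevant sources: {leaked}")
    return failures


# Example: a follow-up whose retrieval query dropped the prior context.
case = EvalCase(
    case_type=CaseType.FOLLOW_UP,
    question="What about its rate limits?",
    history=["How do I authenticate against the billing API?"],
    expected_query_terms={"billing"},
)
trace = Trace(
    retrieval_queries=["rate limits"],
    decomposed=False,
    abstained=False,
    sources_queried={"api_docs"},
)
print(check_case(case, trace))
```

Returning a list of reasons rather than a single boolean keeps each failure mode separately visible when one case fails in more than one way.
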
No benchmark claim
This is evaluation harness work, not a published accuracy result. Accuracy claims require ground truth, reproducible runs, and documented failures.
Engineering note
The evaluation favors visible failures over inflated pass rates. If evidence is missing, the expected behavior is to surface that condition rather than synthesize an answer anyway.
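
As a rough illustration of that preference, a run summary can list every non-passing case with its reason instead of collapsing the run into a single pass rate. The names here (`CaseResult`, `summarize`, and the three outcome labels) are assumptions made for this sketch, not the harness's actual reporting format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    case_id: str
    outcome: str       # assumed labels: "pass", "fail", or "missing_evidence"
    detail: str = ""   # reason shown in the report for non-passing cases


def summarize(results: List[CaseResult]) -> Dict[str, object]:
    """Count outcomes and keep every non-passing case visible with its reason."""
    counts = Counter(r.outcome for r in results)
    non_passing = [
        f"{r.case_id}: {r.outcome} ({r.detail})"
        for r in results
        if r.outcome != "pass"
    ]
    return {"counts": dict(counts), "non_passing": non_passing}
```

Keeping `missing_evidence` as its own outcome, rather than folding it into pass or fail, means a run with thin evidence cannot look cleaner than it is.
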


