We added QA audit and benchmark tooling around RAG behavior. Accuracy claims require reproducible inputs, expected outputs, and documented failure cases.
This update does not publish benchmark results. It adds the harness needed to measure behavior without relying on subjective demos.
Evaluation dimensions
The harness is intended to cover:
- retrieval coverage,
- evidence relevance,
- answer grounding,
- follow-up stability,
- empty-context behavior,
- and failure transparency.
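
As a rough illustration of how two of these dimensions might be scored, here is a minimal Python sketch. It is not the harness's actual code: `EvalCase`, `retrieval_coverage`, and `evidence_relevance` are hypothetical names, the case data is made up, and the recall-style and precision-style ratios are assumed proxies for retrieval coverage and evidence relevance.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class EvalCase:
    """One reproducible test case (hypothetical schema, not the harness's own)."""
    case_id: str
    question: str
    expected_doc_ids: Set[str]      # documents retrieval should surface
    expected_answer: Optional[str]  # None: the system should decline to answer


def retrieval_coverage(expected: Set[str], retrieved: List[str]) -> float:
    """Share of expected documents that were retrieved (recall-style proxy)."""
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved)) / len(expected)


def evidence_relevance(expected: Set[str], retrieved: List[str]) -> float:
    """Share of retrieved documents that were expected (precision-style proxy)."""
    if not retrieved:
        return 0.0  # no evidence retrieved, so none of it can be relevant
    unique = set(retrieved)
    return len(expected & unique) / len(unique)


# Illustrative case only; the IDs and text are invented for this example.
case = EvalCase(
    case_id="refund-policy-01",
    question="What is the refund window?",
    expected_doc_ids={"doc-12", "doc-40"},
    expected_answer="30 days",
)
retrieved = ["doc-12", "doc-07"]

coverage = retrieval_coverage(case.expected_doc_ids, retrieved)
relevance = evidence_relevance(case.expected_doc_ids, retrieved)
print(case.case_id, coverage, relevance)
```

Setting `expected_answer` to `None` is one possible way to encode an empty-context case, where the correct behavior is to decline rather than answer. Grounding, follow-up stability, and failure transparency would need their own scoring rules and are left out of this sketch.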
No metric laundering
A benchmark number published without its methodology cannot be checked or reproduced, so we treat it as invalid. A useful result needs the task set, environment, expected answers, scoring rules, and failure notes.
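
One way to enforce this rule is to make the score travel with its methodology and to reject a report when any part is missing. The sketch below assumes a hypothetical `BenchmarkReport` record and `missing_methodology` helper; the field names mirror the list above, and every value is a placeholder, not a real result.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class BenchmarkReport:
    """A score bundled with its methodology (hypothetical schema)."""
    task_set: str
    environment: str
    expected_answers: str
    scoring_rules: str
    failure_notes: str
    score: float


def missing_methodology(report: BenchmarkReport) -> List[str]:
    """Return the names of methodology fields that are empty or whitespace."""
    return [
        f.name
        for f in fields(report)
        if f.name != "score" and not str(getattr(report, f.name)).strip()
    ]


# Placeholder values only; this is not a published result.
report = BenchmarkReport(
    task_set="cases/v1.jsonl",
    environment="local run, pinned dependencies",
    expected_answers="cases/v1.expected.jsonl",
    scoring_rules="exact match on normalized answer",
    failure_notes="",  # left empty on purpose to show the check firing
    score=0.0,
)

missing = missing_methodology(report)
if missing:
    print("Not publishable, missing:", ", ".join(missing))
```

Because the failure notes are empty here, the report is flagged and would not be published, whatever the score says.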
Why we added it
Agentic RAG can look good in a hand-picked conversation and still fail on repeatable cases. The harness exists so that the same cases can be re-run and checked the same way each time.


