Real briefs with calibrated probabilities, graded sources, and source-fidelity audits.
The decision in front of you — actor, stakes, time horizon, what's at risk if it's wrong. A paragraph, not a search term. The more specific, the sharper the audit trail.
One agent writes the draft, citing live sources. A second agent — different vendor, separate model family — sits as director and attacks the conclusions. Iterations continue until the brief holds and every cited URL has been re-verified against the actual claim. Single-vendor tools can't do this.
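A minimal sketch of that loop, assuming a hypothetical interface where `drafter_llm` and `director_llm` wrap clients from two different vendors and `verify_citation` re-fetches a URL and checks it still supports the claim attributed to it. The real engine, prompts, and stopping rules are not shown here.

```python
# Sketch of the cross-vendor drafter/director loop (illustrative only).
# `drafter_llm`, `director_llm`, and `verify_citation` are hypothetical
# interfaces, not a published API.

MAX_ROUNDS = 5

def run_brief(question: str, drafter_llm, director_llm, verify_citation) -> dict:
    brief = drafter_llm.draft(question)              # first pass, cites live sources
    for _ in range(MAX_ROUNDS):
        objections = director_llm.attack(brief)      # separate model family attacks the conclusions
        stale = [c for c in brief["citations"]
                 if not verify_citation(c["url"], c["claim"])]
        if not objections and not stale:             # brief holds, every cited URL re-verified
            return brief
        brief = drafter_llm.revise(brief, objections, stale)
    brief["status"] = "unresolved"                   # surfaced rather than silently shipped
    return brief
```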
Sources graded A/B/C. Claims marked High/Medium/Low confidence. Probabilities in calibrated bands. Counter-arguments tested. Blind spots listed. A source-fidelity audit verifying every numeric claim against its primary source. The same evidence shape an AI-decision insurer prices against.
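As an illustration of that evidence shape, here is a hypothetical brief record; the field names are ours for this sketch, not a published schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of one claim inside a brief (illustrative, not a published schema).
@dataclass
class Claim:
    statement: str
    confidence: str                  # "High" | "Medium" | "Low"
    probability_band: str            # e.g. "Probable (63-87%)"
    sources: list[dict] = field(default_factory=list)   # each: {"url": ..., "grade": "A"|"B"|"C"}
    counter_arguments: list[str] = field(default_factory=list)

@dataclass
class Brief:
    question: str
    claims: list[Claim]
    blind_spots: list[str]
    fidelity_audit: dict             # numeric claim -> result of the primary-source check
```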
A — primary: SEC filings, court records, peer-reviewed papers, official statistics, named regulators.
B — secondary credible: FT, Bloomberg, Reuters, established think tanks, named domain experts.
C — tertiary: blogs, single-source claims, opinion pieces, social media.
H — three or more A-grade sources agree, no contradicting evidence.
M — at least one A or two B-grade sources, no significant contradiction.
L — single source, contradicting evidence, or inferred indirectly.
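The grading rules above reduce to a simple check. A sketch, assuming each source carries an A/B/C grade and the analyst (or audit step) flags contradicting evidence; names are illustrative.

```python
# Map graded sources to a High/Medium/Low confidence label per the rules above.
# `grades` is a list like ["A", "A", "B"]; the two flags record whether any
# contradicting evidence exists and whether it is significant.

def confidence(grades: list[str], contradicted: bool, significant_contradiction: bool) -> str:
    a = grades.count("A")
    b = grades.count("B")
    if a >= 3 and not contradicted:
        return "High"      # three or more A-grade sources agree, nothing contradicts
    if (a >= 1 or b >= 2) and not significant_contradiction:
        return "Medium"    # at least one A or two B-grade sources, no significant contradiction
    return "Low"           # single source, contradicting evidence, or inferred indirectly
```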
Words map to ranges, not feelings. Almost Certain = 87–99%. Probable = 63–87%. Chances About Even = 40–60%. Probably Not = 13–37%. Almost Certainly Not = 1–13%. From intelligence-analysis tradition — every band auditable.
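A literal rendering of that mapping, useful for checking that a stated probability sits inside its band; the band edges are copied from the list above.

```python
# Estimative-language bands used in briefs (lower/upper bounds in percent).
BANDS = {
    "Almost Certain":       (87, 99),
    "Probable":             (63, 87),
    "Chances About Even":   (40, 60),
    "Probably Not":         (13, 37),
    "Almost Certainly Not": (1, 13),
}

def in_band(label: str, probability_pct: float) -> bool:
    lo, hi = BANDS[label]
    return lo <= probability_pct <= hi

assert in_band("Probable", 70)   # a 70% estimate sits inside the 63-87% band
```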
Shallow, deep, or hyper — depending on how much the wrong answer costs. Source-fidelity audit on every brief. One-off. Fits a real call you have right now.
Start a question →

For investment committees, GC offices, and AI-risk leads running recurring decision-grade research. Every brief maps to AIUC-1, ISO 42001, NIST AI RMF, and EU AI Act Annex IV. Priority queue. Weekly digest across your active questions.
For carriers / MGAs writing affirmative AI cover, tier-1 banks with model-risk teams, Big-Law AI practices reselling to clients, and PE funds with multiple committees. Custom templates, per-policy evidence packs.
Contact →

Decision-grade research has to be verifiable. We participate in the public DeepResearch Bench (DRB) protocol so our claims are independently scorable, not marketing-grade self-assessment.
| System | Score | Mode | Source |
|---|---|---|---|
| Xiaoyi DeepResearch | 57.00 | research-class | DRB-II live, Apr 2026 |
| NVIDIA AI-Q | #1 DRB-II | research-class | NVIDIA blog + HF, Feb 2026 |
| Gemini 2.5 Pro Deep Research | 49.44 | research-class | DRB v1 paper, mid-2025 |
| OpenAI Deep Research | 47.14 | research-class | DRB v1 paper, mid-2025 |
| Aurolabs Quanto deep tier | 47.91 preliminary, n=6 | cross-vendor adversarial | DRB-II partial, May 2026 |
| Perplexity Deep Research | 44.28 | research-class | DRB v1 paper, mid-2025 |
| Claude-3.7 Sonnet w/Search | 41.46 | chat-class | DRB v1 paper, mid-2025 |
Status: pilot paused at n=6 due to Anthropic Max-subscription rate limiting. Engine retry-with-backoff and a smarter template classifier were deployed 2026-04-30; the run is resuming. Final scores will be published on our blog when the run completes, alongside the methodology audit and per-task breakdown.