2026-04-29 · 9 min · Aurolabs

Six AIs, one question: which ones earn the word 'research'?

We asked the same question to Google AI Overview, Google AI Mode, Perplexity, ChatGPT 5.5 Thinking, ChatGPT Agent, and our own engine. Then we audited our own answer against its sources — and found one mistake. Here's what we learned about the gap between an answer and a defensible answer.

The question

Deep research: Are vibe-coding companies sustainable without continuous VC funding?

We picked it because it's the kind of question we hear from buyers — partners at growth funds, corporate strategists, founders thinking about category economics. There's a real answer in the data. There's also a lot of motivated reasoning in the discourse. A good research engine should separate the two.

We typed the exact same question into six AI systems on the same afternoon. Free tiers, where they exist — that's what most readers use. We saved every answer, took screenshots, and then did something none of these systems do for you: we re-fetched each cited source URL in our own report and verified that the specific numbers actually appeared there.

What we found is a useful map of where the deep research label actually means something — and where it's marketing.


The lineup

System | Mode | Latency | Stack
Google AI Overview | Embedded snippet in regular search | 3 sec | Gemini
Google AI Mode | Conversational sidebar | 5–15 sec | Gemini
Perplexity (free) | Direct-answer + sources | 10–30 sec | Multi-model
ChatGPT 5.5 Thinking | Visible thinking + answer | 10 sec | GPT-5.5
ChatGPT Agent | Long-running tool use | several minutes | GPT + browsing
Quanto (deep tier) | Adversarial cross-vendor loop | 37 minutes | Claude + GPT-5.4 director

The first three are search-class: built to answer in seconds. They're great for "what year did Lovable launch?" — bad for anything where being wrong is expensive.

ChatGPT 5.5 Thinking is reasoning-class: polished visible thinking, but no real-world tool use beyond what's in the model.

ChatGPT Agent and Quanto are the only two research-class systems in the lineup — both run for minutes, both browse the live web, both build something longer than a chat reply. Comparing those two is where the interesting fight lives.


What the quick answers said

Google AI Overview hedged: vibe-coding companies are "generally not sustainable in the long term without continuous VC funding or rapid transition to a conventional SaaS model." It cites Medium essays, Baytech Consulting blog posts, and AIM Media House — content marketing, not primary sources. Zero company names with revenue numbers.

Google AI Mode went further. It cited specific companies (Lovable, Cursor, Anysphere, Wix-acquired Base44) with specific numbers: "Lovable reportedly reached $100M ARR within eight months and $400M ARR by early 2026, amassing 30,000 paying customers while keeping costs as low as $2 million." Concrete — and partly wrong, as we'll see.

Perplexity had something neither Google variant did: an admission that the answer depends on context. It described "scenarios A/B/C" rather than a single take. But it named zero companies. Zero ARRs. Zero specific numbers. A framework without evidence is a lecture, not research.

"The sustainability of vibe-coding companies without ongoing VC funding is uncertain and highly context-dependent, but there are plausible paths if they achieve strong product-market fit, high gross margins, and scalable unit economics."

— Perplexity

That's true the way "exercise is good" is true. It's also useless if you're sitting in an investment committee.


What the long-form answers said

ChatGPT 5.5 Thinking opened with the most quotable line of the day:

"Vibe-coding demand is real. Vibe-coding valuations are not automatically real."

It then named Cursor at "over $500M ARR" — twelve months out of date, as Cursor crossed $2B ARR in March 2026. It cited a cost figure of $13 per active developer per day. It built a clean narrative around five archetypes that survive ("developer control layer," "enterprise app factory," "vertical workflow builder," "model/runtime advantage," "distribution lock-in"). It didn't cite a single primary corporate filing.

ChatGPT Agent went deeper. It read like a journal article — abstract, three core questions, sub-sections on revenue, costs, and long-term outlook. It cited Reuters reporting on negative gross margins at Cursor and Windsurf. It pulled in IEA data on data-center electricity, Microsoft's 2025 Sustainability Report on water use, a Failing Fast cost analysis, a Barrack AI cost breakdown, a Palo Alto Networks security analysis. It wove in the environmental cost angle no other system raised. It concluded:

"Evidence from multiple sources shows that vibe-coding companies are not currently sustainable without continuous VC funding... only startups with the capital to build proprietary models and invest in governance layers may survive."

This is the answer that comes closest to ours. It also doesn't grade its sources, doesn't track confidence per claim, doesn't enumerate blind spots, and never explicitly says "here is what we couldn't verify."


What our engine produced

Read the full Quanto report →

A 16,000-word brief with 165 cited sources, four calibrated probability scenarios summing to 100%, an A/B/C source-quality audit, an internal-consistency verification section, a list of blind spots, and a single self-aware sentence we want to highlight:

"No A1 source corroborates the aggregate; this remains the report's single weakest central claim."

That sentence is the difference. Quanto knows which of its own claims are weak and tells you. None of the other five systems do.

The bottom line, disaggregated:

Each band is paired with a Sherman Kent calibration term. Each scenario has named triggers and cascade chains. The whole thing reads like an intelligence brief — because that's what the engine was built for.
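To make the calibration idea concrete, here is a minimal sketch of mapping a probability to a Sherman Kent estimative term. The band boundaries below are an approximation of Kent's classic scheme; the exact bands and labels Quanto uses are assumptions here, not reproduced from the report.

```python
# Approximate Sherman Kent words of estimative probability.
# Band boundaries are illustrative, not Quanto's actual scheme.
KENT_BANDS = [
    (0.93, 1.00, "almost certain"),
    (0.75, 0.93, "probable"),
    (0.40, 0.75, "chances about even"),
    (0.07, 0.40, "probably not"),
    (0.00, 0.07, "almost certainly not"),
]

def kent_term(p: float) -> str:
    """Return the Kent calibration term for a probability in [0, 1]."""
    for lo, hi, term in KENT_BANDS:
        if lo <= p <= hi:
            return term
    raise ValueError(f"probability out of range: {p}")
```

So a scenario weighted at 25% would carry the label "probably not," while one at 80% reads "probable" — the point being that a reader never has to guess what a bare percentage is supposed to mean.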


The diffs that matter

We pulled out five concrete places where the answers materially diverge. Each one is the kind of thing that ends arguments in the room where decisions get made.

1. ChatGPT's Cursor ARR is twelve months stale.

ChatGPT 5.5 Thinking: "Cursor claimed over $500M ARR." Quanto: $2B ARR (March 2026, Bloomberg/TechCrunch triangulated). Anyone betting on a $500M-ARR company is betting on a different company than the $2B one.

2. Google's Lovable math doesn't reconcile.

Google AI Mode: "$400M ARR + 30,000 paying customers + costs as low as $2M." Bloomberg confirms the $400M ARR. The other two figures aren't on the cited Bloomberg page. $400M revenue with $2M cost is a 200× ratio that no real software company has ever sustained — Lovable's actual gross margins are not public, and Quanto correctly omits the fabricated cost claim.

3. Perplexity has zero company names.

Quanto names ten with disclosed ARR (Cursor, Claude Code, GitHub Copilot, Replit, Lovable, Cognition, Bolt, v0, Anysphere, Windsurf). The difference between "vibe-coding companies" and "Cursor at $2B ARR with negative 30% gross margin recovering to slight profit after Composer 2 launched in March 2026" is the difference between a sketch and a map.

4. Nobody but Quanto flagged Composer 2.

Cursor's proprietary model launched in March 2026 at roughly $0.50 per million input tokens — about ten times cheaper than Claude Opus 4.6. It moved Cursor's gross margin from "catastrophic" to "slightly positive" for the first time. None of the other five systems mentioned it. It's the most important change in the category in the last year.

5. Source quality is not graded anywhere else.

Quanto tags every claim A1 (primary disclosure: SEC filings, court records, corporate earnings), A2 (top-tier press: Bloomberg, Reuters, FT), B (analyst databases: Sacra, Contrary), or C (tertiary blogs). Google AI Overview cites Medium and Baytech Consulting — that's C-grade content marketing in our taxonomy. ChatGPT cites SaaStr (B). Quanto leans on Microsoft FY26 Q2 earnings (A1), Stack Overflow's 2025 Developer Survey of 49,000+ respondents (A1), and JetBrains' April 2026 workplace research (A1).

When the wrong answer is expensive, A-grade beats C-grade.
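The grading pass itself can be sketched in a few lines. The domain lists below are illustrative stand-ins (Quanto's real taxonomy is not public), and real grading would need per-page judgment, not just the domain — but the structure is this simple:

```python
from urllib.parse import urlparse

# Illustrative domain lists only — the actual taxonomy is an assumption.
A1_DOMAINS = {"sec.gov", "microsoft.com", "jetbrains.com"}      # primary disclosures
A2_DOMAINS = {"bloomberg.com", "reuters.com", "ft.com"}         # top-tier press
B_DOMAINS = {"sacra.com", "contrary.com", "saastr.com"}         # analyst databases

def grade_source(url: str) -> str:
    """Grade a cited URL: A1 primary, A2 top-tier press, B analyst, C tertiary."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in A1_DOMAINS:
        return "A1"
    if host in A2_DOMAINS:
        return "A2"
    if host in B_DOMAINS:
        return "B"
    return "C"  # default bucket: blogs, content marketing, unknown domains
```

Under this scheme a Bloomberg link grades A2 and a Medium post falls through to C — which is exactly the split the comparison above turns on.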


Then we audited our own report

This is the part that matters most.

After publishing the Quanto report, we ran a separate verification pass: re-fetch every cited URL, then ask a fact-checking model whether the specific numbers in the claim actually appear in the source text.

We caught one mistake. Our report originally said developer trust in AI accuracy "fell from ~43% (2024) to 33% in 2025." The 33%, 46% distrust, and 3% high-trust figures are correct — they're in the Stack Overflow 2025 Developer Survey (n=49,000+). But the 43% baseline isn't in either cited source. Stack Overflow's blog post on the same survey actually reports a different metric: 40% (2024) → 29% (2025), an eleven-point drop.

The writer model conflated two different "trust" metrics across the survey page and the blog post. The director-review loop didn't catch it, because the director reviews output quality — structure, calibration, logic — not whether each citation actually says what's claimed.

We corrected the report in place. We also built the audit pass into the engine itself. Going forward, every Quanto report runs through a source-fidelity check before final emit. The audit re-fetches each URL, parses out the relevant text, and asks: does the specific number in the claim appear here, verbatim or near-verbatim? Each claim gets one of four verdicts: VERIFIED, PARTIAL, UNVERIFIED, CONTRADICTED. The audit cost is a few cents per report and adds about three minutes to the wall-clock. Worth it.
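A stripped-down sketch of that audit step: fetch the page, extract its visible text, and check whether each claimed figure appears. In the real pipeline a fact-checking model adjudicates near-miss and CONTRADICTED cases; the helpers below are illustrative assumptions, not the production code.

```python
import urllib.request
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible page text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def page_text(url: str) -> str:
    """Re-fetch a cited URL and return its visible text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def verdict(claimed_figures: list[str], source_text: str) -> str:
    """VERIFIED if every figure appears verbatim, PARTIAL if some do,
    UNVERIFIED if none. (CONTRADICTED needs the fact-checking model,
    which is out of scope for this sketch.)"""
    found = sum(1 for fig in claimed_figures if fig in source_text)
    if found == len(claimed_figures):
        return "VERIFIED"
    return "PARTIAL" if found else "UNVERIFIED"
```

Run against the trust-metric mistake described above, a claim citing both "33%" and "43%" against a source that only contains "33%" comes back PARTIAL — exactly the kind of flag that would have caught the conflated baseline before publication.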

This is now the only research engine we know of that systematically verifies its own cited sources before delivery.

We thought about hiding the mistake. We decided against it. If we want buyers to trust output that costs hundreds of euros, we have to behave like the kind of analyst that would tell them about a flagged claim before they read it in a footnote later. That's the standard.


What this changes for you

The cost of an answer is not just the price tag.

When a buyer pays €20 for ChatGPT Plus and runs the same question we ran, they get an answer in 10 seconds. Then they spend their own time:

That's three to four hours of a senior person's time per question. At any reasonable internal billing rate, that's well over €1,000 in human cost — on top of the €20 SaaS bill — to do work the AI didn't actually finish.

A Quanto report at the deep tier is €199. It comes with sources graded A/B/C, claims tagged High/Medium/Low confidence, probabilities in calibrated bands, blind spots enumerated, internal contradictions reconciled, and now a source-fidelity audit attached. Every citation has been re-fetched and verified before you read it.

You're not buying a faster chat. You're buying the four hours back.


When this isn't for you

To be clear: if your question is "what was the box office for the latest Marvel film," you don't need us. Use Google.

If your question is "how does our team handle this Slack thread," you don't need us. Use Microsoft Copilot.

If your question is "what's the unit economics of vibe-coding companies after Composer 2 changed the gross-margin picture, and what's the probability one of them survives independently to 2030?" — that's the kind of question the engine was built for.

You'll know when you have one.


Try it on your own question

Ask Quanto a question →

Read the full report we ran for this post →

We rebuilt our own audit loop because of this comparison. Every report from this point forward gets the same treatment we just gave ours.

Try Quanto on a question that matters.
Start a brief →