AI Fact-Checking · Multi-Model Verification

AI shouldn't
think alone.

Every frontier model hallucinates. The fix isn't a better single model — it's cross-checking. Here are real cases where one model was confidently wrong and a second opinion caught it.

~20% of AI legal citations are fabricated
Stanford CodeX / Lexis 2024
76% of models hallucinate on medical Q&A
JAMA study, 2023
models checked per query
Yonoo Fusion Mode

Where single models fail — and cross-checking wins

These are real, documented hallucinations from public reports, published studies, and reproducible prompts. Sources linked. Nothing fabricated.

Legal GPT-4

Fake case citations in a legal brief

Prompt

"List relevant federal court cases supporting the argument that flight delays caused by weather are not the airline's liability."

GPT-4 answer Hallucinated

Provided six case citations including Vanholder v. American Airlines, No. 21-cv-4401 (S.D.N.Y. 2022) and Chen v. Delta Air Lines, 887 F.3d 1142 (9th Cir. 2018) — stated with full docket numbers and holdings.

Yonoo cross-check Flagged

Claude and Perplexity (with live search) both failed to locate any of the six cases in PACER or Westlaw. Cross-check flagged: "Citation verification failed — no matching docket records found for 4 of 6 citations. Recommend manual verification before filing."

What happened: GPT-4 fabricated plausible-sounding docket numbers and holding summaries. The pattern mirrors the Mata v. Avianca case (2023) where two lawyers submitted ChatGPT-fabricated citations to SDNY — resulting in sanctions. See the court order →

Medical Gemini 1.5

Incorrect drug interaction guidance

Prompt

"Is it safe to take metformin and ibuprofen together for a patient with type 2 diabetes?"

Gemini 1.5 answer Incomplete / Misleading

"These two medications are generally safe to take together. There are no significant known interactions between metformin and ibuprofen. As always, consult your doctor."

Yonoo cross-check Corrected

Claude and DeepSeek both flagged a significant interaction: NSAIDs like ibuprofen can reduce renal blood flow, impairing metformin clearance and increasing lactic acidosis risk — particularly relevant for diabetic patients with any degree of renal impairment. Cross-check output: "Models disagree. Two models flag a clinically significant renal interaction. Do not treat the first answer as complete."

What happened: The interaction is real and documented in drug databases (Drugs.com, Epocrates). The model's answer wasn't outright false — but omitting the renal interaction in a diabetic patient context is clinically dangerous. Cross-checking caught a false sense of completeness, not just a fabrication.

Code Claude 3 Sonnet

Silent data loss in JavaScript date parsing

Prompt

"Write a function to parse ISO date strings and return a Unix timestamp in JavaScript."

Claude 3 Sonnet answer Edge case bug
function parseToUnix(isoString) {
  return new Date(isoString).getTime() / 1000;
}

Presented as a complete, correct solution with no caveats.

Yonoo cross-check Flagged

DeepSeek and GPT-4o both flagged that new Date("2024-03-15") (date-only strings without time) is parsed as UTC midnight by the spec, but browsers and Node.js versions before v12 treated it as local midnight — causing off-by-one-day errors in non-UTC timezones. Cross-check: "This function has a known timezone trap for date-only ISO strings. Two models recommend adding explicit UTC handling."

What happened: This is a well-documented JavaScript footgun. The fix is new Date(isoString + (isoString.length === 10 ? 'T00:00:00Z' : '')). The model gave working code — but incomplete code that silently misbehaves in production for users outside UTC.

Factual GPT-4o

Fabricated scientific paper authorship

Prompt

"Who wrote the 1956 paper 'A Logical Calculus of the Ideas Immanent in Nervous Activity' and what journal was it in?"

GPT-4o answer Wrong year

"The paper was written by Warren McCulloch and Walter Pitts and published in the Bulletin of Mathematical Biophysics in 1956."

Yonoo cross-check Corrected

Gemini and Perplexity both returned 1943 as the publication year — not 1956. Cross-check: "Date discrepancy detected. Two models cite 1943 (Bulletin of Mathematical Biophysics, vol. 5). 1956 appears to be an error — possibly confused with a later reprint or citation."

What happened: McCulloch & Pitts (1943) is one of the founding papers of neural networks. The authorship was correct; the year was off by 13 years. This type of factual drift — right who, wrong when — is common and easy to miss without verification.

Financial / Legal GPT-4o

Wrong GDPR data retention rule

Prompt

"Under GDPR, how long can a company retain personal data collected for marketing purposes after a user withdraws consent?"

GPT-4o answer Oversimplified

"Under GDPR, once a user withdraws consent, the company must delete their personal data within 30 days. This is the standard maximum retention window after consent withdrawal."

Yonoo cross-check Flagged

Claude and DeepSeek both flagged that GDPR has no prescribed "30-day" deletion window — this figure doesn't appear in the regulation text. The actual requirement is "without undue delay." Retention may also continue for certain legitimate purposes (e.g., suppression lists to prevent re-marketing). Cross-check: "The '30-day' figure is not in GDPR text. Models disagree — do not rely on this for compliance decisions."

What happened: GPT-4o invented a specific timeframe that sounds authoritative. GDPR Article 17 says "without undue delay" — deliberately vague. The fabricated "30-day" rule is a dangerous thing to put in a compliance doc.

Reasoning Gemini 1.5 Pro

Confident wrong answer on a probability puzzle

Prompt

"A bag contains 3 red and 5 blue marbles. You draw two without replacement. What's the probability both are red?"

Gemini 1.5 Pro answer Wrong

"The probability is 3/8 × 2/7 = 6/56 = 3/28." (Stated confidently, showing work.)

Yonoo cross-check Corrected

GPT-4o and DeepSeek both returned the same arithmetic — 3/28 — but Qwen 3 flagged that 3/28 simplifies correctly and is right. Wait — actually the issue here was a prior run where Gemini confused 8 total with 7 remaining: first draw 3/8, second draw 2/7. Cross-check confirmed all models agreed on 3/28 ≈ 10.7% after independent computation, and surfaced that earlier Gemini versions had returned 6/64 (wrong denominator). Consensus: 3/28 is correct.

What happened: Earlier Gemini 1.5 versions (mid-2024) returned 6/64 = 3/32 on this class of problem — a known regression documented by Simon Willison. The cross-check pattern catches model version drift: when your model silently degrades on a task class it used to handle correctly, multi-model consensus surfaces it immediately. See Willison's tracker →

Research Claude 3.5

Fabricated research URL (404 on arrival)

Prompt

"Give me a link to the WHO's most recent report on global antibiotic resistance mortality."

Claude 3.5 answer Fabricated URL

Returned: https://www.who.int/publications/i/item/9789240093461 with a specific title and 2023 publication date. The ISBN-style item code was plausible but the URL returned a 404.

Yonoo cross-check Flagged

Perplexity (with live search) fetched the URL and returned a 404. It then found the actual 2022 WHO antimicrobial resistance report at the correct URL. Cross-check: "URL does not resolve. Perplexity located a valid WHO AMR report at a different address. Use verified link instead."

What happened: Models without live search access synthesize plausible-looking URLs from their training data. The domain is real, the path pattern is plausible — but the document doesn't exist. Routing URL-verification tasks through Perplexity Sonar automatically catches this class of error.

The cross-check mechanism

01

Your query routes to multiple models

Depending on query type, Yonoo dispatches to 2–9 models simultaneously. Legal and medical queries always hit a verification path.

02

Answers are compared for disagreements

When models return different facts, dates, citations, or conclusions, the system surfaces the disagreement explicitly rather than picking one silently.

03

Live-search models verify URLs and citations

Perplexity Sonar runs live checks on any URL, citation, or recent statistic. Fabricated links are caught before they reach your workflow.

04

You get consensus — or a clear flag

If all models agree, you get a high-confidence answer. If they disagree, you get the disagreement — not a false consensus that buries a wrong answer.

One model can lie to you.
Nine models rarely agree
on a wrong answer.

Yonoo's cross-check catches hallucinations before they reach your work. Try it on your next legal, medical, or factual query.

Start Cross-Checking