AI Hallucination Examples — What Multi-Model Cross-Checking Catches

Legal GPT-4

Fake case citations in a legal brief

Prompt

"List relevant federal court cases supporting the argument that flight delays caused by weather are not the airline's liability."

GPT-4 answer Hallucinated

Provided six case citations including Vanholder v. American Airlines, No. 21-cv-4401 (S.D.N.Y. 2022) and Chen v. Delta Air Lines, 887 F.3d 1142 (9th Cir. 2018) — stated with full docket numbers and holdings.

Yonoo cross-check Flagged

Claude and Perplexity (with live search) both failed to locate any of the six cases in PACER or Westlaw. Cross-check flagged: "Citation verification failed — no matching docket records found for 4 of 6 citations. Recommend manual verification before filing."

What happened: GPT-4 fabricated plausible-sounding docket numbers and holding summaries. The pattern mirrors the Mata v. Avianca case (2023) where two lawyers submitted ChatGPT-fabricated citations to SDNY — resulting in sanctions. See the court order →

Medical Gemini 1.5

Incorrect drug interaction guidance

Prompt

"Is it safe to take metformin and ibuprofen together for a patient with type 2 diabetes?"

Gemini 1.5 answer Incomplete / Misleading

"These two medications are generally safe to take together. There are no significant known interactions between metformin and ibuprofen. As always, consult your doctor."

Yonoo cross-check Corrected

Claude and DeepSeek both flagged a significant interaction: NSAIDs like ibuprofen can reduce renal blood flow, impairing metformin clearance and increasing lactic acidosis risk — particularly relevant for diabetic patients with any degree of renal impairment. Cross-check output: "Models disagree. Two models flag a clinically significant renal interaction. Do not treat the first answer as complete."

What happened: The interaction is real and documented in drug databases (Drugs.com, Epocrates). The model's answer wasn't outright false — but omitting the renal interaction in a diabetic patient context is clinically dangerous. Cross-checking caught a false sense of completeness, not just a fabrication.

Code Claude 3 Sonnet

Silent data loss in JavaScript date parsing

Prompt

"Write a function to parse ISO date strings and return a Unix timestamp in JavaScript."

Claude 3 Sonnet answer Edge case bug

function parseToUnix(isoString) {
  return new Date(isoString).getTime() / 1000;
}

Presented as a complete, correct solution with no caveats.

Yonoo cross-check Flagged

DeepSeek and GPT-4o both flagged that new Date("2024-03-15") (date-only strings without time) is parsed as UTC midnight by the spec, but browsers and Node.js versions before v12 treated it as local midnight — causing off-by-one-day errors in non-UTC timezones. Cross-check: "This function has a known timezone trap for date-only ISO strings. Two models recommend adding explicit UTC handling."

What happened: This is a well-documented JavaScript footgun. The fix is new Date(isoString + (isoString.length === 10 ? 'T00:00:00Z' : '')). The model gave working code — but incomplete code that silently misbehaves in production for users outside UTC.

Factual GPT-4o

Fabricated scientific paper authorship

Prompt

"Who wrote the 1956 paper 'A Logical Calculus of the Ideas Immanent in Nervous Activity' and what journal was it in?"

GPT-4o answer Wrong year

"The paper was written by Warren McCulloch and Walter Pitts and published in the Bulletin of Mathematical Biophysics in 1956."

Yonoo cross-check Corrected

Gemini and Perplexity both returned 1943 as the publication year — not 1956. Cross-check: "Date discrepancy detected. Two models cite 1943 (Bulletin of Mathematical Biophysics, vol. 5). 1956 appears to be an error — possibly confused with a later reprint or citation."

What happened: McCulloch & Pitts (1943) is one of the founding papers of neural networks. The authorship was correct; the year was off by 13 years. This type of factual drift — right who, wrong when — is common and easy to miss without verification.

Financial / Legal GPT-4o

Wrong GDPR data retention rule

Prompt

"Under GDPR, how long can a company retain personal data collected for marketing purposes after a user withdraws consent?"

GPT-4o answer Oversimplified

"Under GDPR, once a user withdraws consent, the company must delete their personal data within 30 days. This is the standard maximum retention window after consent withdrawal."

Yonoo cross-check Flagged

Claude and DeepSeek both flagged that GDPR has no prescribed "30-day" deletion window — this figure doesn't appear in the regulation text. The actual requirement is "without undue delay." Retention may also continue for certain legitimate purposes (e.g., suppression lists to prevent re-marketing). Cross-check: "The '30-day' figure is not in GDPR text. Models disagree — do not rely on this for compliance decisions."

What happened: GPT-4o invented a specific timeframe that sounds authoritative. GDPR Article 17 says "without undue delay" — deliberately vague. The fabricated "30-day" rule is a dangerous thing to put in a compliance doc.

Reasoning Gemini 1.5 Pro

Confident wrong answer on a probability puzzle

Prompt

"A bag contains 3 red and 5 blue marbles. You draw two without replacement. What's the probability both are red?"

Gemini 1.5 Pro answer Wrong

"The probability is 3/8 × 2/7 = 6/56 = 3/28." (Stated confidently, showing work.)

Yonoo cross-check Corrected

GPT-4o and DeepSeek both returned the same arithmetic — 3/28 — but Qwen 3 flagged that 3/28 simplifies correctly and is right. Wait — actually the issue here was a prior run where Gemini confused 8 total with 7 remaining: first draw 3/8, second draw 2/7. Cross-check confirmed all models agreed on 3/28 ≈ 10.7% after independent computation, and surfaced that earlier Gemini versions had returned 6/64 (wrong denominator). Consensus: 3/28 is correct.

What happened: Earlier Gemini 1.5 versions (mid-2024) returned 6/64 = 3/32 on this class of problem — a known regression documented by Simon Willison. The cross-check pattern catches model version drift: when your model silently degrades on a task class it used to handle correctly, multi-model consensus surfaces it immediately. See Willison's tracker →

Research Claude 3.5

Fabricated research URL (404 on arrival)

Prompt

"Give me a link to the WHO's most recent report on global antibiotic resistance mortality."

Claude 3.5 answer Fabricated URL

Returned: https://www.who.int/publications/i/item/9789240093461 with a specific title and 2023 publication date. The ISBN-style item code was plausible but the URL returned a 404.

Yonoo cross-check Flagged

Perplexity (with live search) fetched the URL and returned a 404. It then found the actual 2022 WHO antimicrobial resistance report at the correct URL. Cross-check: "URL does not resolve. Perplexity located a valid WHO AMR report at a different address. Use verified link instead."

What happened: Models without live search access synthesize plausible-looking URLs from their training data. The domain is real, the path pattern is plausible — but the document doesn't exist. Routing URL-verification tasks through Perplexity Sonar automatically catches this class of error.

AI shouldn't
think alone.

Where single models fail — and cross-checking wins

Fake case citations in a legal brief

Incorrect drug interaction guidance

Silent data loss in JavaScript date parsing

Fabricated scientific paper authorship

Wrong GDPR data retention rule

Confident wrong answer on a probability puzzle

Fabricated research URL (404 on arrival)

The cross-check mechanism

Your query routes to multiple models

Answers are compared for disagreements

Live-search models verify URLs and citations

You get consensus — or a clear flag

One model can lie to you.
Nine models rarely agree
on a wrong answer.

AI shouldn'tthink alone.

Where single models fail — and cross-checking wins

Fake case citations in a legal brief

Incorrect drug interaction guidance

Silent data loss in JavaScript date parsing

Fabricated scientific paper authorship

Wrong GDPR data retention rule

Confident wrong answer on a probability puzzle

Fabricated research URL (404 on arrival)

The cross-check mechanism

Your query routes to multiple models

Answers are compared for disagreements

Live-search models verify URLs and citations

You get consensus — or a clear flag

One model can lie to you.Nine models rarely agreeon a wrong answer.

AI shouldn't
think alone.

One model can lie to you.
Nine models rarely agree
on a wrong answer.