A doctor who's never seen your rare disease can still diagnose it from its symptoms: they recognize a pattern beyond their direct experience. An interpolator would instead guess statistically from the most similar known cases, and often get it wrong.
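The same contrast shows up in a toy regression. A purely interpolating learner answers every query by blending its most similar known cases; the sketch below (the dataset, the true function, and the nearest-neighbor model are all illustrative choices, not anyone's specific system) works inside the training range and fails confidently beyond it.

```python
# Toy contrast between interpolation and extrapolation:
# a nearest-neighbor model can only blend cases it has already seen.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 5.0, size=(200, 1))   # "known cases": x in [0, 5]
y_train = (2.0 * x_train + 1.0).ravel()          # true pattern: y = 2x + 1

knn = KNeighborsRegressor(n_neighbors=5).fit(x_train, y_train)

for x in [2.5, 10.0]:                            # in-range vs. far out-of-range query
    pred = knn.predict([[x]])[0]
    true = 2.0 * x + 1.0
    print(f"x={x:5.1f}  predicted={pred:6.2f}  true={true:6.2f}")

# In range (x=2.5) the statistical guess is nearly exact. Out of range
# (x=10) the model can only echo its most similar known cases near x=5,
# so it answers roughly 11 instead of 21: a confident, systematically
# wrong guess.
```

The model never learns the rule y = 2x + 1; it only averages neighbors, which is exactly why its answer degrades the moment a query leaves the territory of its known cases.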
You ask three colleagues for input before an important decision. You read multiple newspapers to get a balanced view. You request a second medical opinion. But when you query an AI, you ask a single model and treat its output as fact.
You put the same question to GPT-4, Claude, and Gemini. GPT-4 answers A. Claude answers B. Gemini answers C. All three sound credible. Which one is right, or are all three wrong?
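The practical move this scenario suggests is a cross-check: ask every model the same question and inspect the spread of answers instead of trusting one output. Here is a minimal sketch, assuming the official Python SDKs (openai, anthropic, google-generativeai) and model identifiers that were current at the time of writing; both drift across SDK versions, and the sample question and exact-match vote are illustrative simplifications.

```python
# Cross-examining one question across three models and flagging
# disagreement instead of trusting a single output.
# Assumes API keys in OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY.
import os
from collections import Counter

from openai import OpenAI
import anthropic
import google.generativeai as genai

QUESTION = "In which year was the Riemann hypothesis first stated?"

def ask_gpt(question: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip()

def ask_claude(question: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-3-opus-20240229",  # model IDs change; adjust as needed
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text.strip()

def ask_gemini(question: str) -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-pro")
    return model.generate_content(question).text.strip()

answers = {
    "GPT-4": ask_gpt(QUESTION),
    "Claude": ask_claude(QUESTION),
    "Gemini": ask_gemini(QUESTION),
}

for name, answer in answers.items():
    print(f"{name}: {answer}")

# Crude consensus check: identical (case-folded) answers count as agreement.
votes = Counter(a.casefold() for a in answers.values())
top, count = votes.most_common(1)[0]
if count == len(answers):
    print("All models agree -- still not proof, but a stronger signal.")
elif count > 1:
    print(f"Majority answer ({count}/{len(answers)}): {top}")
else:
    print("Three different answers -- do not treat any of them as fact.")
```

Exact string comparison is a blunt instrument for free-form text; a real pipeline would normalize the answers or extract the key claim first. But even this crude vote makes the disagreement visible, which is the point.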
GPT-4 is more accurate than GPT-3. Claude Opus outperforms Claude Sonnet. Gemini Ultra achieves better results than Gemini Pro. Scaling works on average.
January 2024. A research team published neither a new benchmark nor a method that trims hallucinations by another X%. They published a mathematical proof: LLMs used as general-purpose problem solvers will always hallucinate, regardless of model size, training quality, or data volume.
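The description fits Xu, Jain and Kankanhalli, "Hallucination is Inevitable: An Innate Limitation of Large Language Models" (arXiv:2401.11817, January 2024). The core of their argument is a classic diagonalization; the sketch below is a simplified reconstruction under that assumption, not the paper's exact formalism, which is considerably more careful about enumeration and halting.

```latex
% Simplified reconstruction of the diagonalization argument; the
% paper's formal setup is richer (formal worlds, training procedures).
Let $\mathcal{S}$ be the countable set of all finite input strings and
model an LLM as a computable function $h\colon \mathcal{S}\to\mathcal{S}$.
Fix a ground-truth function $f\colon \mathcal{S}\to\mathcal{S}$; the model
$h$ \emph{hallucinates} on input $s$ iff $h(s)\neq f(s)$.

Enumerate the candidate LLMs as programs $h_0,h_1,h_2,\dots$ and the
inputs as $s_0,s_1,s_2,\dots$. Construct the ground truth diagonally by
choosing, for every $i$,
\[
  f(s_i) \neq h_i(s_i).
\]
Then each $h_i$ disagrees with this ground truth somewhere, i.e.\ every
enumerated LLM hallucinates on at least one input; interleaving the
enumeration strengthens this to infinitely many inputs per model.
Nothing in the argument mentions parameters, data, or training: any
LLM, being a computable function, appears somewhere in the list.
```

This is why the result survives scaling: a larger model is still a computable function, so it still has a slot on the diagonal.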