CrossChat by SurveysAI
Pillar “Theoretical Concepts & Studies”

Self-Consistency: Why 20 Answers Beat One Best

How Self-Consistency improves LLM accuracy by aggregating multiple reasoning paths — and when it's worth using.

Ask a model to solve a math problem. You get an answer. Then ask it again many times (for example, twenty). Record the most frequent result. Accuracy can jump dramatically — not by changing the model, but by aggregating multiple attempts.

In Wang et al. (Self-Consistency, ICLR 2023), a Chain-of-Thought + greedy-decoding baseline on GSM8K scored 56.5%, while self-consistency (majority vote over multiple sampled reasoning paths) reached 74.4% on PaLM-540B (Table 2). The paper uses 40 sampled paths in the main results; “twenty” here is an intuitive mental model for “repeat-and-vote,” not the exact experimental setting.

Claims Framework

  • What this article claims: Generating multiple answers and selecting the most frequent result (majority voting) systematically outperforms a single attempt on factual and logical tasks. The improvement on GSM8K is +17.9 percentage points. The principle works analogously to wisdom-of-crowds and scientific replication.
  • What it is based on: Wang et al. (ICLR 2023) with specific results on GSM8K, SVAMP, AQuA, and StrategyQA; Galton's aggregation demonstration (1907); general statistical convergence principles under independent errors.
  • Where it simplifies: The article assumes error independence across samples, which may not hold for a single model (correlated hallucinations). The analogies to juries and scientific replication simplify the conditions under which aggregation actually works. The cost analysis (20× price) is simplified: in practice it depends on response length and model pricing.

Intuition says: the best answer is the most carefully considered one. One good attempt beats twenty average ones. That's true for human experts: a neurosurgeon operates once, a structural engineer designs one bridge. For LLMs, this intuition fails.

Wang et al. (ICLR 2023) demonstrated that generating many different reasoning paths and selecting the most frequent result systematically outperforms a single best-of-1 attempt — by 17.9 percentage points on the mathematical benchmark GSM8K. The technique is called Self-Consistency (SC), and its principle is surprisingly simple.

The Self-Consistency Mechanism: How Majority Voting Beats Best-of-1

Self-Consistency operates on the principle of aggregating independent reasoning paths. The more paths lead to the same result, the higher the probability that result is correct.

The procedure is concrete and replicable. Step 1: Generate N answers to the same question with non-zero temperature (temperature > 0 ensures variability — each attempt takes a slightly different path). Step 2: Each answer traverses its own reasoning chain to a result. Different paths may lead to the same or different results. Step 3: Tally the results. The result appearing most frequently is the final answer (majority voting). No weighting, no selection of the "best" answer — pure aggregation.
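The three steps can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: `sample` stands in for whatever single LLM call with temperature > 0 your stack provides, returning the extracted final answer.

```python
from collections import Counter
from typing import Callable

def self_consistency(sample: Callable[[str], str], question: str, n: int = 20) -> str:
    """Steps 1-3: draw n sampled answers, tally them, return the most frequent.

    `sample` is a stand-in for one LLM call with temperature > 0 plus
    final-answer extraction; any callable mapping question -> answer works.
    """
    answers = [sample(question) for _ in range(n)]  # steps 1-2: n independent reasoning paths
    counts = Counter(answers)                       # step 3: tally the final answers
    return counts.most_common(1)[0][0]              # majority vote, no weighting
```

Ties fall to whichever answer `Counter` encountered first; in practice you would also keep `counts` around, since the distribution itself is informative.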

Wang et al. tested SC across three categories: mathematical reasoning (GSM8K, SVAMP, AQuA), symbolic reasoning, and commonsense reasoning. The improvement was consistent, not specific to one task type.

Concrete examples from the paper:

  • GSM8K, PaLM-540B: 56.5% (CoT prompting + greedy decode) → 74.4% (self-consistency), +17.9 pp (Table 2).
  • GSM8K, GPT-3 code-davinci-002: 60.1% → 78.0%, +17.9 pp (Table 2).

Key insight: a correct reasoning path is more stable than an incorrect one. If the model "knows" how to solve a problem, it likely reaches the same result by various paths. If it hallucinates, each path goes somewhere different — variance is high, no result dominates.

SC exploits the structure of distributed hallucination failure. Hallucinations are "scattered": each one goes somewhere different. Correct answers are "concentrated": they converge to a single result. Majority voting turns that difference into signal.

Why It Works: The Statistical Mechanics of Aggregation

Self-Consistency is not a trick technique. It is an application of a statistical principle that works far beyond AI.

Imagine 20 independent estimates of the number of marbles in a jar. The average of the estimates will outperform any individual estimate — even the best. Francis Galton empirically documented this "wisdom of crowds" phenomenon in 1907, analyzing a fairground weight-guessing contest and showing that aggregation can be remarkably accurate even when many individual guesses are noisy.

The mathematical justification for LLMs: if a model generates the correct answer with probability p > 0.5 (better than chance), and errors are distributed independently (each hallucination goes somewhere different), then aggregating N samples converges to the correct answer with probability approaching 1 as N grows. Critical assumption: independence of errors.
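This convergence can be checked exactly with a back-of-the-envelope binomial computation, under the simplifying two-outcome assumption that each sample is either correct (probability p) or wrong. Requiring a strict majority is conservative: when wrong answers scatter rather than coincide, a plurality already suffices.

```python
from math import comb

def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is
    correct, when each sample is correct with probability p.
    Odd n avoids ties."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.6, more samples push the vote toward certainty:
for n in (1, 5, 21, 101):
    print(n, round(majority_correct(0.6, n), 3))
```

Even a modest per-sample edge (p = 0.6) compounds quickly as n grows, which is the whole argument for paying for extra samples.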

Practical example: A historical event date. The model answers "1847" in 12 of 20 attempts, "1849" in 5, "1851" in 3. SC selects 1847. Based on the result's dominance, you know the answer is likely reliable.

Bonus: the distribution is a confidence signal. If answers were evenly distributed (5/5/5/5), SC would signal high uncertainty about the answer. SC doesn't just improve accuracy; it yields an approximate confidence signal as a byproduct.
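The date example and the distribution-as-confidence idea together, as a sketch (the counts are taken from the example above; the vote share is a rough signal, not a formally calibrated probability):

```python
from collections import Counter

def vote_with_confidence(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and its vote share, the latter read
    as a rough confidence signal."""
    counts = Counter(answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(answers)

# The date example: 12x "1847", 5x "1849", 3x "1851".
samples = ["1847"] * 12 + ["1849"] * 5 + ["1851"] * 3
print(vote_with_confidence(samples))  # -> ('1847', 0.6)
```

A share near 1/k over k distinct answers (the 5/5/5/5 case) is the "genuine uncertainty" reading; a dominant share is the "likely reliable" reading.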

When Self-Consistency Delivers the Greatest Benefit

SC delivers the greatest benefit on tasks with high variance in single-model answers — meaning where the model "knows" the answer but sometimes hallucinates the path to it.

SC works excellently on mathematical and logical problems with clear correct answers, factual questions with verifiable answers, and multi-step reasoning (planning, deduction, causal analysis). On GSM8K (math word problems), SC delivered +17.9 pp (Table 2). On StrategyQA (commonsense reasoning), it delivered +6.4 pp on GPT-3 code-davinci-002 (Table 3).

SC provides less benefit or doesn't work on open creative tasks (no "correct" answer exists for voting), value judgments (result distribution isn't concentrated at truth), and questions requiring extremely specialized knowledge where the model hallucinates consistently.

Practical heuristic: if your question has one correct answer that should be consistent (math, facts, logic), SC is a valuable technique. If the question requires creativity or perspectival judgment, SC isn't appropriate. If you're unsure: generate 5 answers and watch the distribution. High variance = SC won't help. Low variance = SC may help, but you probably don't need it.

Costs and Tradeoffs — When SC Is Impractical

SC is computationally and financially more expensive than single-query approaches, and the tradeoffs are real.

20 samples = 20× more API calls = roughly 20× higher cost, and 20× the latency if run sequentially (parallel sampling cuts the latency penalty at the same cost). For real-time applications (chatbots, live assistance), SC is often impractical. For analytical tasks where accuracy outweighs speed, the tradeoff is acceptable and calculable.

In the paper’s analysis, increasing the number of sampled reasoning paths tends to improve accuracy, with diminishing returns (see Figure 8). In practice, you can treat “more samples” as an adjustable knob: you buy accuracy with additional cost and latency.

When the cost is justified: high-stakes decisions with significantly asymmetric error costs (medical diagnosis, legal analysis, security assessment). For trivial questions, SC is unnecessary overhead.

Low-cost alternative: ask the model for an explicit self-consistency check: "Calculate the result two different ways and verify they agree." Less robust than statistical SC, but practically zero additional cost.

Self-Consistency as a Human Principle — From Science to Law

Aggregating independent estimates as an epistemic principle is not an AI-specific technique. It is the methodological foundation of humanity's most important decision-making processes.

Scientific replication is SC for experiments: if an experiment's results cannot be replicated, it was likely noise or error. The replication protocol is precisely SC: repeat the measurement, check whether results converge. Galton's ox-weighing contest worked because the roughly 800 estimates were genuinely independent: each visitor estimated alone, without seeing others' guesses.

Meta-analysis is SC for studies: aggregating results from multiple studies outperforms relying on one. That's exactly why meta-analysis sits at the top of the medical evidence hierarchy. A jury is SC for factual judgment: twelve independent evaluators who must reach agreement are more robust than a single judge.

Implication: SC is not an AI trick — it's an application of the principle humans use in contexts where correctness matters most. If you trust scientific replication, you have reason to trust SC for factual reasoning. If you believe a jury is more robust than a single judge, you have reason to prefer aggregating multiple attempts over one.

Limits of SC: Correlated Errors and Hallucination Consensus

SC fails when model errors are not independently distributed — meaning when the model systematically hallucinates the same thing across all samples.

SC assumes errors are scattered: each hallucination goes somewhere different, the correct answer dominates by frequency. If the model consistently hallucinates the same thing (shared bias in training data), voting amplifies the hallucination instead of correcting it. 20 of 20 samples agreeing on a wrong answer is confident hallucination, not truth.

When consistent hallucination threatens: topics with inadequate representation in training data, events after the knowledge cutoff, overly specific claims where the model knows nothing and interpolates consistently.

Defense: monitor the SC result distribution. If consensus is very high (18/20) on a question where you'd expect low consensus (complex, niche topic), it's likely consistent hallucination — not confirmation of correctness. Supplement SC with external verification on high-stakes claims.
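Such a monitor can be sketched in a few lines. The 0.7 threshold here is an illustrative judgment call, not a number from the paper: set it per topic, lower for niche questions where genuine knowledge should produce some spread.

```python
from collections import Counter

def consensus_alert(answers: list[str], max_expected_share: float = 0.7) -> bool:
    """Flag suspiciously high consensus on a question where you would
    expect a spread of answers. A tripped alert suggests consistent
    hallucination rather than confirmation, so verify externally."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) > max_expected_share
```

An 18/20 consensus on a niche question trips the alert; a 5/5/5/5 spread does not.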

Practical Conclusions

Self-Consistency is a simple technique with surprisingly strong results. Four principles for application:

Use SC on factual and reasoning-heavy queries. Math, logic, factual questions — generate 5–10 answers and take the most frequent result. Three basic tools are enough: repeated queries, result recording, majority voting.

Read the distribution as calibrated confidence. High consensus (8/10 agree) = high confidence. Even distribution (3/3/2/2) = genuine uncertainty. Not just the result — the distribution is information too.

Don't combine SC with creative tasks. On questions without an objectively correct answer, SC has nothing to aggregate. Voting works only when a "truth" exists that correct results converge toward.

Watch for anomalies. Very high consensus on a niche or uncertain topic is a warning signal of consistent hallucination — not confirmation of correctness. Add external verification.

CrossChat implements Self-Consistency as an optional workflow parameter — configurable sample count and visualization of result distribution. Instead of manually running 20 queries and aggregating them, you get a structured output with the answer distribution and resulting consensus value.

Sources

  • Wang, X. et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023; arXiv:2203.11171. DOI: 10.48550/arXiv.2203.11171. (Introduces self-consistency; reports gains on GSM8K.)
  • Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168. DOI: 10.48550/arXiv.2110.14168. (Introduces GSM8K verifier training setup.)
  • Galton, F. (1907). Vox Populi. Nature 75, 450-451. DOI: 10.1038/075450a0. (Early empirical demonstration of aggregation accuracy.)
  • Surowiecki, J. (2004). The Wisdom of Crowds. (Book; ISBN: 978-0385503860.)

Evidence Map (Wang et al., arXiv:2203.11171)

  • Table 1: Aggregation strategies on PaLM-540B (GSM8K includes the 56.5 and 74.4 numbers under greedy decode vs majority vote).
  • Table 2: Main arithmetic results. GSM8K: PaLM-540B 56.5 → 74.4 (+17.9 pp); GPT-3 code-davinci-002 60.1 → 78.0 (+17.9 pp).
  • Table 3: Commonsense/symbolic results. StrategyQA: GPT-3 code-davinci-002 73.4 → 79.8 (+6.4 pp).
  • Table 9: Prompt-robustness on GSM8K (PaLM-540B).
  • Figure 8: Accuracy vs. number of sampled reasoning paths (PaLM-540B).

Editorial History

  • Concept: Claude Code + Anthropic Sonnet 4.6
  • Version 1: Claude Code + Anthropic Sonnet 4.6
  • Version 2: Codex + GPT-5.2
  • Quality audit (2026-03-23, Claude Code + Claude Opus 4.6): added Claims Framework, verified sources, language polish.
