LLM Council: How Five Models Vote and How to Calculate the Result
Technical principle of LLM Council: weighted voting, confidence scoring, and why model diversity determines output quality more than model count.
Five models receive the same question. Three agree, one abstains, one disagrees. How do you calculate the result?
Simple majority is not enough. It matters how confident each "voter" was, how much the arguments differed, and whether the outlier caught something the other four missed.
Claims Framework
- What this article claims: Weighted voting across multiple models outperforms simple majority. Model diversity (different vendors, different data) matters more than model count. Outlier votes can be more valuable than the majority result. The entire principle rests on Condorcet's jury theorem.
- What it is based on: Condorcet's theorem (1785) as mathematical foundation; Wang et al. (2024) -- Mixture-of-Agents architecture; Cui et al. (2025) -- Free-MAD principle without forced convergence.
- Where it simplifies: Condorcet's theorem requires voter independence and competence -- for LLMs trained on similar data, independence is debatable. The article does not present its own empirical measurements. The outlier type classification is a conceptual framework, not an empirically validated taxonomy.
Aggregating judgments is not a new problem. Jury systems, democratic voting, scientific peer review, committee decisions — humans have been solving the aggregation of differing judgments for thousands of years. Social choice theory formalized the conditions under which aggregation works, and the conditions under which it fails.
An LLM council transfers this principle to AI: instead of one model, we query multiple models and aggregate their answers. But naive implementation — simple majority — loses most of the value. What matters is the diversity of the "voters," their degree of confidence, and how disagreement is interpreted.
This article analyzes the mathematics of LLM council: why weighted voting outperforms simple majority, how to detect outlier models and when to listen to them, and why model diversity — not model count — determines output quality.
Condorcet's Jury Theorem: The Mathematics of When Aggregation Works
The Marquis de Condorcet in 1785 formalized the intuition behind the jury system: if each jury member decides correctly more than half the time and decisions are mutually independent, the probability of a correct collective decision grows with the number of jurors. With a large enough number of independent voters, it converges to certainty.
This applies to AI as well: if each model performs "better than chance" on a given type of question and responses are genuinely independent, aggregation improves the result compared to a single model.
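To make the growth concrete: the probability that a majority of n independent voters is correct is a binomial tail sum. A minimal sketch, with the per-model accuracy p = 0.7 chosen purely for illustration:

```python
from math import comb

def majority_correct(n: int, p: float) -> float:
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the right answer. Assumes odd n (no ties)."""
    need = n // 2 + 1  # smallest winning majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for n in (1, 3, 5, 9):
    print(n, round(majority_correct(n, 0.7), 3))
# 1 0.7
# 3 0.784
# 5 0.837
# 9 0.901
```

The same arithmetic cuts the other way: with p = 0.45, the tail sum falls below the single-voter accuracy as n grows. That is the competence condition, in numbers.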
But the theorem holds only under two conditions. Both are critical.
The independence condition. Votes must be genuinely independent. If models share training data or value frameworks, their responses are correlated — aggregation then amplifies shared biases rather than averaging them out. Five models trained on similar data with similar preference procedures are not five independent voters. They are one perspective repeated five times with minor variations.
The competence condition. Each model must perform "better than chance" in the given domain. If a model is systematically poor at a particular type of question — a specialized medical question for a model without healthcare training — including it in the council worsens the result. Aggregating incompetent voters does not produce a competent result.
Practical consequence: before assembling an LLM council, verify two things. First, are the models genuinely diverse — different vendors, different training data, different alignment procedures? Second, are all models competent for the type of question being asked?
If not, the council will provide the appearance of confidence without actual epistemic value.
Weighted Voting: How to Account for Degree of Confidence
Simple majority ignores how confident each model is in its answer. Weighted voting, where weight reflects confidence, extracts more information from the same number of responses.
Consider two scenarios:
Scenario A: Three models answer "recommend X" with high confidence. Two models answer "do not recommend X" with low confidence. Simple majority: X. Weighted voting: X, with a significant margin.
Scenario B: Three models answer "recommend X" with low confidence. Two models answer "do not recommend X" with high confidence. Simple majority: still X. Weighted voting: X may have a lower weighted sum than not-X. The result can reverse.
Simple majority gives the same result in both scenarios. Weighted voting distinguishes situations where confident minority arguments outweigh an uncertain majority.
How do you measure confidence in an LLM? Direct querying works as a rough approximation — "How confident are you in this answer on a scale of 1–10?" Consistency across repeated samples is a more robust proxy: a model that always answers the same way across ten samples is likely more confident than one whose answers vary. If the API provides token probabilities, they can be used directly.
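The self-consistency proxy takes only a few lines. A sketch, where `ask_model` is a hypothetical wrapper around any chat API sampled at temperature > 0 (nothing here depends on a specific vendor):

```python
from collections import Counter
from typing import Callable

def consistency_confidence(ask_model: Callable[[str], str],
                           prompt: str, samples: int = 10) -> tuple[str, float]:
    """Estimate confidence as answer stability: sample the model repeatedly
    and return (modal answer, fraction of samples that agree with it)."""
    answers = [ask_model(prompt) for _ in range(samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / samples
```

In practice, raw answers usually need normalization (case, whitespace, extracting the final verdict) before counting, or trivially different phrasings read as disagreement.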
A concrete calculation example:
| Model | Answer | Confidence |
|-------|--------|------------|
| A | Recommend X | 8/10 |
| B | Recommend X | 6/10 |
| C | Recommend X | 9/10 |
| D | Do not recommend X | 7/10 |
| E | Do not recommend X | 4/10 |
Weighted sum for X: 8 + 6 + 9 = 23. For not-X: 7 + 4 = 11. The result for X is stronger than simple majority shows (3:2 vs. 23:11).
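The same aggregation as code, a minimal sketch hard-coding the table above:

```python
from collections import defaultdict

votes = [  # (model, answer, confidence) from the table above
    ("A", "recommend X", 8),
    ("B", "recommend X", 6),
    ("C", "recommend X", 9),
    ("D", "do not recommend X", 7),
    ("E", "do not recommend X", 4),
]

totals = defaultdict(int)
for _, answer, confidence in votes:
    totals[answer] += confidence  # weight each vote by stated confidence

print(max(totals, key=totals.get), dict(totals))
# recommend X {'recommend X': 23, 'do not recommend X': 11}
```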
Weighted voting is asymmetrically valuable for questions where one model is highly confident and the rest are uncertain — in such cases, a confident outlier can legitimately outweigh an uncertain majority.
Outlier Detection: When One Model Saves the Entire Council
An outlier — a model with a minority answer — is not automatically an error. Sometimes it catches a perspective or fact that the other four missed. The key is distinguishing a valuable outlier from an erroneous one.
In a standard voting process, the outlier's result is outvoted by the majority. In an epistemic context, the outlier may be the most valuable part of the output. One expert who sees the problem differently from the other four may be right.
There are four types of outliers:
Erroneous outlier. The model made a fundamental factual error that the other four did not. Identifiable by cross-referencing with sources. This outlier is correctly outvoted.
Stylistic outlier. The model responds with a different format or length but agrees substantively with the majority. Irrelevant to voting — should not affect the weight of the result.
Perspective outlier. The model sees the question from a different angle — a different value framework, a different cultural context, a different interpretation of a key term. Outvoting this response discards valuable information; instead of ignoring it, document it explicitly alongside the majority result.
Expert outlier. The model has specialized training data in the relevant domain and sees a detail that others missed. Outvoting this outlier loses the expert perspective.
Practical procedure: do not interpret the council result as "the vote result, full stop." Document the outlier result alongside the majority result. When a model disagrees, ask it why — its argument may be stronger than it appears at first glance. Assess whether the disagreement is factual, value-based, or perspective-based. Factual disagreements are resolvable through cross-checking. Value disagreements are information about where the problem has no unambiguous answer.
The output of an LLM council is "vote result plus documentation of disagreement." Disagreement is information, not noise.
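One way to make "vote result plus documentation of disagreement" concrete is to carry dissent as a first-class field in the council's output rather than discarding it after the vote. A sketch; the field names and the 0.2 threshold are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class Dissent:
    model: str
    answer: str
    rationale: str  # the outlier's own explanation of why it disagrees
    kind: str       # "factual" | "value" | "perspective", triaged by a human

@dataclass
class CouncilResult:
    majority_answer: str
    weighted_margin: float  # normalized weighted sum, majority minus minority
    dissents: list[Dissent] = field(default_factory=list)

    def is_contested(self, threshold: float = 0.2) -> bool:
        """Flag results whose margin is thin enough that the dissent
        deserves human review before the answer ships."""
        return self.weighted_margin < threshold
```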
Model Diversity as the Foundation of Quality
Five copies of the same model will produce a worse result than five different models from different vendors. Diversity — not count — determines aggregation quality.
Research on Mixture-of-Agents architectures (Wang et al., 2024) shows that a setup with diverse models produces higher quality outputs than a homogeneous set. Different vendors bring different training data, different corporate cultures, different safety approaches, and different preference procedures during training. This diversity creates genuinely different perspectives with epistemic value.
Types of diversity in an LLM council:
Vendor diversity. OpenAI, Anthropic, Google, Mistral — different cultures, different safety philosophies, different training data sources. This diversity is the strongest guarantee of genuine independence.
Size diversity. Smaller models may be more precise in narrow domains where they are specialized; larger models see broader context. Combining them brings different levels of granularity.
Specialization diversity. A model trained on legal texts versus one trained on medical literature has different areas of expertise. For domain-specific questions, specialization may be more valuable than general competence.
Alignment philosophy diversity. Models with different safety constraints bring different value frameworks and different approaches to ambiguous questions.
The Free-MAD principle (Cui et al., 2025) proposes a variant where the council does not push for consensus. Models present different positions and a separate evaluator delivers the final verdict. This prevents false convergence — the situation where models unify their answers not because they accepted a better argument, but because they yielded to group pressure.
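The consensus-free shape is simple to express. A loose sketch of the pattern, not Free-MAD's published algorithm; `members` and `judge` are hypothetical callables you supply:

```python
from typing import Callable

def consensus_free_council(question: str,
                           members: dict[str, Callable[[str], str]],
                           judge: Callable[[str, dict[str, str]], str]) -> str:
    """Each member answers independently (no cross-reading, so no
    convergence pressure); a separate evaluator renders the verdict
    from the full set of positions."""
    positions = {name: ask(question) for name, ask in members.items()}
    return judge(question, positions)
```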
When assembling an LLM council: optimize for diversity, not count. Three genuinely diverse models are more valuable than five homogeneous ones.
Failure Modes of LLM Council
An LLM council fails predictably. Understanding the failure modes is a prerequisite for correct use.
Failure mode 1: Correlated errors. All models share a training gap — consensus amplifies a shared blind spot (discussed in detail in The Invisible Echo Chambers of AI). Symptom: all models confidently agree on a question where legitimate disagreement should exist.
Failure mode 2: False convergence. Models that initially disagree unify after reading each other's answers — not because they accepted a better argument, but because they yielded to convergence pressure. Solution: anonymize answers, or use multi-round debate with an explicit adversarial role.
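The anonymization fix is mechanical: strip model identities and shuffle the first-round answers before the council sees them in round two, so convergence can only happen on arguments, not reputations. A sketch, where `answers` maps model name to first-round text:

```python
import random

def anonymize_round(answers: dict[str, str], seed: int | None = None) -> list[str]:
    """Return first-round answers with identities stripped and order
    shuffled, ready to be shown to the council in the next round."""
    texts = list(answers.values())
    random.Random(seed).shuffle(texts)
    return [f"Answer {i + 1}: {text}" for i, text in enumerate(texts)]
```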
Failure mode 3: Wrong aggregation scheme. A question with a single correct answer (factual) requires majority vote. A question with multiple legitimate answers (value-based, perspective-based) requires presenting diversity, not a single result. Applying the wrong scheme to the wrong type of question is a fundamental error.
Failure mode 4: Incompetent voters. Models that are not competent in the given domain reduce, rather than improve, aggregation quality. Condorcet's theorem assumes each voter performs "better than chance" — if that does not hold, aggregation is counterproductive.
Pre-flight checklist: Are the models genuinely diverse? Are they competent for the type of question? Does the aggregation scheme match the nature of the question?
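The checklist can run as a guard before any tokens are spent. A sketch where `vendor` and `competent_for` are metadata you maintain about each member yourself (no model API reports this about itself):

```python
def preflight(council: list[dict], question_type: str) -> list[str]:
    """Return warnings for a council roster, where each member is a dict
    like {"name": ..., "vendor": ..., "competent_for": {"legal", ...}}."""
    warnings = []
    if len({m["vendor"] for m in council}) < 2:
        warnings.append("Single-vendor council: independence is doubtful.")
    weak = [m["name"] for m in council if question_type not in m["competent_for"]]
    if weak:
        warnings.append(f"Not competent for {question_type!r}: {weak}")
    return warnings
```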
At What Cost?
An LLM council is more complex, more expensive, and slower than a single-model approach. Five models mean roughly five times the token costs, higher latency, and orchestration overhead. And the result still requires human interpretation.
The investment pays off for specific types of questions: those with relevant uncertainty where a single model might err; strategic decisions where different perspectives have value; and factual cross-checks where mistakes are expensive.
It is not worth it for routine queries, quick lookups, or situations where latency is critical. An LLM council is not a replacement for a single model — it is an escalation path for questions where the cost of error exceeds the cost of complexity.
The Mathematics Has Conditions
An LLM council works. It has a mathematical foundation — Condorcet's jury theorem applied to AI. But it works only under conditions that must be met: genuine independence of models, their competence in the relevant domain, and correct selection of the aggregation scheme.
It fails predictably when these conditions are violated. Correlated errors, false convergence, incompetent voters — these are failure points that naive implementation will not catch.
Understanding the mathematics of aggregation is as important as the voting itself. The principles of LLM council — weighted voting, outlier analysis, model diversity — are applicable in any multi-model approach. CrossChat implements them as a structured workflow; the principles are transferable to any set of models you assemble manually.
Sources
- Wang, T. et al. (2024). Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv:2406.04692. DOI: 10.48550/arXiv.2406.04692
- Cui, Y. et al. (2025). Free-MAD: Consensus-Free Multi-Agent Debate. arXiv:2509.11035. DOI: 10.48550/arXiv.2509.11035
- Condorcet, M. de (1785). Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. (Historical source; jury theorem as theoretical framework.)
Editorial History
- Concept: Claude Code + Anthropic Sonnet 4.6
- Version 1: Claude Code + Anthropic Sonnet 4.6
- Version 2: Codex + GPT-5.2
- Quality audit (2026-03-23, Claude Code + Claude Opus 4.6): added Claims Framework, verified sources, language polish.