Multi-Agent Debate · Peer-Reviewed Research

The Science Behind the Jury

FounderJury is built on Multi-Agent Debate (MAD), an LLM architecture validated by five peer-reviewed papers from MIT, Google DeepMind, and global research consortia. The foundational work (ICML 2024) showed that having multiple LLMs debate over several rounds substantially improves factual accuracy and reduces hallucinations versus single-model prompting. Subsequent papers show that adversarial debate pressure further strengthens reasoning. Every model cited on this page is deployed for exactly that purpose.

Standard AI acts as a yes-man — reinforcing its own biases and hallucinating facts you didn't ask for.

FounderJury replaces that yes-man with Multi-Agent Debate (MAD), an architecture independently validated by researchers at MIT, Google DeepMind, and other leading institutions.

We still catch hallucinations, because no AI is perfect. That's why we show you the Reality Check Protocol: so you always know which outputs to verify yourself.

Which 5 peer-reviewed studies back this architecture?

MIT + Google DeepMind · 2023 / ICML 2024

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Key Finding

Forcing multiple LLM instances to propose and debate their responses over multiple rounds until they reach a common answer drastically improves factual validity and reduces hallucinations compared to single-model prompting.

How FounderJury Uses This

Our core architecture implements exactly this — 8 models propose independently, then cross-examine each other's reasoning across multiple rounds before the synthesis agent delivers your verdict.

Read paper →
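
To make the mechanism concrete, here is a minimal sketch of the propose-critique-synthesize loop the ICML 2024 paper describes. The model names, round count, and `complete` helper are placeholders, not FounderJury's actual stack.

```python
# Minimal sketch of a multi-agent debate loop. Everything named here is
# illustrative: swap in real model identifiers and a real LLM client.

MODELS = ["model-a", "model-b", "model-c"]  # FounderJury runs 8
ROUNDS = 2

def complete(model: str, prompt: str) -> str:  # stub; wire up your LLM client
    raise NotImplementedError

def debate(question: str) -> str:
    # Round 0: every model answers independently, without seeing the others.
    answers = {m: complete(m, question) for m in MODELS}

    # Debate rounds: each model reads its peers' answers, critiques them,
    # and revises its own position.
    for _ in range(ROUNDS):
        revised = {}
        for m in MODELS:
            peers = "\n\n".join(a for k, a in answers.items() if k != m)
            revised[m] = complete(
                m,
                f"Question: {question}\n\nOther agents answered:\n{peers}\n\n"
                "Critique their reasoning, then give your updated answer.",
            )
        answers = revised

    # A synthesis agent merges the final positions into one verdict.
    return complete(
        "synthesis-model",
        f"Question: {question}\n\nFinal positions:\n"
        + "\n\n".join(answers.values())
        + "\n\nWrite the consensus verdict.",
    )
```
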
Algoverse AI Research · October 2024

A Debate-Driven Experiment on LLM Hallucinations and Accuracy

Key Finding

When one model deliberately introduces false information into a debate, the truthful majority is forced to justify its reasoning more rigorously, which ultimately improves overall accuracy compared to debates without adversarial pressure.

How FounderJury Uses This

Team East (DeepSeek, Qwen, MiniMax, QwQ) acts as the adversarial pressure — trained on different data, with different assumptions about markets. Their challenges force Team West to produce stronger, better-justified arguments.

Read paper →
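
A rough sketch of how one side of a debate can be locked into an adversarial posture, in the spirit of the Algoverse setup. The system prompt and team roster are illustrative assumptions, and `complete` is the same hypothetical LLM stub as above.

```python
def complete(model: str, prompt: str) -> str:  # stub; wire up your LLM client
    raise NotImplementedError

TEAM_EAST = ["model-e1", "model-e2"]  # placeholder adversarial roster

# The adversarial side never gets a "be agreeable" instruction; it is told
# to assume the pitch is flawed and attack the weakest claim it can find.
ADVERSARIAL_SYSTEM = (
    "Assume the startup pitch under review is flawed. Identify the weakest "
    "claim in the arguments below and attack it with concrete counter-evidence."
)

def adversarial_round(team: list[str], arguments: list[str]) -> list[str]:
    joined = "\n\n".join(arguments)
    return [
        complete(model, f"{ADVERSARIAL_SYSTEM}\n\nArguments under review:\n{joined}")
        for model in team
    ]
```
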
Global Research Consortium · 2024–2025

Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs

Key Finding

Single LLMs suffer from overconfidence bias — when asked to check their own work, they usually agree with their initial mistake. Forcing models to adopt different preset stances breaks this echo chamber and eliminates stubborn hallucinations.

How FounderJury Uses This

This is exactly why asking ChatGPT 'is my idea good?' doesn't work. Our models are assigned adversarial roles from the start — the Contrarian cannot agree, the Devil's Advocate must attack every assumption.

Read paper →
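
A sketch of preset stances as the counterfactual-debating paper frames them: every agent is locked into a role before it sees the question, so unanimous agreement with an initial mistake is impossible by construction. The role wording below is illustrative, not FounderJury's actual prompts.

```python
def complete(model: str, prompt: str) -> str:  # stub; wire up your LLM client
    raise NotImplementedError

# Each stance is fixed before the question arrives, breaking the echo
# chamber in which a single model simply re-approves its own first answer.
STANCES = {
    "contrarian": "You must disagree with the prevailing answer and argue the opposite.",
    "devils_advocate": "You must attack every assumption, even those you believe are true.",
    "defender": "You must present the strongest possible case in favor.",
}

def stance_answers(model: str, question: str) -> dict[str, str]:
    return {
        role: complete(model, f"{instruction}\n\nQuestion: {question}")
        for role, instruction in STANCES.items()
    }
```
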
Global Research · May 2025

Removal of Hallucination on Hallucination: Debate-Augmented RAG

Key Finding

Simply giving AI access to web search often makes hallucinations worse if the search returns bad data. Multi-Agent Debate filters web data by having independent agents cross-examine retrieved information before accepting it.

How FounderJury Uses This

When Grok pulls real-time market data, our Reality Check Protocol (DeepSeek independently verifying Claude's synthesis) catches cases where live data contradicts the verdict before you see it.

Read paper →
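
A sketch of the cross-examination step in debate-augmented retrieval: a model that did not write the draft verdict checks it against the retrieved evidence before anything is accepted. Model names and prompt wording are assumptions, not the actual Reality Check Protocol implementation.

```python
def complete(model: str, prompt: str) -> str:  # stub; wire up your LLM client
    raise NotImplementedError

def reality_check(verdict: str, retrieved: list[str]) -> str:
    """Ask an independent verifier to list claims the evidence contradicts."""
    evidence = "\n\n".join(retrieved)
    return complete(
        "verifier-model",  # must differ from the model that wrote the draft
        "You did not write the verdict below. List every claim in it that the "
        "evidence contradicts or fails to support.\n\n"
        f"Verdict:\n{verdict}\n\nEvidence:\n{evidence}",
    )
```
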
Research Consortium · September 2025 / Updated January 2026

Self-Improvement of Language Models by Post-Training on Multi-Agent Debate

Key Finding

The Multi-Agent Consensus Alignment (MACA) debate structure produces a +26.8% improvement in advanced reasoning and a +27.6% improvement in self-consistency compared to standard single-prompt AI approaches.

How FounderJury Uses This

Scientifically validated: structured debate between multiple models improves reasoning quality by over 25% compared to asking a single AI the same question.

Read paper →
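
Self-consistency in this literature is usually measured as agreement across repeated samples of the same question. Here is a minimal sketch of that metric, assuming a generic `complete` stub; the MACA paper's exact evaluation protocol may differ.

```python
from collections import Counter

def complete(model: str, prompt: str) -> str:  # stub; wire up your LLM client
    raise NotImplementedError

def self_consistency(model: str, question: str, k: int = 8) -> float:
    """Fraction of k sampled answers that agree with the majority answer."""
    samples = [complete(model, question) for _ in range(k)]
    majority_count = Counter(samples).most_common(1)[0][1]
    return majority_count / k  # 1.0 means perfectly self-consistent
```
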

Frequently asked questions

What is LLM startup idea scoring?

A scoring method that uses Large Language Models to evaluate startup ideas against a structured rubric. Single-model LLM scoring suffers from training-data bias and hallucination. Multi-Agent Debate — multiple LLMs debating their scores over rounds — substantially improves factual accuracy (MIT + Google DeepMind, ICML 2024). FounderJury implements MAD directly at the product level.
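
To illustrate what rubric-based jury scoring can look like in practice, here is a minimal sketch: each model scores the idea per criterion, and the per-criterion median damps any single model's outlier. The rubric, JSON reply format, model roster, and `complete` stub are all illustrative assumptions.

```python
import json
from statistics import median

def complete(model: str, prompt: str) -> str:  # stub; wire up your LLM client
    raise NotImplementedError

MODELS = ["model-a", "model-b", "model-c"]  # placeholder roster
RUBRIC = ["market size", "differentiation", "founder-market fit", "timing"]

def jury_scores(idea: str) -> dict[str, float]:
    prompt = (
        "Score this startup idea from 1 to 10 on each of these criteria: "
        + ", ".join(RUBRIC)
        + ". Reply with only a JSON object mapping each criterion to a number.\n\n"
        + idea
    )
    # Median per criterion is robust to one model scoring wildly high or low.
    per_model = [json.loads(complete(m, prompt)) for m in MODELS]
    return {c: median(s[c] for s in per_model) for c in RUBRIC}
```
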

What the science doesn't guarantee

Academic validation of the MAD architecture means the debate format reduces errors compared to single-model prompting. It does not mean zero errors. Our Reality Check Protocol (DeepSeek verifying Claude's output) exists precisely because no AI — debating or not — is always right. Run every domain suggestion through dpma.de before buying. Verify every market size claim independently. The jury gives you better odds. Not certainty.