Methodology
How the 8-model debate actually works
FounderJury’s verdict is produced by eight frontier AI models moving through a five-step framework. First, each model forms an independent judgment with no access to the others, which removes anchoring bias. Each then applies rubric-anchored scoring across Team, Market, Innovation, Feasibility, and Thesis Fit; disagreements are surfaced and synthesized rather than averaged away; a Judge model screens every argument for unsupported claims; and the results are assembled into a reviewer-ready report. Every model returns a structured argument citing specific claims from your application as the basis for its verdict.
Readable by a skeptical analyst in 5 minutes. No marketing language, no fake metrics.
Model lineup last updated 2026-04-29.
STEP 1
Why run eight models independently instead of one?
Each evaluation runs the same prompt against eight frontier models in parallel. No model sees another model’s output before forming its own verdict. This eliminates anchoring: the first response in a chain cannot bias the rest.
Each model returns a verdict (GO / PIVOT / NO-GO), a numeric score on five dimensions (Team, Market, Innovation, Feasibility, Thesis Fit), and a structured argument citing the specific claims in the application that drove its conclusion.
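A minimal sketch of the fan-out, in Python. Nothing here is FounderJury's actual code: the model IDs are placeholders, and call_model() stands in for whatever provider SDK makes the real request and parses the structured response.

```python
import asyncio
from dataclasses import dataclass

# Placeholder IDs; the real lineup is not reproduced here.
MODELS = ["model-1", "model-2", "model-3", "model-4",
          "model-5", "model-6", "model-7", "model-8"]

@dataclass
class Verdict:
    model: str
    call: str               # "GO" | "PIVOT" | "NO-GO"
    scores: dict[str, int]  # 1-10 per dimension
    argument: str           # cites specific claims from the application

async def call_model(model: str, prompt: str) -> Verdict:
    # Stand-in for a real provider SDK call plus response parsing.
    raise NotImplementedError

async def evaluate(application_text: str) -> list[Verdict]:
    prompt = f"Score this application against the rubric:\n{application_text}"
    # All eight calls launch at once, and no model sees another's
    # output, so the first response cannot anchor the rest.
    return list(await asyncio.gather(*(call_model(m, prompt) for m in MODELS)))
```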
STEP 2
How are ideas scored dimension-by-dimension?
The default rubric weights are Team 25%, Market 25%, Innovation 20%, Feasibility 15%, Thesis Fit 15%. Institutional partners can configure their own weights or add custom dimensions per cohort.
Scores are 1–10 per dimension. The aggregate is the weighted mean of those scores, expressed on a 0–100 scale and mapped to a tier band (NO-GO 0–39, PIVOT 40–69, GO 70–100). The mapping is fixed; we do not curve scores across a cohort.
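As a worked example, here is the default rubric as code. One assumption is flagged loudly: the page maps a 1–10 weighted mean onto 0–100 tier bands, so this sketch multiplies by ten; the actual transform may differ.

```python
WEIGHTS = {"team": 0.25, "market": 0.25, "innovation": 0.20,
           "feasibility": 0.15, "thesis_fit": 0.15}

def aggregate(scores: dict[str, int]) -> float:
    """Weighted mean of 1-10 dimension scores, scaled to 0-100."""
    mean = sum(WEIGHTS[d] * s for d, s in scores.items())
    return mean * 10  # assumption: x10 maps the 1-10 mean onto the 0-100 bands

def band(score: float) -> str:
    if score >= 70: return "GO"
    if score >= 40: return "PIVOT"
    return "NO-GO"

# Example: a strong team with a shaky market lands in PIVOT territory.
scores = {"team": 8, "market": 4, "innovation": 7,
          "feasibility": 6, "thesis_fit": 7}
print(aggregate(scores), band(aggregate(scores)))  # 63.5 PIVOT
```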
STEP 3
What happens when models disagree on the verdict?
When models split on the verdict, we surface the split rather than averaging it away. A “6 PIVOT · 2 NO-GO” result is not the same signal as “8 PIVOT” — the former tells you a sub-panel saw a fatal flaw the majority did not weigh.
For institutional cohorts the report ranks applications by both the consensus score and the variance of model judgments. High-variance applications are flagged for human review regardless of their score.
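A sketch of the variance flag, assuming each applicant's eight per-model aggregate scores are on the 0–100 scale. The threshold value is invented for illustration; the page does not publish one.

```python
from statistics import mean, pvariance

VARIANCE_THRESHOLD = 4.0  # illustrative cutoff, not a published number

def rank_cohort(panel_scores: dict[str, list[float]]) -> list[dict]:
    """panel_scores maps applicant id -> the eight per-model aggregate scores."""
    rows = []
    for applicant, scores in panel_scores.items():
        variance = pvariance(scores)
        rows.append({
            "applicant": applicant,
            "consensus": mean(scores),
            "variance": variance,
            # A split panel is flagged for human review regardless of
            # where the consensus score lands.
            "human_review": variance > VARIANCE_THRESHOLD,
        })
    # Rank by consensus, with variance as the secondary signal.
    return sorted(rows, key=lambda r: (-r["consensus"], r["variance"]))
```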
STEP 4
How does the Judge catch model hallucinations?
A separate Judge model reviews the eight arguments before the report is finalized. Its job is to flag claims that are not supported by the application text — for example, a model citing a partnership the founder never mentioned. The Judge cannot add evidence, only mark unsupported claims for redaction.
When the Judge redacts a model’s argument, that model’s verdict is still counted but the contested reasoning is hidden from the final report. The redaction itself is logged in the audit trail.
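The redaction flow, sketched with a deliberately naive check: real unsupported-claim detection would be semantic (an LLM Judge testing whether the application entails the claim), not the substring match used here.

```python
from dataclasses import dataclass

@dataclass
class Argument:
    model: str
    verdict: str
    claims: list[str]      # claims the model cited from the application
    redacted: bool = False

def judge(arguments: list[Argument], application_text: str,
          audit_log: list[dict]) -> list[Argument]:
    """Flag unsupported claims; the Judge can redact but never add evidence."""
    for arg in arguments:
        # Naive substring check as a stand-in for semantic entailment.
        unsupported = [c for c in arg.claims if c not in application_text]
        if unsupported:
            arg.redacted = True  # reasoning hidden from the final report
            audit_log.append({"model": arg.model, "unsupported": unsupported})
        # The model's verdict still counts toward the panel either way.
    return arguments
```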
STEP 5
What does a reviewer actually see in the report?
Per applicant: verdict band, dimension scores, the strongest pro and con cited by the panel, three reviewer action items (e.g. “Verify carbon credit partnerships”), and a model-agreement count (“6 of 8 agreed”).
Per cohort: ranked list with verdict and variance, dimension distributions for bias auditing, and an audit-log CSV of every override.
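Rendered as a schema, the per-applicant view would look roughly like this; every field name is illustrative, not the product's actual export format.

```python
from dataclasses import dataclass

@dataclass
class ApplicantReport:
    verdict_band: str                   # "GO" | "PIVOT" | "NO-GO"
    dimension_scores: dict[str, float]  # team, market, innovation, feasibility, thesis_fit
    strongest_pro: str                  # best-supported argument for
    strongest_con: str                  # best-supported argument against
    action_items: list[str]             # three items, e.g. "Verify carbon credit partnerships"
    model_agreement: str                # e.g. "6 of 8 agreed"
```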
LIMITS
What this is not
This is decision support, not decision delegation. Every report carries a visible AI-Recommendation-Only disclaimer. Final ADVANCE / REVIEW / DECLINE must be set by a qualified human reviewer.
An eight-model panel is wider than a single AI’s opinion, but it is not infallible. Models share training data and can share blind spots, particularly around very recent regulation, niche technical claims, and non-Western markets. Use the cohort variance signal and the Judge’s redactions to spot where the panel is least reliable.
Frequently asked questions
How does AI evaluate startup ideas?
An AI evaluator applies a consistent rubric across predefined dimensions — team, market, innovation, feasibility, thesis fit — rather than freeform opinion. FounderJury.ai uses eight models in parallel instead of one because single-model evaluation inherits that model's training-data biases. Each model independently scores and argues, then a synthesis step reconciles their verdicts into one structured output.
What do investors look for in a startup idea?
Venture investors most often cite five criteria: a team capable of executing, a large and growing market, a differentiated approach, technical feasibility, and fit with the investor's thesis. FounderJury.ai models each dimension explicitly: every model returns a numeric score on all five, not just a final verdict, so strengths and weaknesses are visible dimension-by-dimension.
What is a startup idea evaluation framework?
A structured scoring system that separates signal from noise across multiple dimensions instead of a single gut-check number. FounderJury implements Multi-Agent Debate (MAD) — eight models scoring the same idea against the same rubric, then debating their verdicts. The foundational paper (ICML 2024) shows this approach substantially reduces hallucinations versus single-model scoring.
What is market-fit AI analysis?
An AI-assisted evaluation of whether a product idea matches a real, demonstrated demand signal. Market-fit AI combines structured probes — demand evidence, substitute analysis, customer-profile sharpness — with multi-model scoring. FounderJury's Market dimension explicitly tests all three. A Market score below 5 is the single most common reason an otherwise strong idea receives a PIVOT verdict.
Want to verify on your own pile?
Request a sample cohort report (PDF)