Interactive Proofs & Games · InteractiveBench
Six frontier LLMs evaluated across four interactive domains — Logic, Math, Texas Hold'em, and Trust Game. Composite score = normalized average across all four benchmarks.
| # | Model | Organization | Math | Logic | Trust (avg payoff/round) | Poker (avg winnings/hand) | Badge |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5-mini | OpenAI | 73.1% | 17.4% | 1.836 | | BEST OVERALL |
| 2 | Grok-4.1-fast | xAI | 76.9% | 15.2% | 1.804 | | BEST MATH |
| 3 | Gemini-3-flash | Google DeepMind | 61.5% | 30.4% | | +31.8 | BEST LOGIC & POKER |
| 4 | Qwen3-max | Alibaba | 46.2% | 4.3% | 1.867 | | BEST TRUST |
| 5 | | DeepSeek | 48.1% | 15.2% | 1.648 | | |
| 6 | | Moonshot AI | 34.6% | 6.5% | 1.779 | | |
Scoring Methodology
Each domain score is min-max normalized to [0, 100], and the four normalized scores are then averaged with equal weight: Logic (Situation Puzzle accuracy), Math (HLE interactive accuracy), Poker (average winnings per hand), and Trust Game (average payoff per round).
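As a concrete illustration, the sketch below applies per-domain min-max normalization followed by an equal-weight average. The model labels and raw numbers are placeholders, not leaderboard data.

```python
# Minimal sketch of the composite-score computation described above.
# Domain names and raw-score values here are illustrative only.

def min_max_normalize(raw: dict[str, float]) -> dict[str, float]:
    """Map each model's raw domain score onto [0, 100] via min-max scaling."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # degenerate case: all models tied in this domain
        return {model: 50.0 for model in raw}
    return {model: 100.0 * (score - lo) / (hi - lo) for model, score in raw.items()}

def composite_scores(domain_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Equal-weight average of the normalized scores across all domains."""
    normalized = {domain: min_max_normalize(scores) for domain, scores in domain_scores.items()}
    models = next(iter(domain_scores.values())).keys()
    n_domains = len(domain_scores)
    return {
        model: sum(normalized[domain][model] for domain in normalized) / n_domains
        for model in models
    }

# Hypothetical raw scores: accuracy (%) for Logic/Math, winnings/hand for Poker,
# payoff/round for the Trust Game.
raw = {
    "logic": {"A": 17.4, "B": 30.4, "C": 4.3},
    "math":  {"A": 73.1, "B": 61.5, "C": 46.2},
    "poker": {"A": 5.0,  "B": 31.8, "C": -12.0},
    "trust": {"A": 1.836, "B": 1.600, "C": 1.867},
}
print(composite_scores(raw))
```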
Interactive Proofs
Models act as the Player and interact with a Grok-4.1-fast Judge. Logic: a 20-turn budget of yes/no questions on 46 Situation Puzzles. Math: a 20-turn budget on 52 HLE instances.
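One way such a turn-budgeted Player/Judge loop could be orchestrated is sketched below; `player_model` and `judge_model` stand for any chat-completion callables, and the prompt strings and stopping convention are assumptions rather than the benchmark's actual protocol messages.

```python
# Hedged sketch of a turn-budgeted Player/Judge loop for the interactive-proof tasks.
# The message format and "FINAL ANSWER" convention are illustrative assumptions.

MAX_TURNS = 20  # per-instance question budget stated in the protocol

def run_interactive_instance(player_model, judge_model, puzzle: str) -> bool:
    """Player asks yes/no questions; Judge answers from the hidden solution.
    Returns True if the Player's final answer is accepted within the budget."""
    transcript = [f"PUZZLE: {puzzle}"]
    for turn in range(MAX_TURNS):
        question = player_model(
            "\n".join(transcript) + "\nAsk a yes/no question or give FINAL ANSWER."
        )
        transcript.append(f"PLAYER (turn {turn + 1}): {question}")
        if question.startswith("FINAL ANSWER"):
            verdict = judge_model(
                "\n".join(transcript) + "\nIs the final answer correct? Reply YES or NO."
            )
            return verdict.strip().upper().startswith("YES")
        reply = judge_model(
            "\n".join(transcript) + "\nAnswer the question with YES, NO, or IRRELEVANT."
        )
        transcript.append(f"JUDGE: {reply}")
    return False  # budget exhausted without a final answer
```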
Interactive Games
Texas Hold'em: 5,000 hands across 10 tables with 6 LLM agents under No-Limit rules. Trust Game: a round-robin iterated Prisoner's Dilemma with a random (δ-geometric) horizon, measuring cooperation and betrayal rates.
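The δ-geometric horizon means that after every round the match continues with probability δ, so the expected match length is 1/(1−δ) rounds and neither agent knows in advance which round is last. A minimal sketch, assuming a standard Prisoner's Dilemma payoff matrix and δ = 0.9 (both placeholders, not the benchmark's exact values):

```python
# Illustrative sketch of an iterated match with a δ-geometric random horizon.
import random

DELTA = 0.9  # assumed continuation probability

PAYOFFS = {  # (my move, opponent move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def play_match(agent_a, agent_b, delta: float = DELTA) -> tuple[float, float]:
    """Run one iterated match with a geometric horizon; return avg payoff/round."""
    history_a, history_b = [], []
    total_a = total_b = rounds = 0
    while True:
        move_a, move_b = agent_a(history_a), agent_b(history_b)
        total_a += PAYOFFS[(move_a, move_b)]
        total_b += PAYOFFS[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
        rounds += 1
        if random.random() > delta:  # match ends with probability 1 - δ
            break
    return total_a / rounds, total_b / rounds

# Example: an always-cooperate agent against tit-for-tat.
always_cooperate = lambda history: "C"
tit_for_tat = lambda history: history[-1][1] if history else "C"
print(play_match(always_cooperate, tit_for_tat))
```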