InteractiveBench · Mar 5, 2026
6 models ranked by Composite Score (4-domain normalized average): Logic · Math · Poker · Trust Game

| # | Model | Organization | Composite Score | Logic Acc. | Math Acc. | Poker ($/hand) | Trust Game (avg payoff/round) |
|---|-------|--------------|-----------------|------------|-----------|----------------|-------------------------------|
| 1 | GPT-5-mini (Best Overall) | OpenAI | 77.9 | 17.4% | 73.1% | +22.2 | 1.836 |
| 2 | Grok-4.1-fast (Best Math) | xAI | 76.7 | 15.2% | 76.9% | +27.9 | 1.804 |
| 3 | Gemini-3-flash (Best Logic & Poker) | Google DeepMind | 74.7 | 30.4% | 61.5% | +31.8 | 1.725 |
| 4 | Qwen3-max (Best Trust) | Alibaba | 31.9 | 4.3% | 46.2% | −30.4 | 1.867 |
| 5 | | DeepSeek | 21.3 | 15.2% | 48.1% | −23.2 | 1.648 |
| 6 | | Moonshot AI | 17.9 | 6.5% | 34.6% | −28.3 | 1.779 |

Scoring Methodology

4-Domain Composite Score

Each domain score is min-max normalized to [0, 100], and the four normalized scores are averaged with equal weight: Logic (Situation Puzzle accuracy), Math (HLE interactive accuracy), Poker (average winnings per hand), and Trust Game (average payoff per round).
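For concreteness, here is a minimal Python sketch of the composite: each domain's raw scores are min-max normalized to [0, 100], then averaged with equal weight per model. The model names and raw values are illustrative placeholders, not the leaderboard's actual inputs.

```python
# Hypothetical raw per-domain scores for three models "A", "B", "C".
raw = {
    "logic": {"A": 0.30, "B": 0.17, "C": 0.15},   # accuracy
    "math":  {"A": 0.62, "B": 0.73, "C": 0.77},   # accuracy
    "poker": {"A": 31.8, "B": 22.2, "C": 27.9},   # avg $/hand
    "trust": {"A": 1.73, "B": 1.84, "C": 1.80},   # avg payoff/round
}

def minmax_0_100(scores):
    """Min-max normalize a {model: value} dict onto the [0, 100] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {m: 100.0 * (v - lo) / span for m, v in scores.items()}

normalized = {domain: minmax_0_100(vals) for domain, vals in raw.items()}
models = raw["logic"].keys()
composite = {
    m: sum(normalized[d][m] for d in raw) / len(raw)  # equal weights
    for m in models
}
print(sorted(composite.items(), key=lambda kv: -kv[1]))
```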

Interactive Proofs

Budget-Constrained Interaction

Models act as the Player and interact with a Grok-4.1-fast Judge. Logic uses a 20-turn budget of yes/no questions on 46 Situation Puzzles; Math uses a 20-turn budget on 52 HLE (Humanity's Last Exam) instances.
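A hedged sketch of the budget-constrained loop, assuming hypothetical ask_player and ask_judge wrappers around the Player and Judge models; the benchmark's actual prompts and acceptance criteria are not specified here.

```python
MAX_TURNS = 20  # per-instance turn budget

def run_instance(puzzle, ask_player, ask_judge):
    """Run one Player/Judge episode within the turn budget."""
    transcript = []
    for turn in range(MAX_TURNS):
        # Player sees the puzzle plus prior Q&A and asks a yes/no question
        # (Logic) or continues working toward an answer (Math).
        question = ask_player(puzzle, transcript)
        # Judge (Grok-4.1-fast in the benchmark) answers from ground truth.
        answer = ask_judge(puzzle, question)
        transcript.append((question, answer))
        if answer.get("solved"):  # Judge accepts the final answer
            return {"solved": True, "turns": turn + 1}
    return {"solved": False, "turns": MAX_TURNS}
```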

Interactive Games

Strategic Reasoning Under Uncertainty

Texas Hold'em: 5,000 hands across 10 tables, 6 LLM agents, No-Limit rules. Trust Game: a round-robin iterated Prisoner's Dilemma with a random horizon (δ-geometric continuation), measuring cooperation and betrayal rates.
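A minimal sketch of the δ-geometric random horizon: after every round the match continues with probability δ, so the number of rounds is geometrically distributed with expected length 1/(1−δ). The payoff matrix and strategies below are placeholders, not the benchmark's actual configuration.

```python
import random

PAYOFFS = {  # (my_move, their_move) -> my payoff; illustrative PD-style values
    ("C", "C"): 2, ("C", "D"): 0,
    ("D", "C"): 3, ("D", "D"): 1,
}

def play_match(strategy_a, strategy_b, delta=0.9):
    """One iterated match; after each round, play continues with probability delta."""
    hist_a, hist_b = [], []
    total_a = total_b = rounds = 0
    while True:
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        total_a += PAYOFFS[(move_a, move_b)]
        total_b += PAYOFFS[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
        rounds += 1
        if random.random() > delta:  # geometric stopping: expected length 1/(1-delta)
            break
    return total_a / rounds, total_b / rounds  # avg payoff per round

def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"

def always_defect(mine, theirs):
    return "D"

random.seed(0)
print(play_match(tit_for_tat, always_defect))
```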