Interactive Proofs & Games · InteractiveBench
Six frontier LLMs evaluated across four interactive domains — Logic, Math, Texas Hold'em, and Trust Game. Composite score = normalized average across all four benchmarks.
| # | Model | Organization | Math | Logic | Trust (avg payoff/round) | Poker (avg winnings/hand) | Badge |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5-mini | OpenAI | 73.1% | 17.4% | 1.836 | | BEST OVERALL |
| 2 | Grok-4.1-fast | xAI | 76.9% | 15.2% | 1.804 | | BEST MATH |
| 3 | Gemini-3-flash | Google DeepMind | 61.5% | 30.4% | | +31.8 | BEST LOGIC & POKER |
| 4 | Qwen3-max | Alibaba | 46.2% | 4.3% | 1.867 | | BEST TRUST |
| 5 | | DeepSeek | 48.1% | 15.2% | 1.648 | | |
| 6 | | Moonshot AI | 34.6% | 6.5% | 1.779 | | |
Scoring Methodology
Each domain score is min-max normalized to [0, 100], and the four normalized scores are then averaged with equal weight: Logic (Situation Puzzle accuracy), Math (HLE interactive accuracy), Poker (average winnings per hand), and Trust Game (average payoff per round).
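As a concrete illustration, the sketch below applies per-domain min-max normalization followed by an equal-weight average. The model labels and raw numbers are placeholders, not leaderboard data.

```python
# Minimal sketch of the composite-score computation described above.
# Domain names and raw-score values here are illustrative only.

def min_max_normalize(raw: dict[str, float]) -> dict[str, float]:
    """Map each model's raw domain score onto [0, 100] via min-max scaling."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # degenerate case: all models tied in this domain
        return {model: 50.0 for model in raw}
    return {model: 100.0 * (score - lo) / (hi - lo) for model, score in raw.items()}

def composite_scores(domain_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Equal-weight average of the normalized scores across all domains."""
    normalized = {domain: min_max_normalize(scores) for domain, scores in domain_scores.items()}
    models = next(iter(domain_scores.values())).keys()
    n_domains = len(domain_scores)
    return {
        model: sum(normalized[domain][model] for domain in normalized) / n_domains
        for model in models
    }

# Hypothetical raw scores: accuracy (%) for Logic/Math, winnings/hand for Poker,
# payoff/round for the Trust Game.
raw = {
    "logic": {"A": 17.4, "B": 30.4, "C": 4.3},
    "math":  {"A": 73.1, "B": 61.5, "C": 46.2},
    "poker": {"A": 5.0,  "B": 31.8, "C": -12.0},
    "trust": {"A": 1.836, "B": 1.600, "C": 1.867},
}
print(composite_scores(raw))
```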
Interactive Proofs
Models act as the Player and interact with a Grok-4.1-fast Judge. Logic: a 20-turn budget of yes/no questions on 46 Situation Puzzles. Math: a 20-turn budget on 52 HLE instances.
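One way such a turn-budgeted Player/Judge loop could be orchestrated is sketched below; `player_model` and `judge_model` stand for any chat-completion callables, and the prompt strings and stopping convention are assumptions rather than the benchmark's actual protocol messages.

```python
# Hedged sketch of a turn-budgeted Player/Judge loop for the interactive-proof tasks.
# The message format and "FINAL ANSWER" convention are illustrative assumptions.

MAX_TURNS = 20  # per-instance question budget stated in the protocol

def run_interactive_instance(player_model, judge_model, puzzle: str) -> bool:
    """Player asks yes/no questions; Judge answers from the hidden solution.
    Returns True if the Player's final answer is accepted within the budget."""
    transcript = [f"PUZZLE: {puzzle}"]
    for turn in range(MAX_TURNS):
        question = player_model(
            "\n".join(transcript) + "\nAsk a yes/no question or give FINAL ANSWER."
        )
        transcript.append(f"PLAYER (turn {turn + 1}): {question}")
        if question.startswith("FINAL ANSWER"):
            verdict = judge_model(
                "\n".join(transcript) + "\nIs the final answer correct? Reply YES or NO."
            )
            return verdict.strip().upper().startswith("YES")
        reply = judge_model(
            "\n".join(transcript) + "\nAnswer the question with YES, NO, or IRRELEVANT."
        )
        transcript.append(f"JUDGE: {reply}")
    return False  # budget exhausted without a final answer
```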
Interactive Games
Texas Hold'em: 5,000 hands across 10 tables with 6 LLM agents under No-Limit rules. Trust Game: a round-robin iterated Prisoner's Dilemma with a random (δ-geometric) horizon, measuring cooperation and betrayal rates.
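The δ-geometric horizon means that after every round the match continues with probability δ, so the expected match length is 1/(1−δ) rounds and neither agent knows in advance which round is last. A minimal sketch, assuming a standard Prisoner's Dilemma payoff matrix and δ = 0.9 (both placeholders, not the benchmark's exact values):

```python
# Illustrative sketch of an iterated match with a δ-geometric random horizon.
import random

DELTA = 0.9  # assumed continuation probability

PAYOFFS = {  # (my move, opponent move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def play_match(agent_a, agent_b, delta: float = DELTA) -> tuple[float, float]:
    """Run one iterated match with a geometric horizon; return avg payoff/round."""
    history_a, history_b = [], []
    total_a = total_b = rounds = 0
    while True:
        move_a, move_b = agent_a(history_a), agent_b(history_b)
        total_a += PAYOFFS[(move_a, move_b)]
        total_b += PAYOFFS[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
        rounds += 1
        if random.random() > delta:  # match ends with probability 1 - δ
            break
    return total_a / rounds, total_b / rounds

# Example: an always-cooperate agent against tit-for-tat.
always_cooperate = lambda history: "C"
tit_for_tat = lambda history: history[-1][1] if history else "C"
print(play_match(always_cooperate, tit_for_tat))
```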