Logic · Interactive Proof

Situation Puzzle

The Player asks yes/no-style questions to reconstruct a hidden narrative explanation within a 20-turn budget. The Judge's responses are limited to {yes, no, both, irrelevant}. All models score 0% without interaction, so solving a puzzle requires genuine reasoning over the Judge's feedback. Dataset: 46 expert-curated, high-difficulty instances. Judge: Grok-4.1-fast (temp=0).

46 instances · Best: Gemini 30.4%
Math · Interactive Proof

HLE Math Problems

The Player interacts with a Judge holding a reference derivation, querying the validity of intermediate steps (lemmas, equations) within a 20-turn budget. Interactive evaluation outperforms the pass@k baseline by 20–50% under a matched token budget. Dataset: 52 challenging instances from the HLE benchmark. Judge: Grok-4.1-fast (temp=0).

52 instances · Best: Grok 76.9%
Poker · Interactive Game

Texas Hold'em

Six LLM agents compete in No-Limit Texas Hold'em across 10 independent tables (500 hands each, 5,000 total). Agents receive structured observations (stage, hole cards, community cards, stack sizes, pot odds, action history) and output FOLD / CHECK / CALL / RAISE / ALL_IN. Metrics: avg winnings/hand, VPIP, fold rate, latency.

5,000 hands · Best: Gemini +31.8/hand
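The metrics listed on this card can be computed from per-hand logs. The following is a minimal sketch, not the benchmark's actual code: the `HandRecord` schema, the `summarize` helper, and the exact metric definitions are assumptions. VPIP is taken here as the share of hands in which the agent voluntarily put chips in preflop (CALL, RAISE, or ALL_IN), fold rate as the share of hands the agent folded at any stage, and pot odds as the fraction of the final pot the agent must pay to call, which is one common convention.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HandRecord:
    """One agent's view of one finished hand (hypothetical log schema)."""
    net_chips: float             # chips won (+) or lost (-) in this hand
    preflop_actions: List[str]   # the agent's own preflop actions, in order
    folded: bool                 # True if the agent folded at any stage


# Preflop actions that count as voluntarily putting chips in the pot.
VOLUNTARY = {"CALL", "RAISE", "ALL_IN"}


def pot_odds(pot: float, to_call: float) -> float:
    """Fraction of the final pot the agent must contribute to call.

    `pot` is the pot size before the call, including the bet being faced.
    """
    if to_call <= 0:
        return 0.0
    return to_call / (pot + to_call)


def summarize(hands: List[HandRecord]) -> Dict[str, float]:
    """Aggregate average winnings per hand, VPIP, and fold rate."""
    n = len(hands)
    if n == 0:
        raise ValueError("need at least one hand")
    vpip_hands = sum(
        1 for h in hands if any(a in VOLUNTARY for a in h.preflop_actions)
    )
    return {
        "avg_winnings_per_hand": sum(h.net_chips for h in hands) / n,
        "vpip": vpip_hands / n,
        "fold_rate": sum(1 for h in hands if h.folded) / n,
    }
```

Under these assumed definitions, an agent that checks its big blind and never adds chips preflop does not count toward VPIP, while a raise followed by a later fold counts toward both VPIP and fold rate.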
Trust · Interactive Game

Iterated Prisoner's Dilemma

Round-robin tournament of repeated Prisoner's Dilemma with a random horizon (geometric distribution, continuation probability δ). Each round: simultaneous COOPERATE / DEFECT. Payoffs: (C,C)→(2,2), (D,C)→(3,−1), (C,D)→(−1,3), (D,D)→(0,0). Metrics: avg payoff/round, cooperation rate, betrayal rate. Baselines: Grim Trigger (1.811) and Tit-for-Tat (TFT, 1.782).

Round-robin · Best: Qwen3 1.867/round
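A single match under this setup can be sketched as follows. The payoff table and the two baseline strategies (Grim Trigger, Tit-for-Tat) come from the card above; the function names, the `always_defect` opponent, and the stopping rule (at least one round is played, then the match continues with probability δ after each round) are assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

C, D = "COOPERATE", "DEFECT"

# (my move, opponent move) -> (my payoff, opponent payoff), as listed on the card.
PAYOFFS = {
    (C, C): (2, 2),
    (D, C): (3, -1),
    (C, D): (-1, 3),
    (D, D): (0, 0),
}

# A strategy maps (own history, opponent history) to the next move.
Strategy = Callable[[List[str], List[str]], str]


def tit_for_tat(own: List[str], opp: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return C if not opp else opp[-1]


def grim_trigger(own: List[str], opp: List[str]) -> str:
    """Cooperate until the opponent defects once, then defect forever."""
    return D if D in opp else C


def always_defect(own: List[str], opp: List[str]) -> str:
    """Illustrative opponent, not one of the source's baselines."""
    return D


def play_match(a: Strategy, b: Strategy, delta: float,
               rng: random.Random) -> Tuple[float, float]:
    """Play one match with a geometric horizon; return avg payoff/round for each side."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must be in [0, 1) so the match terminates")
    hist_a: List[str] = []
    hist_b: List[str] = []
    total_a = total_b = 0
    while True:
        # Both moves are chosen before either is revealed (simultaneous play).
        move_a = a(hist_a, hist_b)
        move_b = b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        if rng.random() >= delta:  # stop with probability 1 - delta
            break
    rounds = len(hist_a)
    return total_a / rounds, total_b / rounds
```

With this stopping rule the expected match length is 1/(1 − δ) rounds. A round-robin tournament would call `play_match` for every pair of participants and average the per-round payoffs.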

Evaluation framework

Each benchmark is modeled as a horizon-T interaction between a model π and an environment E.

Interactive Proofs

Convergent Regime

The model (Player) interacts with an omniscient Judge holding a hidden ground truth. Under a fixed query budget B, the Player asks questions and receives restricted feedback {yes, no, both, irrelevant}. Objective: maximize probability of correct final answer. Evaluated on Logic and Math domains.
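The Player and Judge protocol described above can be sketched as a budgeted loop. This is an illustrative sketch only: `run_interactive_proof` and the callback signatures are hypothetical, and the toy number-guessing Judge merely stands in for an LLM Judge holding a hidden narrative or reference derivation.

```python
from typing import Callable, List, Optional, Tuple

ALLOWED = {"yes", "no", "both", "irrelevant"}  # restricted feedback set
Transcript = List[Tuple[str, str]]             # (question, reply) pairs


def run_interactive_proof(
    ask: Callable[[Transcript], Optional[str]],
    judge: Callable[[str], str],
    answer: Callable[[Transcript], object],
    budget: int = 20,
) -> Tuple[object, Transcript]:
    """Let the Player ask up to `budget` questions, then commit to a final answer."""
    transcript: Transcript = []
    for _ in range(budget):
        question = ask(transcript)
        if question is None:  # the Player may stop before the budget runs out
            break
        reply = judge(question)
        if reply not in ALLOWED:
            raise ValueError(f"judge returned a disallowed reply: {reply!r}")
        transcript.append((question, reply))
    return answer(transcript), transcript


# Toy stand-in: the hidden truth is an integer in [0, 15], questions look like ">= N".
def make_toy_judge(secret: int) -> Callable[[str], str]:
    def judge(question: str) -> str:
        if not question.startswith(">= "):
            return "irrelevant"
        return "yes" if secret >= int(question[3:]) else "no"
    return judge


def _bounds(transcript: Transcript, lo: int = 0, hi: int = 15) -> Tuple[int, int]:
    """Narrow the candidate interval using every reply received so far."""
    for question, reply in transcript:
        n = int(question[3:])
        if reply == "yes":
            lo = max(lo, n)
        elif reply == "no":
            hi = min(hi, n - 1)
    return lo, hi


def toy_ask(transcript: Transcript) -> Optional[str]:
    lo, hi = _bounds(transcript)
    if lo == hi:
        return None  # the answer is pinned down, so stop asking
    return f">= {(lo + hi + 1) // 2}"


def toy_answer(transcript: Transcript) -> int:
    return _bounds(transcript)[0]
```

For example, `run_interactive_proof(toy_ask, make_toy_judge(11), toy_answer)` recovers the secret in four questions, while the same call with `budget=2` has to commit before the interval has collapsed and returns a wrong answer, which is the effect a fixed query budget B is meant to capture.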

Interactive Games

Divergent Regime

There is no dedicated Judge: the model interacts with other agents and a stochastic environment to maximize long-horizon utility. Objective: maximize expected discounted cumulative payoff. Evaluated on Texas Hold'em Poker and Iterated Prisoner's Dilemma (Trust Game).
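One way to write the two objectives down is given below. Only π, E, T, and B appear in the text above; the history h_t, actions a_t, observations o_t, rewards r_t, discount factor γ, ground truth y*, and final answer ŷ are notation assumed here for illustration, and the source may formalize the regimes differently.

```latex
% Interaction between model \pi and environment E over horizon T
% (h_0 is the initial prompt; notation assumed, not taken from the source)
h_t = (a_1, o_1, \dots, a_t, o_t), \qquad
a_t \sim \pi(\cdot \mid h_{t-1}), \qquad
o_t \sim E(\cdot \mid h_{t-1}, a_t), \qquad t = 1, \dots, T.

% Convergent regime (Interactive Proofs): at most B queries, then a final answer
\max_{\pi} \; \Pr\!\left[ \hat{y}(h_T) = y^{\star} \right]
\quad \text{subject to } T \le B.

% Divergent regime (Interactive Games): expected discounted cumulative payoff
\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=1}^{T} \gamma^{\,t-1} r_t \right],
\qquad 0 < \gamma \le 1.
```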

Key Finding

Substantial Room to Improve

Interactive benchmarks reveal capabilities invisible to static tests. Best Logic accuracy: 30.4% (Gemini). Best Math accuracy: 76.9% (Grok). Pass@k underestimates capability by 20–50% relative to interactive evaluation under the same token budget. Models show highly domain-specific strengths: the top model differs across Logic and Poker (Gemini), Math (Grok), and Trust (Qwen3).
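For reference, a pass@k baseline is often computed with the standard unbiased estimator 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them are correct. The source does not say which estimator or matching rule it uses, so the sketch below is an assumption, and `matched_k` is a hypothetical helper showing one simple way to pick k so that the sampling baseline spends roughly the same tokens as an interactive run.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def matched_k(interactive_tokens: int, tokens_per_sample: int) -> int:
    """Hypothetical rule: largest k whose total tokens fit the interactive budget."""
    if tokens_per_sample <= 0:
        raise ValueError("tokens_per_sample must be positive")
    return max(1, interactive_tokens // tokens_per_sample)
```

A comparison under a matched budget would then evaluate `pass_at_k(n, c, matched_k(...))` per instance and set it against the accuracy of the interactive run on the same instance.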