Situation Puzzle
The player asks yes/no-style questions to reconstruct a hidden narrative explanation within a 20-turn budget. Judge responses are limited to {yes, no, both, irrelevant}. All models score 0% without interaction, so progress requires genuine reasoning over the judge's answers. Dataset: 46 expert-curated, high-difficulty instances. Judge: Grok-4.1-fast (temp=0).
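The question loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `run_episode`, `ask`, and `judge` are hypothetical names, the early-stop convention (the player returning `None`) is an assumption, and how the final explanation is scored is not covered here.

```python
from typing import Callable, List, Optional, Tuple

# The four judge responses and the turn budget named in the task description.
VALID_RESPONSES = {"yes", "no", "both", "irrelevant"}
MAX_TURNS = 20

Transcript = List[Tuple[str, str]]


def run_episode(
    ask: Callable[[Transcript], Optional[str]],  # hypothetical player callback
    judge: Callable[[str], str],                 # hypothetical judge callback
    max_turns: int = MAX_TURNS,
) -> Transcript:
    """Run one question/answer episode and return the (question, answer) pairs."""
    transcript: Transcript = []
    for _ in range(max_turns):
        question = ask(transcript)
        if question is None:  # assumption: the player may stop before the budget runs out
            break
        answer = judge(question).strip().lower()
        if answer not in VALID_RESPONSES:
            # Reject anything outside the restricted response set.
            raise ValueError(f"judge returned an invalid response: {answer!r}")
        transcript.append((question, answer))
    return transcript
```

In practice `ask` and `judge` would wrap model calls; keeping them as plain callables makes the budget and the response-set check easy to test with stubs.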
HLE Math Problems
The player interacts with a judge holding a reference derivation, querying the validity of intermediate steps (lemmas, equations) within a 20-turn budget. Interactive querying outperforms the pass@k baseline by 20–50% under a matched token budget. Dataset: 52 challenging instances from the HLE benchmark. Judge: Grok-4.1-fast (temp=0).
Texas Hold'em
Six LLM agents compete in No-Limit Texas Hold'em across 10 independent tables (500 hands each, 5,000 total). Agents receive structured observations (stage, hole cards, community cards, stack sizes, pot odds, action history) and output one of FOLD / CHECK / CALL / RAISE / ALL_IN. Metrics: avg winnings/hand, VPIP (voluntarily put money in pot), fold rate, latency.
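As a rough sketch of what such an observation and the VPIP metric might look like in code: the field names, the `Observation` class, and the helper functions below are all hypothetical, not the benchmark's schema. Pot odds are computed with the common convention call / (pot + call), and VPIP is taken as the share of hands in which the agent voluntarily called or raised preflop; both are assumptions about how the benchmark defines them.

```python
from dataclasses import dataclass, field

# Action vocabulary from the task description.
ACTIONS = {"FOLD", "CHECK", "CALL", "RAISE", "ALL_IN"}


@dataclass
class Observation:
    """Hypothetical container for the structured observation an agent receives."""
    stage: str                      # e.g. "preflop", "flop", "turn", "river"
    hole_cards: list
    community_cards: list
    stacks: dict                    # player id -> stack size
    pot: int
    to_call: int
    action_history: list = field(default_factory=list)

    @property
    def pot_odds(self) -> float:
        # Assumed convention: fraction of the final pot the caller must contribute.
        if self.to_call == 0:
            return 0.0
        return self.to_call / (self.pot + self.to_call)


def parse_action(text: str) -> str:
    """Normalize a model's raw output to one of the five allowed actions."""
    action = text.strip().upper()
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {text!r}")
    return action


def vpip(preflop_actions_per_hand: list) -> float:
    """Share of hands with at least one voluntary preflop CALL, RAISE, or ALL_IN."""
    voluntary = {"CALL", "RAISE", "ALL_IN"}
    hands = len(preflop_actions_per_hand)
    if hands == 0:
        return 0.0
    played = sum(any(a in voluntary for a in acts) for acts in preflop_actions_per_hand)
    return played / hands
```

For example, an agent facing a 100-chip call into a 300-chip pot has pot odds of 0.25 under this convention, so a call breaks even when its win probability is at least 25%.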
Iterated Prisoner's Dilemma
Round-robin tournament of repeated Prisoner's Dilemma with a random horizon (geometric distribution, continuation probability δ). Each round: simultaneous COOPERATE / DEFECT. Payoffs: (C,C)→(2,2), (D,C)→(3,−1), (C,D)→(−1,3), (D,D)→(0,0). Metrics: avg payoff/round, cooperation rate, betrayal rate. Baselines: Grim Trigger (1.811) and Tit-for-Tat (TFT, 1.782).
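A single match under these rules can be sketched as below. The payoff table is taken from the description; everything else is illustrative. The function names are hypothetical, the sketch assumes at least one round is always played (giving an expected length of 1/(1−δ) rounds), and it implements only the two named baselines plus an always-defect opponent for contrast. The betrayal-rate metric is omitted because its definition is not given here.

```python
import random

# Payoff table from the description: (move_a, move_b) -> (payoff_a, payoff_b).
PAYOFFS = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, -1),
    ("C", "D"): (-1, 3),
    ("D", "D"): (0, 0),
}


def tit_for_tat(own_history, opp_history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not opp_history else opp_history[-1]


def grim_trigger(own_history, opp_history):
    # Cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in opp_history else "C"


def always_defect(own_history, opp_history):
    return "D"


def play_match(strategy_a, strategy_b, delta, rng):
    """Play one match; after each round, continue with probability delta."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    while True:
        # Both moves are chosen from the histories before this round (simultaneous play).
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        if rng.random() >= delta:  # stop with probability 1 - delta
            break
    rounds = len(hist_a)
    return total_a / rounds, total_b / rounds, hist_a, hist_b


def cooperation_rate(history):
    """Fraction of rounds in which the player chose COOPERATE."""
    return history.count("C") / len(history)
```

Two mutually cooperative strategies such as Tit-for-Tat and Grim Trigger earn 2 per round against each other regardless of match length, so average payoffs below 2 in a tournament reflect rounds lost to defecting opponents.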