Benchmark Domains — Interactive Benchmarks

Logic · Interactive Proof

Situation Puzzle

The player asks yes/no-style questions to reconstruct a hidden narrative explanation within a 20-turn budget. Responses are limited to {yes, no, both, irrelevant}. All models score 0% without interaction, so puzzles can only be solved by reasoning over the judge's responses. Dataset: 46 expert-curated, high-difficulty instances. Judge: Grok-4.1-fast (temp=0).

46 instances · Best: Gemini 30.4% →
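
The question/answer protocol described in this card can be sketched as a bounded loop. This is a minimal illustration, not the benchmark's actual harness: `run_puzzle`, `next_question`, and `judge` are hypothetical names, and how an episode is ended or scored is not specified in the source, so the sketch only enforces the 20-turn budget and the four allowed responses.

```python
from typing import Callable, List, Tuple

# Response vocabulary and turn budget as stated in the card.
ALLOWED_RESPONSES = {"yes", "no", "both", "irrelevant"}
TURN_BUDGET = 20

History = List[Tuple[str, str]]  # (question, response) pairs


def run_puzzle(
    next_question: Callable[[History], str],  # hypothetical player policy
    judge: Callable[[str], str],              # hypothetical judge wrapper
    turn_budget: int = TURN_BUDGET,
) -> History:
    """Run one episode and return the full question/response transcript."""
    history: History = []
    for _ in range(turn_budget):
        question = next_question(history)
        response = judge(question).strip().lower()
        # Reject anything outside the restricted vocabulary.
        if response not in ALLOWED_RESPONSES:
            raise ValueError(f"unexpected judge response: {response!r}")
        history.append((question, response))
    return history
```

A real harness would also need a stopping rule for a correct reconstruction; that part is left out here because the card does not describe it.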

Math · Interactive Proof

HLE Math Problems

The player interacts with a judge that holds a reference derivation, querying the validity of intermediate steps (lemmas, equations) within a 20-turn budget. Interactive evaluation outperforms the pass@k baseline by 20–50% under a matched token budget. Dataset: 52 challenging instances from the HLE benchmark. Judge: Grok-4.1-fast (temp=0).

52 instances · Best: Grok 76.9% →
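
For readers unfamiliar with the pass@k baseline mentioned in this card, the sketch below shows the widely used unbiased estimator, 1 − C(n−c, k) / C(n, k), for n samples of which c are correct. Whether this benchmark computes pass@k in exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples that were correct
    k: number of samples the metric is allowed to inspect (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset has a hit.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 are correct, `pass_at_k(10, 5, 1)` gives 0.5, the plain success rate.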

Poker · Interactive Game

Texas Hold'em

Six LLM agents compete in No-Limit Texas Hold'em across 10 independent tables (500 hands each, 5,000 total). Agents receive structured observations (stage, hole cards, community cards, stack sizes, pot odds, action history) and output FOLD / CHECK / CALL / RAISE / ALL_IN. Metrics: avg winnings/hand, VPIP, fold rate, latency.

5,000 hands · Best: Gemini +31.8/hand →
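
Two of the quantities named in this card, pot odds and VPIP, can be illustrated with a short sketch. The definitions used here are common poker conventions and are assumptions about this benchmark: pot odds are taken as the call amount divided by the pot after calling, and VPIP as the share of hands in which the agent voluntarily put chips in preflop (CALL, RAISE, or ALL_IN). Function and parameter names are hypothetical.

```python
from typing import List

# Action vocabulary as listed in the card.
ACTIONS = ("FOLD", "CHECK", "CALL", "RAISE", "ALL_IN")
# Assumed set of preflop actions that count as voluntary investment.
VOLUNTARY = {"CALL", "RAISE", "ALL_IN"}


def pot_odds(pot: float, to_call: float) -> float:
    """Break-even equity needed to call: to_call / (pot + to_call)."""
    if to_call <= 0:
        return 0.0  # nothing to call, so no equity is required
    return to_call / (pot + to_call)


def vpip(preflop_actions_per_hand: List[List[str]]) -> float:
    """Fraction of hands with at least one voluntary preflop action."""
    if not preflop_actions_per_hand:
        return 0.0
    hits = sum(
        1
        for actions in preflop_actions_per_hand
        if any(a in VOLUNTARY for a in actions)
    )
    return hits / len(preflop_actions_per_hand)
```

Under this convention, facing a bet of 50 into a pot of 100 requires winning at least one third of the time for a call to break even.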

Trust · Interactive Game

Iterated Prisoner's Dilemma

Round-robin tournament of the repeated Prisoner's Dilemma with a random horizon (geometrically distributed, continuation probability δ). Each round: simultaneous COOPERATE / DEFECT. Payoffs: (C,C)→(2,2), (D,C)→(3,−1), (C,D)→(−1,3), (D,D)→(0,0). Metrics: avg payoff/round, cooperation rate, betrayal rate. Baselines: Grim Trigger (1.811) and Tit-for-Tat (TFT, 1.782).

Round-robin · Best: Qwen3 1.867/round →
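
The match structure in this card can be sketched as follows, using the stated payoff matrix and a geometric horizon in which each round is followed by another with probability δ. This is an illustrative sketch under stated assumptions, not the benchmark's code: `play_match` and the strategy functions are hypothetical names, and the baseline strategies follow their textbook definitions.

```python
import random
from typing import Callable, List, Tuple

C, D = "COOPERATE", "DEFECT"
# Payoff matrix from the card: (move_a, move_b) -> (payoff_a, payoff_b).
PAYOFF = {(C, C): (2, 2), (D, C): (3, -1), (C, D): (-1, 3), (D, D): (0, 0)}

Strategy = Callable[[List[str], List[str]], str]  # (own history, opponent history)


def tit_for_tat(own: List[str], opp: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return opp[-1] if opp else C


def grim_trigger(own: List[str], opp: List[str]) -> str:
    """Cooperate until the opponent defects once, then defect forever."""
    return D if D in opp else C


def always_defect(own: List[str], opp: List[str]) -> str:
    return D


def play_match(a: Strategy, b: Strategy, delta: float,
               rng: random.Random) -> Tuple[float, float, int]:
    """Play one match; return (avg payoff a, avg payoff b, rounds played)."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must be in [0, 1) so the match terminates")
    hist_a: List[str] = []
    hist_b: List[str] = []
    total_a = total_b = 0
    while True:
        # Moves are chosen simultaneously from the histories so far.
        move_a, move_b = a(hist_a, hist_b), b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
        if rng.random() >= delta:  # stop with probability 1 - delta
            break
    rounds = len(hist_a)
    return total_a / rounds, total_b / rounds, rounds
```

With this stopping rule the expected match length is 1 / (1 − δ) rounds, and two strategies that never defect first, such as Tit-for-Tat against Grim Trigger, each average 2 per round.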
