InteractiveBench  ·  March 2026

Interactive Benchmarks

A unified evaluation paradigm that assesses LLMs' reasoning ability through active information acquisition — spanning Interactive Proofs (Logic & Math) and Interactive Games (Poker & Trust).

Explore Benchmarks → · View Leaderboard
🤖 6 Frontier LLMs Evaluated
🧩 4 Benchmark Domains
📊 98 Benchmark Instances
♠️ 5K+ Poker Hands Simulated
🏆 76.9% Best Math Accuracy

Evaluation Framework

Two paradigms, four domains

Interactive Proofs & Games

A unified evaluation paradigm assessing LLMs' active information-acquisition ability under budget constraints, covering abductive logic reasoning, mathematical problem solving, strategic poker play, and adaptive trust-game simulation.
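As an illustration of the proof-style interaction loop, here is a minimal sketch in which the model may ask a bounded number of oracle questions before committing to an answer. All names (`ask_model`, `oracle_answer`, the `ANSWER:` convention) and the default budget of 20 are hypothetical, not the benchmark's actual API; the paper describes the real protocol.

```python
# Minimal sketch of one interactive-proof episode under a question
# budget. Function names, the budget, and the ANSWER: convention are
# illustrative assumptions, not InteractiveBench's actual interface.

def run_episode(puzzle: str, ask_model, oracle_answer, budget: int = 20) -> str:
    """Let the model query an oracle, then commit to a final answer."""
    transcript = [f"Puzzle: {puzzle}"]
    for _ in range(budget):
        # The model sees the transcript and either asks a question or
        # signals that it is ready to answer.
        move = ask_model(transcript)          # e.g. "Q: Was the man blind?"
        if move.startswith("ANSWER:"):
            return move[len("ANSWER:"):].strip()
        reply = oracle_answer(puzzle, move)   # e.g. "yes" / "no" / "irrelevant"
        transcript.append(f"{move} -> {reply}")
    # Budget exhausted: force a final answer from what was learned.
    return ask_model(transcript + ["Budget exhausted; answer now."])
```

The budget is what makes the task interactive: the model is graded on what it can conclude from a limited number of self-chosen questions, not on static question answering.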

Logic · Interactive Proof
Situation Puzzle
46 instances · Best: Gemini 30.4%
Math · Interactive Proof
HLE Math Problems
52 instances · Best: Grok 76.9%
Poker · Interactive Game
Texas Hold'em
5,000 hands · Best: Gemini +31.8/hand
Trust · Interactive Game
Iterated Prisoner's Dilemma
Round-robin · Best: Qwen3 1.867/round (scoring sketch below)
Explore all benchmarks →
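To make the trust-game number above concrete: a per-round score such as 1.867 can be read against a prisoner's-dilemma payoff matrix. The sketch below uses the textbook values (temptation 5, mutual reward 3, mutual punishment 1, sucker 0); the page does not state the benchmark's actual matrix, so these defaults are an assumption.

```python
# Per-round payoff in an iterated prisoner's dilemma, using the
# textbook matrix (T=5, R=3, P=1, S=0). These values are assumed for
# illustration; the benchmark's matrix may differ.

PAYOFF = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # sucker's payoff vs. temptation
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection
}

def mean_payoff(my_moves: list[str], their_moves: list[str]) -> float:
    """Average points per round for the first player."""
    total = sum(PAYOFF[(m, t)][0] for m, t in zip(my_moves, their_moves))
    return total / len(my_moves)

# A mixed history lands between mutual defection (1.0/round) and mutual
# cooperation (3.0/round), the range a score like 1.867 falls into.
print(mean_payoff(list("CCDDCD"), list("CDDCDD")))  # -> 1.666...
```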
Rankings

Interactive Benchmark Leaderboard

1 · GPT-5-mini · OpenAI · 77.9
2 · Grok-4.1-fast · xAI · 76.7
3 · Gemini-3-flash · Google DeepMind · 74.7
4 · Qwen3-max · Alibaba · 31.9
5 · DeepSeek-v3.2 · DeepSeek · 21.3
6 · Kimi-k2-thinking · Moonshot AI · 17.9
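The page does not say how the single leaderboard score is aggregated across the four domains. Purely as an assumption, the sketch below min-max normalizes each domain to a 0–100 scale and averages; the benchmark's real weighting may differ.

```python
# Hypothetical aggregation of per-domain results into one leaderboard
# number: min-max normalize each domain to [0, 100], then average.
# This is an assumption -- the page does not document the real scheme.

def normalize(score: float, lo: float, hi: float) -> float:
    return 100.0 * (score - lo) / (hi - lo)

def overall(scores: dict[str, float], ranges: dict[str, tuple[float, float]]) -> float:
    normed = [normalize(s, *ranges[d]) for d, s in scores.items()]
    return sum(normed) / len(normed)

# Example with made-up score ranges, for shape only:
print(overall(
    {"logic": 30.4, "math": 76.9, "poker": 31.8, "trust": 1.867},
    {"logic": (0, 100), "math": (0, 100), "poker": (-100, 100), "trust": (0, 3)},
))
```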
Learn

LLM Whiteboard Sessions

Latest news about our progress on Interactive Benchmarks.

Latest Paper
Mar 5, 2026 · InteractiveBench
Interactive Benchmarks: Evaluating LLMs via Active Information Acquisition
Open Access · arXiv:2603.04737
4 benchmark domains · 6 frontier models →
View all sessions →
Open Source

InteractiveBench

6 LLMs · 4 Domains

Frontier models evaluated across logic, math, poker, and trust game benchmarks.

GitHub · arXiv · Open Access
View on GitHub →