A unified evaluation paradigm that assesses LLMs' reasoning ability through active information acquisition — spanning Interactive Proofs (Logic & Math) and Interactive Games (Poker & Trust).
Evaluation Framework
A unified evaluation paradigm assessing LLMs' active information-acquisition ability under budget constraints, covering abductive logical reasoning, mathematical problem solving, strategic poker play, and adaptive trust-game simulation.
Latest news about our progress on Interactive Benchmarks.
Frontier models evaluated across logic, math, poker, and trust game benchmarks.