Category
Weight: 1.0x
Long Reasoning
Multi-step logic puzzles, extended chain-of-thought, and complex analytical reasoning tasks requiring sustained coherence over many steps.
Best Score: 0.0
Avg Score: 0.0
Tests: 3
Performance Over Time — All Models
Model Rankings
Test Breakdown
Multi-step Logic Puzzle
Complex optimization with 8+ constraints across multiple variables
Claude Sonnet 4.6: 70.7
Claude Opus 4.8: 70.3
Grok: 68.4
GPT-5.5: 63.4
Legal Reasoning Chain
Contract dispute analysis requiring multi-party obligation tracking
Claude Sonnet 4.6: 70.7
Claude Opus 4.8: 70.3
Grok: 68.4
GPT-5.5: 63.4
Mathematical Proof
Prove divisibility properties using induction and modular arithmetic
Claude Sonnet 4.6: 70.7
Claude Opus 4.8: 70.3
Grok: 68.4
GPT-5.5: 63.4