Category Weight: 1.0x

Long Reasoning

Multi-step logic puzzles, extended chain-of-thought, and complex analytical reasoning tasks requiring sustained coherence over many steps.

Best Score: 0.0
Avg Score: 0.0
Tests: 3

Performance Over Time, All Models (chart not shown)

Model Rankings

1. Claude Sonnet 4.6: category score 70.7 (best); tokens 56.0k; total 56.0k
2. Claude Opus 4.8: category score 70.3 (-0.4 pts); tokens 19.1k; total 19.1k
3. Grok: category score 68.4 (-2.3 pts); tokens 67.9k; total 67.9k
4. GPT-5.5: category score 63.4 (-7.3 pts); tokens 38.3k; total 38.3k

Test Breakdown

Multi-step Logic Puzzle

Complex optimization with 8+ constraints across multiple variables

Claude Sonnet 4.6: 70.7
Claude Opus 4.8: 70.3
Grok: 68.4
GPT-5.5: 63.4

Legal Reasoning Chain

Contract dispute analysis requiring multi-party obligation tracking

Claude Sonnet 4.6: 70.7
Claude Opus 4.8: 70.3
Grok: 68.4
GPT-5.5: 63.4

Mathematical Proof

Prove divisibility properties using induction and modular arithmetic

Claude Sonnet 4.6: 70.7
Claude Opus 4.8: 70.3
Grok: 68.4
GPT-5.5: 63.4