Compare Models
Side-by-side performance comparison across all benchmark categories.
Category Radar Comparison
The radar visualization is shown on wider screens. On mobile, use the detailed comparison cards below for exact per-category, composite, and token-efficiency values without clipped labels.
Performance Over Time
Detailed Score Comparison
Composite: Claude Opus 4.8 93.9, Claude Sonnet 4.6 92.1, GPT-5.5 88.2, Grok 87.0
Token Benchmark: Claude Opus 4.8 100.0, Claude Sonnet 4.6 66.0, GPT-5.5 48.4, Grok 27.8
Avg Tokens/Test: Claude Opus 4.8 7.6k, Claude Sonnet 4.6 11.5k, GPT-5.5 15.7k, Grok 27.4k
Total Tokens: Claude Opus 4.8 228.3k, Claude Sonnet 4.6 346.1k, GPT-5.5 471.7k, Grok 820.9k
Best score per category (ties listed together):
Token Efficiency: Claude Opus 4.8 (100.0)
Long Reasoning: Claude Sonnet 4.6 (70.7)
Coding Tasks: Claude Sonnet 4.6 (99.7)
Bug Fixes: Claude Sonnet 4.6 (96.5)
Feature Implementation: GPT-5.5 (99.0)
Code Thoroughness: Claude Sonnet 4.6 (95.8)
Bug Introduction Rate: Claude Sonnet 4.6 (98.7)
Security Awareness: GPT-5.5 (96.1)
Instruction Following: Claude Sonnet 4.6, GPT-5.5, Grok (100.0)
Code Quality: Claude Sonnet 4.6, Grok (97.7)
Performance & Efficiency: Claude Sonnet 4.6 (95.3)

| Category | Claude Opus 4.8 | Claude Sonnet 4.6 | GPT-5.5 | Grok |
|---|---|---|---|---|
| Token Efficiency | 100.0 | 66.0 | 48.4 | 27.8 |
| Long Reasoning | 70.3 | 70.7 | 63.4 | 68.4 |
| Coding Tasks | 99.0 | 99.7 | 98.3 | 97.7 |
| Bug Fixes | 96.0 | 96.5 | 92.3 | 96.1 |
| Feature Implementation | 98.0 | 96.7 | 99.0 | 96.7 |
| Code Thoroughness | 91.4 | 95.8 | 92.3 | 95.5 |
| Bug Introduction Rate | 98.0 | 98.7 | 96.3 | 96.7 |
| Security Awareness | 90.9 | 95.5 | 96.1 | 95.3 |
| Instruction Following | 98.0 | 100.0 | 100.0 | 100.0 |
| Code Quality | 97.0 | 97.7 | 93.7 | 97.7 |
| Performance & Efficiency | 94.3 | 95.3 | 90.3 | 85.0 |
| Composite | 93.9 | 92.1 | 88.2 | 87.0 |
| Total Tokens | 228.3k | 346.1k | 471.7k | 820.9k |
| Avg Tokens / Test | 7.6k | 11.5k | 15.7k | 27.4k |
| Token Benchmark | 100.0 | 66.0 | 48.4 | 27.8 |
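
The page does not state how the Composite and Token Benchmark rows are derived. The figures in the table are, however, consistent with two simple rules: Composite matches the unweighted mean of the eleven category scores, and Token Benchmark matches 100 times the lowest total token count divided by each model's total. The sketch below checks both rules against the table values. The rules themselves are an assumption inferred from the numbers, and the names `composite` and `token_benchmark` are hypothetical helpers, not part of the benchmark.

```python
# Category scores copied from the table, in row order from
# Token Efficiency through Performance & Efficiency.
category_scores = {
    "Claude Opus 4.8": [100.0, 70.3, 99.0, 96.0, 98.0, 91.4, 98.0, 90.9, 98.0, 97.0, 94.3],
    "Claude Sonnet 4.6": [66.0, 70.7, 99.7, 96.5, 96.7, 95.8, 98.7, 95.5, 100.0, 97.7, 95.3],
    "GPT-5.5": [48.4, 63.4, 98.3, 92.3, 99.0, 92.3, 96.3, 96.1, 100.0, 93.7, 90.3],
    "Grok": [27.8, 68.4, 97.7, 96.1, 96.7, 95.5, 96.7, 95.3, 100.0, 97.7, 85.0],
}

# Total tokens in thousands, copied from the "Total Tokens" row.
total_tokens_k = {
    "Claude Opus 4.8": 228.3,
    "Claude Sonnet 4.6": 346.1,
    "GPT-5.5": 471.7,
    "Grok": 820.9,
}


def composite(scores):
    """Assumed rule: unweighted mean of the category scores, to one decimal."""
    return round(sum(scores) / len(scores), 1)


def token_benchmark(tokens):
    """Assumed rule: 100 * (fewest total tokens) / (model's total tokens)."""
    fewest = min(tokens.values())
    return {model: round(100 * fewest / used, 1) for model, used in tokens.items()}


composites = {model: composite(scores) for model, scores in category_scores.items()}
token_scores = token_benchmark(total_tokens_k)

for model in category_scores:
    print(f"{model}: composite={composites[model]}, token benchmark={token_scores[model]}")
```

Under these assumed rules the computed values reproduce the Composite and Token Benchmark rows of the table. Dividing each Total Tokens value by the matching Avg Tokens / Test value also gives roughly 30 in every column, which suggests about 30 tests per model, though the page does not give the test count.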