Top Performing Model
Based on composite benchmark scores
Claude Opus 4.8
Leading today's benchmarks
Performance Timeline
Active Regressions (5)
Security Awareness dropped 4.3%, from 92.3 to 88.3
Detected Jun 9, 2026 · 7-day window
Long Reasoning dropped 3.9%, from 66.5 to 63.9
Detected Jun 9, 2026 · 7-day window
Token Efficiency dropped 24.8%, from 98.8 to 74.3
Detected Jun 8, 2026 · 7-day window
Long Reasoning dropped 5.7%, from 65.0 to 61.3
Detected Jun 7, 2026 · 7-day window
Token Efficiency dropped 78.5%, from 61.5 to 13.2
Detected Jun 6, 2026 · 7-day window
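
The percentages in the regression list are consistent with a relative change between the two scores, not a difference in points. The sketch below assumes that definition; the function name `relative_change` is illustrative and does not come from the dashboard. A negative result means a drop.

```python
def relative_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`, rounded to one decimal place."""
    return round((after - before) / before * 100, 1)


# (category, earlier score, later score), copied from the regression list
regressions = [
    ("Security Awareness", 92.3, 88.3),
    ("Long Reasoning", 66.5, 63.9),
    ("Token Efficiency", 98.8, 74.3),
    ("Long Reasoning", 65.0, 61.3),
    ("Token Efficiency", 61.5, 13.2),
]

for category, before, after in regressions:
    print(f"{category}: {relative_change(before, after)}%")
```

Under this assumption, a category that starts from a low score shows a larger percentage for the same point loss, which is worth keeping in mind when comparing entries.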
Category Performance Heatmap
Latest Benchmark Run
Composite benchmark summary
| Model | Composite | Token Benchmark | Total Tokens | Tokens per Test |
|---|---|---|---|---|
| Claude Opus 4.8 | 93.9 | 100.0 | 228.3k | ~7.6k |
| Claude Sonnet 4.6 | 92.1 | 66.0 | 346.1k | ~11.5k |
| GPT-5.5 | 88.2 | 48.4 | 471.7k | ~15.7k |
| Grok | 87.0 | 27.8 | 820.9k | ~27.4k |
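
The per-test token figures are consistent with each total being divided by the same number of tests. The page does not state that number, so the sketch below infers it from the reported values; `implied_test_count` is a hypothetical helper, not part of the dashboard.

```python
# (total tokens in thousands, reported tokens per test in thousands),
# listed in order of composite score from the latest benchmark run
runs = [
    (228.3, 7.6),
    (346.1, 11.5),
    (471.7, 15.7),
    (820.9, 27.4),
]


def implied_test_count(total_k: float, per_test_k: float) -> int:
    """Nearest whole number of tests implied by total / per-test tokens."""
    return round(total_k / per_test_k)


counts = [implied_test_count(total, per_test) for total, per_test in runs]
print(counts)
```

Because the per-test values are rounded on the page, the result is only an estimate of the suite size.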
| Model | Composite | Rank | Best Category | Worst Category | Tokens |
|---|---|---|---|---|---|
| Claude Opus 4.8 | 93.9 | #1 | Token Efficiency (100.0) | Long Reasoning (70.3) | 228.3k |
| Claude Sonnet 4.6 | 92.1 | #2 | Instruction Following (100.0) | Token Efficiency (66.0) | 346.1k |
| GPT-5.5 | 88.2 | #3 | Instruction Following (100.0) | Token Efficiency (48.4) | 471.7k |
| Grok | 87.0 | #4 | Instruction Following (100.0) | Token Efficiency (27.8) | 820.9k |