Anthropic
Claude Opus 4.8
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank
#1
Token Benchmark
100.0
Lower burn, higher score
Total Tokens
228.3k
~7.6k/test
Category Breakdown
| # | Category | Score | Tests | Weight |
|---|---|---|---|---|
| 1 | Token Efficiency | 100.0 | 30 | 1.0x |
| 2 | Coding Tasks | 99.0 | 3 | 1.0x |
| 3 | Feature Implementation | 98.0 | 3 | 1.0x |
| 4 | Bug Introduction Rate | 98.0 | 3 | 1.0x |
| 5 | Instruction Following | 98.0 | 3 | 1.0x |
| 6 | Code Quality | 97.0 | 3 | 1.0x |
| 7 | Bug Fixes | 96.0 | 3 | 1.0x |
| 8 | Performance & Efficiency | 94.3 | 3 | 1.0x |
| 9 | Code Thoroughness | 91.4 | 3 | 1.0x |
| 10 | Security Awareness | 90.9 | 3 | 1.0x |
| 11 | Long Reasoning | 70.3 | 3 | 1.0x |
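The page does not state how the composite is derived from the category scores. As a minimal sketch, assuming it is a weighted mean of the eleven category scores using the Weight column, the calculation could look like this. The function name `weighted_mean` is hypothetical, and the result need not match the figure displayed on the page.

```python
# Category scores and weights copied from the Category Breakdown table.
category_scores = {
    "Token Efficiency": 100.0,
    "Coding Tasks": 99.0,
    "Feature Implementation": 98.0,
    "Bug Introduction Rate": 98.0,
    "Instruction Following": 98.0,
    "Code Quality": 97.0,
    "Bug Fixes": 96.0,
    "Performance & Efficiency": 94.3,
    "Code Thoroughness": 91.4,
    "Security Awareness": 90.9,
    "Long Reasoning": 70.3,
}
weights = {name: 1.0 for name in category_scores}  # every row shows 1.0x


def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Assumed aggregation: sum(score * weight) / sum(weight)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


composite = weighted_mean(category_scores, weights)
print(round(composite, 1))  # 93.9 under the assumption above
```

With equal weights this reduces to a plain average, so a low score in a single category (here Long Reasoning) pulls the result down by one eleventh of the gap.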
Individual Test Results
Token Efficiency: 100.0
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
7.6k
Total Tokens
228.3k
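The description above says the lowest average token burn scores 100 and heavier usage is "penalized proportionally", but it does not give the exact formula. One plausible reading, shown here purely as an assumption, is an inverse-proportional score: `100 * lowest_average / model_average`. The model names and the second model's figures are hypothetical.

```python
def average_tokens(total_tokens: float, tests: int) -> float:
    """Average token burn per successful test."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return total_tokens / tests


def token_efficiency_scores(avg_tokens: dict[str, float]) -> dict[str, float]:
    """Assumed scoring rule: the leanest model gets 100, and a model that
    burns k times as many tokens per test gets 100 / k."""
    best = min(avg_tokens.values())
    return {model: round(100.0 * best / avg, 1) for model, avg in avg_tokens.items()}


# 228.3k total tokens over 30 tests, as reported on this page.
this_model_avg = average_tokens(228_300, 30)  # 7610.0, shown as ~7.6k/test

scores = token_efficiency_scores({
    "this_model": this_model_avg,
    "hypothetical_other": 2 * this_model_avg,  # made-up comparison model
})
print(scores)  # {'this_model': 100.0, 'hypothetical_other': 50.0}
```

Under this reading the score is relative to the other models in the run, so a 100.0 here says only that no other model used fewer tokens per test.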
Long Reasoning: 70.3
Legal Reasoning Chain
6.1k tok · 50.7s · 88.0
Multi-step Logic Puzzle
7.9k tok · 1m 7s · 47.2
Mathematical Proof
5.1k tok · 11m 0s · 75.7
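The three Long Reasoning test scores above are consistent with the category score being their arithmetic mean. The page does not document this rule, so treat the following as a sketch under that assumption; `category_score` is a hypothetical helper name.

```python
from statistics import mean

# Individual test scores from the Long Reasoning group.
long_reasoning = {
    "Legal Reasoning Chain": 88.0,
    "Multi-step Logic Puzzle": 47.2,
    "Mathematical Proof": 75.7,
}


def category_score(test_scores: dict[str, float]) -> float:
    """Assumed rule: category score = mean of its test scores, to 1 decimal."""
    return round(mean(test_scores.values()), 1)


print(category_score(long_reasoning))  # 70.3
```

The same check reproduces the other category scores from their test groups, for example Bug Fixes: (90.0 + 98.0 + 100.0) / 3 = 96.0.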
Coding Tasks: 99.0
Graph Algorithm Implementation
5.7k tok · 5m 30s · 97.0
REST API Design
4.3k tok · 5m 40s · 100.0
Concurrent Data Pipeline
5.9k tok · 6m 36s · 100.0
Bug Fixes: 96.0
Off-by-One Boundary Fix
4.1k tok · 4m 34s · 90.0
Memory Leak Fix
4.9k tok · 2m 48s · 98.0
Race Condition Detection
4.1k tok · 4m 36s · 100.0
Feature Implementation: 98.0
OAuth2 Integration
4.5k tok · 3m 18s · 95.0
Search Autocomplete
7.2k tok · 4m 31s · 100.0
Webhook System
4.2k tok · 2m 59s · 99.0
Code Thoroughness: 91.4
Edge Case Coverage
17.5k tok · 6m 45s · 94.0
Error Path Completeness
21.1k tok · 7m 54s · 98.2
Test Suite Completeness
6.0k tok · 12m 6s · 82.0
Bug Introduction Rate: 98.0
Refactor Without Regression
4.6k tok · 6m 53s · 98.0
Merge Conflict Resolution
4.2k tok · 2m 31s · 97.0
Dependency Upgrade Safety
6.5k tok · 9m 30s · 99.0
Security Awareness: 90.9
SQL Injection Prevention
5.2k tok · 13m 48s · 85.0
XSS Mitigation
6.4k tok · 4m 2s · 100.0
Secret Management
11.3k tok · 3m 6s · 87.8
Instruction Following: 98.0
Structured Output Compliance
2.8k tok · 1m 57s · 100.0
Multi-step Instruction Chain
2.6k tok · 1m 41s · 100.0
Constraint Adherence
2.4k tok · 2m 9s · 94.0
Code Quality: 97.0
Idiomatic Python
15.4k tok · 2m 24s · 97.0
TypeScript Best Practices
18.6k tok · 5m 56s · 97.0
Clean Architecture Patterns
13.8k tok · 5m 38s · 97.0
Performance & Efficiency: 94.3
Algorithm Complexity
4.1k tok · 1m 9s · 98.0
Query Optimization
6.6k tok · 3m 45s · 88.0
Memory-efficient Processing
15.3k tok · 14m 47s · 97.0
Regression History (1)
Long Reasoning · minor
Score dropped 3.9%, from 66.5 to 63.9
Detected Jun 9, 2026
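The reported drop matches a relative change computed against the earlier score. A minimal sketch of that arithmetic follows; `percent_change` is a hypothetical helper, and the page does not say what threshold separates a "minor" regression from a more severe one.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (current - previous) / previous * 100.0


change = percent_change(66.5, 63.9)
print(f"{change:.1f}%")  # -3.9%
```

The absolute difference is 2.6 points; dividing by the earlier score of 66.5 gives the 3.9% figure.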
Outage History (2)
error
Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected
timeout
Started Jun 7, 4:01 AM · Ended Jun 7, 4:30 AM · checks affected