Anthropic
Claude Sonnet 4.6
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank
#2
Token Benchmark
66.0
Lower burn, higher score
Total Tokens
346.1k
~11.5k/test
Category Breakdown
| # | Category | Score | Tests | Weight |
|---|---|---|---|---|
| 1 | Instruction Following | 100.0 | 3 | 1.0x |
| 2 | Coding Tasks | 99.7 | 3 | 1.0x |
| 3 | Bug Introduction Rate | 98.7 | 3 | 1.0x |
| 4 | Code Quality | 97.7 | 3 | 1.0x |
| 5 | Feature Implementation | 96.7 | 3 | 1.0x |
| 6 | Bug Fixes | 96.5 | 3 | 1.0x |
| 7 | Code Thoroughness | 95.8 | 3 | 1.0x |
| 8 | Security Awareness | 95.5 | 3 | 1.0x |
| 9 | Performance & Efficiency | 95.3 | 3 | 1.0x |
| 10 | Long Reasoning | 70.7 | 3 | 1.0x |
| 11 | Token Efficiency | 66.0 | 30 | 1.0x |
Individual Test Results
Token Efficiency: 66.0
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
11.5k
Total Tokens
346.1k
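The Token Efficiency description above gives the anchor (lowest average token burn scores 100) but not the exact penalty curve. The sketch below is one plausible reading, assuming the score scales as the ratio of the best average to the model's own average. The function name `token_efficiency_scores` and the model names and token counts are hypothetical, not taken from the benchmark.

```python
def token_efficiency_scores(avg_tokens_by_model: dict[str, float]) -> dict[str, float]:
    """Score each model on a 0-100 scale from its average tokens per test.

    ASSUMPTION: the benchmark's real formula is not published on this page.
    Here the leanest model gets 100 and every other model gets
    100 * (best average / own average).
    """
    if not avg_tokens_by_model:
        return {}
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")

    best = min(avg_tokens_by_model.values())
    return {
        name: round(100.0 * best / avg, 1)
        for name, avg in avg_tokens_by_model.items()
    }


# Made-up inputs for illustration only.
scores = token_efficiency_scores({"model_a": 8_000, "model_b": 16_000})
print(scores)  # {'model_a': 100.0, 'model_b': 50.0}
```

Under this assumed ratio rule, a model that burns twice as many tokens as the leader scores 50. Other penalty curves (for example, linear in the token difference) would also fit the description.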
Long Reasoning: 70.7
Legal Reasoning Chain
4.1k tok · 5m 57s · score 73.0
Multi-step Logic Puzzle
49.1k tok · 16m 53s · score 47.8
Mathematical Proof
2.9k tok · 10m 21s · score 91.4
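The page does not state how a category score is derived from its tests, but the Long Reasoning figures are consistent with a plain arithmetic mean rounded to one decimal. The sketch below checks that reading; the helper name `category_score` is hypothetical, and the averaging rule is an inference from the displayed numbers, not a documented method.

```python
from statistics import mean


def category_score(test_scores: list[float]) -> float:
    """ASSUMED rule: category score = unweighted mean of its test scores."""
    return round(mean(test_scores), 1)


# Scores as displayed for the three Long Reasoning tests.
long_reasoning = {
    "Legal Reasoning Chain": 73.0,
    "Multi-step Logic Puzzle": 47.8,
    "Mathematical Proof": 91.4,
}

print(category_score(list(long_reasoning.values())))  # 70.7
```

The result matches the 70.7 shown for Long Reasoning. The Multi-step Logic Puzzle (47.8) accounts for most of the gap between this category and the others.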
Coding Tasks: 99.7
REST API Design
1.1k tok · 2m 6s · score 100.0
Graph Algorithm Implementation
1.9k tok · 2m 22s · score 99.0
Concurrent Data Pipeline
3.7k tok · 13m 15s · score 100.0
Bug Fixes: 96.5
Off-by-One Boundary Fix
3.3k tok · 3m 52s · score 89.4
Race Condition Detection
2.8k tok · 3m 38s · score 100.0
Memory Leak Fix
3.2k tok · 9m 0s · score 100.0
Feature Implementation: 96.7
OAuth2 Integration
7.4k tok · 6m 21s · score 90.0
Search Autocomplete
8.6k tok · 7m 27s · score 100.0
Webhook System
4.8k tok · 6m 39s · score 100.0
Code Thoroughness: 95.8
Error Path Completeness
27.4k tok · 9m 14s · score 99.4
Edge Case Coverage
46.1k tok · 14m 31s · score 100.0
Test Suite Completeness
11.3k tok · 5m 44s · score 88.0
Bug Introduction Rate: 98.7
Refactor Without Regression
3.8k tok · 4m 26s · score 100.0
Merge Conflict Resolution
4.8k tok · 2m 30s · score 97.0
Dependency Upgrade Safety
13.7k tok · 4m 53s · score 99.0
Security Awareness: 95.5
SQL Injection Prevention
3.6k tok · 1m 31s · score 100.0
XSS Mitigation
5.6k tok · 5m 6s · score 100.0
Secret Management
4.3k tok · 3m 9s · score 86.6
Instruction Following: 100.0
Structured Output Compliance
908 tok · 2m 21s · score 100.0
Multi-step Instruction Chain
1.3k tok · 51.5s · score 100.0
Constraint Adherence
716 tok · 1m 18s · score 100.0
Code Quality: 97.7
Idiomatic Python
14.3k tok · 3m 23s · score 98.0
TypeScript Best Practices
42.1k tok · 11m 2s · score 97.0
Clean Architecture Patterns
30.8k tok · 9m 19s · score 98.0
Performance & Efficiency: 95.3
Algorithm Complexity
4.8k tok · 3m 26s · score 100.0
Query Optimization
9.2k tok · 2m 33s · score 87.0
Memory-efficient Processing
28.6k tok · 8m 46s · score 99.0
Regression History (2)
Token Efficiency · major
Score dropped 24.8% from 98.8 to 74.3
Detected Jun 8, 2026
Token Efficiency · minor · resolved
Score dropped 4.7% from 100.0 to 95.3
Detected Jun 7, 2026 · Resolved Jun 7, 2026
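The percentages in the regression entries above are consistent with a drop measured relative to the earlier score. A minimal sketch of that arithmetic follows; the function name `relative_drop_percent` is hypothetical, and treating the figures as relative rather than absolute point drops is an inference from the displayed numbers.

```python
def relative_drop_percent(before: float, after: float) -> float:
    """Percentage decrease from `before` to `after`, relative to `before`."""
    if before <= 0:
        raise ValueError("the earlier score must be positive")
    return round((before - after) / before * 100.0, 1)


print(relative_drop_percent(98.8, 74.3))   # 24.8 (the "major" entry)
print(relative_drop_percent(100.0, 95.3))  # 4.7  (the "minor" entry)
```

For the major entry the absolute change is 24.5 points, so the displayed 24.8% only matches the relative reading.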
Outage History (1)
error
Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected