Anthropic

Claude Sonnet 4.6

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

#2

Token Benchmark

66.0

Lower burn, higher score

Total Tokens

346.1k

~11.5k/test

Category Breakdown

3 tests · 1.0x weight
3 tests · 1.0x weight
3 tests · 1.0x weight
3 tests · 1.0x weight
3 tests · 1.0x weight
96.5
3 tests · 1.0x weight
3 tests · 1.0x weight
3 tests · 1.0x weight
3 tests · 1.0x weight
3 tests · 1.0x weight
30 tests · 1.0x weight

Individual Test Results

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

11.5k

Total Tokens

346.1k
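
The scoring rule described above can be sketched as follows. This is only an illustration under stated assumptions: the dashboard does not publish its exact formula, so "penalized proportionally" is read here as an inverse-proportional score (lowest average divided by the model's average, times 100). The function names and the model labels in the usage note are hypothetical. The per-test average assumes all 30 tests in the run were successful, which is consistent with 346.1k total tokens and roughly 11.5k per test.

```python
def average_tokens(total_tokens: float, successful_tests: int) -> float:
    """Average token burn per successful test."""
    if successful_tests <= 0:
        raise ValueError("successful_tests must be positive")
    return total_tokens / successful_tests


def token_efficiency_scores(avg_tokens_by_model: dict[str, float]) -> dict[str, float]:
    """Score each model against the lowest average token burn.

    ASSUMPTION: score = 100 * (lowest average) / (model average).
    The dashboard's real penalty curve may differ.
    """
    if not avg_tokens_by_model:
        return {}
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")
    lowest = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * lowest / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }
```

For example, `token_efficiency_scores({"model-a": 5000, "model-b": 10000})` gives the leaner hypothetical model 100.0 and the one using twice as many tokens 50.0 under this assumed rule.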

Legal Reasoning Chain: 4.1k tok · 5m 57s · score 73.0
Multi-step Logic Puzzle: 49.1k tok · 16m 53s · score 47.8
Mathematical Proof: 2.9k tok · 10m 21s · score 91.4
REST API Design: 1.1k tok · 2m 6s · score 100.0
Graph Algorithm Implementation: 1.9k tok · 2m 22s · score 99.0
Concurrent Data Pipeline: 3.7k tok · 13m 15s · score 100.0
Off-by-One Boundary Fix: 3.3k tok · 3m 52s · score 89.4
Race Condition Detection: 2.8k tok · 3m 38s · score 100.0
Memory Leak Fix: 3.2k tok · 9m 0s · score 100.0
OAuth2 Integration: 7.4k tok · 6m 21s · score 90.0
Search Autocomplete: 8.6k tok · 7m 27s · score 100.0
Webhook System: 4.8k tok · 6m 39s · score 100.0
Error Path Completeness: 27.4k tok · 9m 14s · score 99.4
Edge Case Coverage: 46.1k tok · 14m 31s · score 100.0
Test Suite Completeness: 11.3k tok · 5m 44s · score 88.0
Refactor Without Regression: 3.8k tok · 4m 26s · score 100.0
Merge Conflict Resolution: 4.8k tok · 2m 30s · score 97.0
Dependency Upgrade Safety: 13.7k tok · 4m 53s · score 99.0
SQL Injection Prevention: 3.6k tok · 1m 31s · score 100.0
XSS Mitigation: 5.6k tok · 5m 6s · score 100.0
Secret Management: 4.3k tok · 3m 9s · score 86.6
Structured Output Compliance: 908 tok · 2m 21s · score 100.0
Multi-step Instruction Chain: 1.3k tok · 51.5s · score 100.0
Constraint Adherence: 716 tok · 1m 18s · score 100.0
Idiomatic Python: 14.3k tok · 3m 23s · score 98.0
TypeScript Best Practices: 42.1k tok · 11m 2s · score 97.0
Clean Architecture Patterns: 30.8k tok · 9m 19s · score 98.0
Algorithm Complexity: 4.8k tok · 3m 26s · score 100.0
Query Optimization: 9.2k tok · 2m 33s · score 87.0
Memory-efficient Processing: 28.6k tok · 8m 46s · score 99.0

Regression History

2 regressions

Token Efficiency · major

Score dropped 24.8% from 98.8 to 74.3

Detected Jun 8, 2026

Token Efficiency · minor · resolved

Score dropped 4.7% from 100.0 to 95.3

Detected Jun 7, 2026 · Resolved Jun 7, 2026
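
The percentages in the regression entries above match a relative drop measured against the earlier score. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the dashboard's own severity thresholds for "major" and "minor" are not stated, so they are not modeled here.

```python
def relative_drop_percent(previous: float, current: float) -> float:
    """Percentage drop from `previous` to `current`, relative to `previous`."""
    if previous <= 0:
        raise ValueError("previous score must be positive")
    return (previous - current) / previous * 100.0


# Check against the two regression entries listed above.
major = round(relative_drop_percent(98.8, 74.3), 1)
minor = round(relative_drop_percent(100.0, 95.3), 1)
```

Rounded to one decimal place, the two entries come out as 24.8 and 4.7, which agrees with the reported figures.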

Outage History

1
error

Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected