Anthropic

Claude Sonnet 4.6

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

66.0

Lower burn, higher score

Total Tokens

346.1k

~11.5k/test

Category Breakdown

#	Category	Score	Tests	Weight
1	Instruction Following	100.0	3	1.0x
2	Coding Tasks	99.7	3	1.0x
3	Bug Introduction Rate	98.7	3	1.0x
4	Code Quality	97.7	3	1.0x
5	Feature Implementation	96.7	3	1.0x
6	Bug Fixes	96.5	3	1.0x
7	Code Thoroughness	95.8	3	1.0x
8	Security Awareness	95.5	3	1.0x
9	Performance & Efficiency	95.3	3	1.0x
10	Long Reasoning	70.7	3	1.0x
11	Token Efficiency	66.0	30	1.0x

Individual Test Results

Token Efficiency · 66.0

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

11.5k

Total Tokens

346.1k
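The proportional penalty described above can be read in more than one way. The sketch below assumes the simplest reading: a score of 100 scaled by the ratio of the lowest average token burn to the model's own average. The function names and the 8,000-token baseline in the usage lines are hypothetical; the page does not publish the exact formula or the most frugal model's average.

```python
def avg_tokens_per_test(total_tokens: float, num_tests: int) -> float:
    """Average token burn across the successful tests in a run."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return total_tokens / num_tests


def token_efficiency(avg_tokens: float, best_avg_tokens: float) -> float:
    """Assumed rule: the lowest-burn model gets 100, others scale down by ratio."""
    if avg_tokens <= 0 or best_avg_tokens <= 0:
        raise ValueError("token averages must be positive")
    return round(min(100.0, 100.0 * best_avg_tokens / avg_tokens), 1)


# Figures from this page: 346.1k total tokens over 30 tests.
avg = avg_tokens_per_test(346_100, 30)
print(round(avg / 1000, 1))  # 11.5 (thousand tokens per test)

# Hypothetical baseline of 8,000 tokens per test for the most frugal model.
print(token_efficiency(16_000, 8_000))  # 50.0
```

Under this assumed rule, doubling the average token burn relative to the best model halves the score.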

Long Reasoning · 70.7

Legal Reasoning Chain

4.1k tok · 5m 57s · 73.0

Multi-step Logic Puzzle

49.1k tok · 16m 53s · 47.8

Mathematical Proof

2.9k tok · 10m 21s · 91.4

Coding Tasks · 99.7

REST API Design

1.1k tok · 2m 6s · 100.0

Graph Algorithm Implementation

1.9k tok · 2m 22s · 99.0

Concurrent Data Pipeline

3.7k tok · 13m 15s · 100.0

Bug Fixes · 96.5

Off-by-One Boundary Fix

3.3k tok · 3m 52s · 89.4

Race Condition Detection

2.8k tok · 3m 38s · 100.0

Memory Leak Fix

3.2k tok · 9m 0s · 100.0

Feature Implementation · 96.7

OAuth2 Integration

7.4k tok · 6m 21s · 90.0

Search Autocomplete

8.6k tok · 7m 27s · 100.0

Webhook System

4.8k tok · 6m 39s · 100.0

Code Thoroughness · 95.8

Error Path Completeness

27.4k tok · 9m 14s · 99.4

Edge Case Coverage

46.1k tok · 14m 31s · 100.0

Test Suite Completeness

11.3k tok · 5m 44s · 88.0

Bug Introduction Rate · 98.7

Refactor Without Regression

3.8k tok · 4m 26s · 100.0

Merge Conflict Resolution

4.8k tok · 2m 30s · 97.0

Dependency Upgrade Safety

13.7k tok · 4m 53s · 99.0

Security Awareness · 95.5

SQL Injection Prevention

3.6k tok · 1m 31s · 100.0

XSS Mitigation

5.6k tok · 5m 6s · 100.0

Secret Management

4.3k tok · 3m 9s · 86.6

Instruction Following · 100.0

Structured Output Compliance

908 tok · 2m 21s · 100.0

Multi-step Instruction Chain

1.3k tok · 51.5s · 100.0

Constraint Adherence

716 tok · 1m 18s · 100.0

Code Quality · 97.7

Idiomatic Python

14.3k tok · 3m 23s · 98.0

TypeScript Best Practices

42.1k tok · 11m 2s · 97.0

Clean Architecture Patterns

30.8k tok · 9m 19s · 98.0

Performance & Efficiency · 95.3

Algorithm Complexity

4.8k tok · 3m 26s · 100.0

Query Optimization

9.2k tok · 2m 33s · 87.0

Memory-efficient Processing

28.6k tok · 8m 46s · 99.0
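The page does not state how a category score is derived from its individual tests. The figures listed here are consistent with an unweighted mean of the three test scores, rounded to one decimal place. The sketch below checks that reading for three categories; the `category_score` helper and the dictionary layout are illustrative assumptions, not the benchmark's own code.

```python
from statistics import mean


def category_score(test_scores: list[float]) -> float:
    """Assumed aggregation: unweighted mean of test scores, one decimal place."""
    if not test_scores:
        raise ValueError("at least one test score is required")
    return round(mean(test_scores), 1)


# Individual test scores copied from the results listed on this page.
results = {
    "Long Reasoning": [73.0, 47.8, 91.4],
    "Security Awareness": [100.0, 100.0, 86.6],
    "Performance & Efficiency": [100.0, 87.0, 99.0],
}

for name, scores in results.items():
    print(name, category_score(scores))
# Long Reasoning 70.7
# Security Awareness 95.5
# Performance & Efficiency 95.3
```

Read this way, the low Long Reasoning score comes mostly from the Multi-step Logic Puzzle result (47.8), which pulls the category average well below the other two tests.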

Regression History

Token Efficiency · major

Score dropped 24.8% from 98.8 to 74.3

Detected Jun 8, 2026

Token Efficiency · minor · resolved

Score dropped 4.7% from 100.0 to 95.3

Detected Jun 7, 2026 · Resolved Jun 7, 2026
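The percentages in the regression entries match a plain relative change against the earlier score. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the thresholds that separate a "major" from a "minor" regression are not given on the page, so they are not modeled.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent, one decimal place."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return round((after - before) / before * 100, 1)


print(relative_change_pct(98.8, 74.3))   # -24.8
print(relative_change_pct(100.0, 95.3))  # -4.7
```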

Outage History

error

Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected