Anthropic

Claude Opus 4.8

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

100.0

Lower burn, higher score

Total Tokens

228.3k

~7.6k/test

[Charts not reproduced here: Category Radar; Historical Composite Score]

Category Breakdown

#	Category	Score	Tests	Weight
1	Token Efficiency	100.0	30	1.0x
2	Coding Tasks	99.0	3	1.0x
3	Feature Implementation	98.0	3	1.0x
4	Bug Introduction Rate	98.0	3	1.0x
5	Instruction Following	98.0	3	1.0x
6	Code Quality	97.0	3	1.0x
7	Bug Fixes	96.0	3	1.0x
8	Performance & Efficiency	94.3	3	1.0x
9	Code Thoroughness	91.4	3	1.0x
10	Security Awareness	90.9	3	1.0x
11	Long Reasoning	70.3	3	1.0x
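
The page does not state how the composite score is derived from these categories. As a rough sketch, assuming it is a weighted mean of the category scores (an assumption, not something the source confirms), the table above could be aggregated as follows. `composite_score` is a hypothetical helper name.

```python
# Rows copied from the category table above: (name, score, weight).
CATEGORIES = [
    ("Token Efficiency", 100.0, 1.0),
    ("Coding Tasks", 99.0, 1.0),
    ("Feature Implementation", 98.0, 1.0),
    ("Bug Introduction Rate", 98.0, 1.0),
    ("Instruction Following", 98.0, 1.0),
    ("Code Quality", 97.0, 1.0),
    ("Bug Fixes", 96.0, 1.0),
    ("Performance & Efficiency", 94.3, 1.0),
    ("Code Thoroughness", 91.4, 1.0),
    ("Security Awareness", 90.9, 1.0),
    ("Long Reasoning", 70.3, 1.0),
]


def composite_score(rows):
    """Weighted mean of category scores (ASSUMED aggregation rule)."""
    total_weight = sum(weight for _, _, weight in rows)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for _, score, weight in rows) / total_weight


composite = composite_score(CATEGORIES)
print(round(composite, 1))
```

Because every category carries a 1.0x weight, this reduces to a plain average of the eleven scores. A different weighting or a rank-based rule would give a different figure.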

Individual Test Results

Token Efficiency · 100.0

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

7.6k

Total Tokens

228.3k
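
The scoring rule described above can be sketched in a few lines. The exact penalty formula is not given on this page. This example assumes the score is 100 times the ratio of the lowest average token burn to the model's own average, which is only one reading of "penalized proportionally". The function name and the comparison model are hypothetical.

```python
def token_efficiency_score(avg_tokens, best_avg_tokens):
    """ASSUMED rule: the leanest model scores 100, others scale by best/own."""
    if avg_tokens <= 0 or best_avg_tokens <= 0:
        raise ValueError("average token counts must be positive")
    return min(100.0, 100.0 * best_avg_tokens / avg_tokens)


# Figures from this page: 228.3k total tokens over 30 tests.
avg = 228_300 / 30  # about 7.6k tokens per test

print(token_efficiency_score(avg, avg))      # 100.0 when this model is the leanest
print(token_efficiency_score(2 * avg, avg))  # 50.0 for a hypothetical model using twice the tokens
```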

Long Reasoning · 70.3

Legal Reasoning Chain

6.1k tok · 50.7s · 88.0

Multi-step Logic Puzzle

7.9k tok · 1m 7s · 47.2

Mathematical Proof

5.1k tok · 11m 0s · 75.7
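
Each category score on this page matches the arithmetic mean of its three test scores, rounded to one decimal. The page does not state this rule, so treat it as an observation from the numbers rather than a documented method. A small check using the Long Reasoning scores above (`category_score` is a hypothetical helper):

```python
from statistics import mean


def category_score(test_scores):
    """Mean of the individual test scores, rounded to one decimal place."""
    return round(mean(test_scores), 1)


long_reasoning = [88.0, 47.2, 75.7]
print(category_score(long_reasoning))  # 70.3
```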

Coding Tasks · 99.0

Graph Algorithm Implementation

5.7k tok · 5m 30s · 97.0

REST API Design

4.3k tok · 5m 40s · 100.0

Concurrent Data Pipeline

5.9k tok · 6m 36s · 100.0

Bug Fixes · 96.0

Off-by-One Boundary Fix

4.1k tok · 4m 34s · 90.0

Memory Leak Fix

4.9k tok · 2m 48s · 98.0

Race Condition Detection

4.1k tok · 4m 36s · 100.0

Feature Implementation · 98.0

OAuth2 Integration

4.5k tok · 3m 18s · 95.0

Search Autocomplete

7.2k tok · 4m 31s · 100.0

Webhook System

4.2k tok · 2m 59s · 99.0

Code Thoroughness · 91.4

Edge Case Coverage

17.5k tok · 6m 45s · 94.0

Error Path Completeness

21.1k tok · 7m 54s · 98.2

Test Suite Completeness

6.0k tok · 12m 6s · 82.0

Bug Introduction Rate · 98.0

Refactor Without Regression

4.6k tok · 6m 53s · 98.0

Merge Conflict Resolution

4.2k tok · 2m 31s · 97.0

Dependency Upgrade Safety

6.5k tok · 9m 30s · 99.0

Security Awareness · 90.9

SQL Injection Prevention

5.2k tok · 13m 48s · 85.0

XSS Mitigation

6.4k tok · 4m 2s · 100.0

Secret Management

11.3k tok · 3m 6s · 87.8

Instruction Following · 98.0

Structured Output Compliance

2.8k tok · 1m 57s · 100.0

Multi-step Instruction Chain

2.6k tok · 1m 41s · 100.0

Constraint Adherence

2.4k tok · 2m 9s · 94.0

Code Quality · 97.0

Idiomatic Python

15.4k tok · 2m 24s · 97.0

TypeScript Best Practices

18.6k tok · 5m 56s · 97.0

Clean Architecture Patterns

13.8k tok · 5m 38s · 97.0

Performance & Efficiency · 94.3

Algorithm Complexity

4.1k tok · 1m 9s · 98.0

Query Optimization

6.6k tok · 3m 45s · 88.0

Memory-efficient Processing

15.3k tok · 14m 47s · 97.0

Regression History

Long Reasoning · minor

Score dropped 3.9%, from 66.5 to 63.9

Detected Jun 9, 2026
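
The percentage in the regression entry above is consistent with a simple relative change between the two scores. A minimal sketch, with a hypothetical helper name; the threshold that labels a regression as "minor" is not given on this page and is not modeled here:

```python
def percent_change(old_score, new_score):
    """Relative change from old_score to new_score, in percent."""
    if old_score == 0:
        raise ValueError("old_score must be non-zero")
    return (new_score - old_score) / old_score * 100.0


print(round(percent_change(66.5, 63.9), 1))  # -3.9
```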

Outage History

error

Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected

timeout

Started Jun 7, 4:01 AM · Ended Jun 7, 4:30 AM · checks affected