Category Weight: 1.0x

Token Efficiency

Measures how efficiently a model solves tasks by penalizing higher token consumption. Lower usage earns a higher score.

Performance Over Time — All Models

Model Rankings

Rank	Model	Score	Avg/Test	Total	vs. Best
1	Claude Opus 4.8	100.0	7.6k/test	228.3k	BEST
2	Claude Sonnet 4.6	66.0	11.5k/test	346.1k	-34.0 pts
3	GPT-5.5	48.4	15.7k/test	471.7k	-51.6 pts
4	Grok	27.8	27.4k/test	820.9k	-72.2 pts

Benchmark Construction

How It Scores

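We total prompt and completion tokens across all successful benchmark tasks and divide by the number of successful tasks to get an average per task. The model with the lowest average in that run scores 100. Every other model is scaled down in proportion to its extra usage, so higher usage means a lower benchmark score. The published figures are consistent with score = 100 × (lowest average ÷ the model's average): for example, 7.6k ÷ 15.7k ≈ 0.484, which matches GPT-5.5's 48.4.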
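
The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code. It assumes the scaling is inversely proportional to average tokens per successful task, which is consistent with the published numbers. The names `RunUsage` and `token_efficiency_scores` are hypothetical. The count of 30 successful tasks per model is also an assumption, inferred from Total ÷ Avg/Test, since the page does not state it.

```python
from dataclasses import dataclass


@dataclass
class RunUsage:
    model: str
    total_tokens: float    # prompt + completion tokens over successful tasks
    successful_tasks: int  # ASSUMPTION: not published; inferred from Total / Avg


def token_efficiency_scores(runs):
    """Return {model: (avg_tokens_per_task, score, gap_vs_best)}.

    ASSUMPTION: score = 100 * (lowest average / this model's average).
    """
    averages = {r.model: r.total_tokens / r.successful_tasks for r in runs}
    best_avg = min(averages.values())
    results = {}
    for model, avg in averages.items():
        ratio = best_avg / avg  # 1.0 for the lowest-usage model
        score = 100.0 * ratio
        results[model] = (avg, round(score, 1), round(score - 100.0, 1))
    return results


# Totals taken from the ranking table; task counts are assumed.
runs = [
    RunUsage("Claude Opus 4.8", 228_300, 30),
    RunUsage("Claude Sonnet 4.6", 346_100, 30),
    RunUsage("GPT-5.5", 471_700, 30),
    RunUsage("Grok", 820_900, 30),
]

for model, (avg, score, gap) in token_efficiency_scores(runs).items():
    print(f"{model}: {avg / 1000:.1f}k/test, score {score}, {gap:+.1f} pts")
```

Under these assumptions the sketch gives the same scores as the table (100.0, 66.0, 48.4, 27.8). Because the score depends only on the ratio of averages, a model's score can change between runs even when its own usage is unchanged, if the lowest-usage model's average shifts.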

What To Read In This View

The ranking table above is the benchmark itself. Use Avg/Test to compare per-task burn and Total to spot larger absolute usage across the whole run.

Historical scores show whether a model is becoming more or less token-efficient over time, independent of raw quality improvements in the other ten categories.