Category Weight: 1.0x

Token Efficiency

Measures how efficiently a model solves tasks by penalizing higher token consumption. Lower usage earns a higher score.

Best Score: 0.0

Avg Score: 0.0

Measured Models: 4

Performance Over Time — All Models

Model Rankings

Rank  Model              Score  vs. Best   Avg/Test    Total
1     Claude Opus 4.8    100.0  BEST       7.6k/test   228.3k
2     Claude Sonnet 4.6  66.0   -34.0 pts  11.5k/test  346.1k
3     GPT-5.5            48.4   -51.6 pts  15.7k/test  471.7k
4     Grok               27.8   -72.2 pts  27.4k/test  820.9k

Benchmark Construction

How It Scores

We total prompt and completion tokens across all successful benchmark tasks, compute the average per successful task, then assign a score of 100 to the model with the lowest average in that run. Every other model is scaled down relative to that baseline, so higher usage means a lower benchmark score.
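A minimal sketch of the scoring described above, in Python. The exact scaling rule is not stated here, so the formula below (100 times the lowest average divided by the model's average) is an assumption; it is roughly consistent with the published rankings but may differ from the real implementation. The `ModelRun` type, the `efficiency_scores` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    total_tokens: int      # prompt + completion tokens over successful tasks
    successful_tasks: int  # number of tasks the model solved in this run


def efficiency_scores(runs):
    """Return {model name: score}, with 100 for the lowest average usage.

    Assumed rule: score = 100 * (lowest average / this model's average).
    Models with no successful tasks have no average and are skipped.
    """
    averages = {
        run.name: run.total_tokens / run.successful_tasks
        for run in runs
        if run.successful_tasks > 0
    }
    if not averages:
        return {}
    lowest = min(averages.values())
    return {name: round(100.0 * lowest / avg, 1) for name, avg in averages.items()}


# Hypothetical data, not taken from the dashboard.
runs = [
    ModelRun("model-a", total_tokens=120_000, successful_tasks=30),
    ModelRun("model-b", total_tokens=240_000, successful_tasks=30),
]
print(efficiency_scores(runs))  # {'model-a': 100.0, 'model-b': 50.0}
```

Under this assumed rule, a model that uses twice as many tokens per successful task as the leader scores 50.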

What To Read In This View

The ranking table above is the benchmark itself. Use Avg/Test to compare per-task burn and Total to spot larger absolute usage across the whole run.
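One way to read Avg/Test is as a multiple of the leader's usage. The sketch below takes the Avg/Test figures from the ranking table (in thousands of tokens, already rounded by the dashboard) and expresses each as a ratio to the lowest value. The `relative_burn` helper is hypothetical and is not part of the benchmark.

```python
def relative_burn(avg_tokens, baseline_avg_tokens):
    """Return how many times more tokens per test a model uses than the baseline."""
    if baseline_avg_tokens <= 0:
        raise ValueError("baseline average must be positive")
    return avg_tokens / baseline_avg_tokens


# Avg/Test values from the ranking table, in thousands of tokens per test.
avg_per_test_k = {
    "Claude Opus 4.8": 7.6,
    "Claude Sonnet 4.6": 11.5,
    "GPT-5.5": 15.7,
    "Grok": 27.4,
}

baseline = min(avg_per_test_k.values())
for name, avg in sorted(avg_per_test_k.items(), key=lambda item: item[1]):
    print(f"{name}: {relative_burn(avg, baseline):.2f}x the lowest Avg/Test")
```

Because the inputs are rounded, the ratios are approximate; the last-ranked model comes out at roughly 3.6 times the leader's per-test usage.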

Historical scores show whether a model is becoming more or less token-efficient over time, independent of raw quality improvements in the other ten categories.