Top Performing Model

Category	Claude Opus 4.8	Claude Sonnet 4.6	GPT-5.5	Grok
Token Efficiency	100.0	66.0	48.4	27.8
Long Reasoning	70.3	70.7	63.4	68.4
Coding Tasks	99.0	99.7	98.3	97.7
Bug Fixes	96.0	96.5	92.3	96.1
Feature Implementation	98.0	96.7	99.0	96.7
Code Thoroughness	91.4	95.8	92.3	95.5
Bug Introduction Rate	98.0	98.7	96.3	96.7
Security Awareness	90.9	95.5	96.1	95.3
Instruction Following	98.0	100.0	100.0	100.0
Code Quality	97.0	97.7	93.7	97.7
Performance & Efficiency	94.3	95.3	90.3	85.0
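
The composite scores reported in the run summary match an unweighted mean of the eleven category scores above, rounded to one decimal. The page does not state this rule; it is inferred from the numbers, so treat the sketch below as a consistency check and not as the dashboard's documented method. The names `scores` and `composite` are illustrative.

```python
from statistics import mean

# Category scores copied from the table above, in row order
# (Token Efficiency first, Performance & Efficiency last).
scores = {
    "Claude Opus 4.8":   [100.0, 70.3, 99.0, 96.0, 98.0, 91.4, 98.0, 90.9, 98.0, 97.0, 94.3],
    "Claude Sonnet 4.6": [66.0, 70.7, 99.7, 96.5, 96.7, 95.8, 98.7, 95.5, 100.0, 97.7, 95.3],
    "GPT-5.5":           [48.4, 63.4, 98.3, 92.3, 99.0, 92.3, 96.3, 96.1, 100.0, 93.7, 90.3],
    "Grok":              [27.8, 68.4, 97.7, 96.1, 96.7, 95.5, 96.7, 95.3, 100.0, 97.7, 85.0],
}


def composite(category_scores):
    """Unweighted mean rounded to one decimal (assumed aggregation rule)."""
    return round(mean(category_scores), 1)


composites = {model: composite(values) for model, values in scores.items()}

for model, value in composites.items():
    print(f"{model}: {value}")
```

Because every category carries equal weight under this assumption, a low Token Efficiency score lowers the composite as much as an equally low score in any coding category would.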

Latest Benchmark Run

Jun 10, 4:44 AM (daily)

Claude Opus 4.8

Composite benchmark summary

Composite: 93.9
Token Benchmark: 100.0
Total tokens: 228.3k (~7.6k/test)
Best category: Token Efficiency (100.0)
Worst category: Long Reasoning (70.3)

Claude Sonnet 4.6

Composite: 92.1
Token Benchmark: 66.0
Total tokens: 346.1k (~11.5k/test)
Best category: Instruction Following (100.0)
Worst category: Token Efficiency (66.0)

GPT-5.5

Composite: 88.2
Token Benchmark: 48.4
Total tokens: 471.7k (~15.7k/test)
Best category: Instruction Following (100.0)
Worst category: Token Efficiency (48.4)

Grok

Composite: 87.0
Token Benchmark: 27.8
Total tokens: 820.9k (~27.4k/test)
Best category: Instruction Following (100.0)
Worst category: Token Efficiency (27.8)

Model	Composite	Rank	Best Category	Worst Category	Total Tokens
Claude Opus 4.8	93.9	#1	Token Efficiency (100.0)	Long Reasoning (70.3)	228.3k
Claude Sonnet 4.6	92.1	#2	Instruction Following (100.0)	Token Efficiency (66.0)	346.1k
GPT-5.5	88.2	#3	Instruction Following (100.0)	Token Efficiency (48.4)	471.7k
Grok	87.0	#4	Instruction Following (100.0)	Token Efficiency (27.8)	820.9k
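
The Token Efficiency scores match the ratio of the lowest total token count to each model's total, scaled to 100. The per-test figures match a suite of 30 tests. Neither rule is stated on the page; both are inferred from the reported numbers, and `ASSUMED_TEST_COUNT` in particular is an assumption. A minimal sketch of that check:

```python
# Total tokens per model in thousands, copied from the summary above.
total_tokens_k = {
    "Claude Opus 4.8": 228.3,
    "Claude Sonnet 4.6": 346.1,
    "GPT-5.5": 471.7,
    "Grok": 820.9,
}

# Assumption: inferred from 228.3k total / ~7.6k per test; not stated in the source.
ASSUMED_TEST_COUNT = 30

# Assumed scoring rule: the most frugal model sets the 100-point baseline.
baseline_k = min(total_tokens_k.values())

token_scores = {
    model: round(100 * baseline_k / tokens, 1)
    for model, tokens in total_tokens_k.items()
}

tokens_per_test_k = {
    model: round(tokens / ASSUMED_TEST_COUNT, 1)
    for model, tokens in total_tokens_k.items()
}

for model in total_tokens_k:
    print(f"{model}: score {token_scores[model]}, ~{tokens_per_test_k[model]}k/test")
```

Under this rule the score is relative to the field, so a model's Token Efficiency would change if a more frugal model were added to the run, even with its own token usage unchanged.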