Compare Models

Side-by-side performance comparison across all benchmark categories.

[Figure: Category Radar Comparison]

[Figure: Performance Over Time]

Detailed Score Comparison

| Metric          | Claude Opus 4.8 | Claude Sonnet 4.6 | GPT-5.5 | Grok   |
|-----------------|-----------------|-------------------|---------|--------|
| Composite       | 93.9            | 92.1              | 88.2    | 87.0   |
| Token Benchmark | 100.0           | 66.0              | 48.4    | 27.8   |
| Avg Tokens/Test | 7.6k            | 11.5k             | 15.7k   | 27.4k  |
| Total Tokens    | 228.3k          | 346.1k            | 471.7k  | 820.9k |
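The page does not state how the Token Benchmark score or Avg Tokens/Test are derived. The published figures are consistent with two simple relationships, sketched here in Python as an assumption rather than the benchmark's documented method: the token score as the lowest total token count divided by each model's total, scaled to 100, and the per-test average as the total divided by 30 tests. The test count of 30 is inferred from the ratios, and `token_scores`, `avg_tokens_per_test`, and `ASSUMED_TEST_COUNT` are hypothetical names.

```python
# Total tokens per model, in thousands, as listed above.
total_tokens_k = {
    "Claude Opus 4.8": 228.3,
    "Claude Sonnet 4.6": 346.1,
    "GPT-5.5": 471.7,
    "Grok": 820.9,
}

# Assumption: not stated on the page, inferred from Total Tokens / Avg Tokens/Test.
ASSUMED_TEST_COUNT = 30


def token_scores(totals):
    """Score each model relative to the most token-frugal one (assumed formula)."""
    best = min(totals.values())
    return {model: round(100 * best / total, 1) for model, total in totals.items()}


def avg_tokens_per_test(totals, n_tests):
    """Average tokens per test in thousands (assumed: total / number of tests)."""
    return {model: round(total / n_tests, 1) for model, total in totals.items()}


scores = token_scores(total_tokens_k)
averages = avg_tokens_per_test(total_tokens_k, ASSUMED_TEST_COUNT)

for model in total_tokens_k:
    print(f"{model}: token score {scores[model]}, avg {averages[model]}k tokens/test")
```

Under these assumptions the sketch reproduces the listed Token Benchmark scores (100.0, 66.0, 48.4, 27.8) and per-test averages (7.6k, 11.5k, 15.7k, 27.4k). A match on four models does not prove the benchmark uses exactly this formula.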

| Category                 | Claude Opus 4.8 | Claude Sonnet 4.6 | GPT-5.5      | Grok         |
|--------------------------|-----------------|-------------------|--------------|--------------|
| Token Efficiency         | 100.0 (BEST)    | 66.0              | 48.4         | 27.8         |
| Long Reasoning           | 70.3            | 70.7 (BEST)       | 63.4         | 68.4         |
| Coding Tasks             | 99.0            | 99.7 (BEST)       | 98.3         | 97.7         |
| Bug Fixes                | 96.0            | 96.5 (BEST)       | 92.3         | 96.1         |
| Feature Implementation   | 98.0            | 96.7              | 99.0 (BEST)  | 96.7         |
| Code Thoroughness        | 91.4            | 95.8 (BEST)       | 92.3         | 95.5         |
| Bug Introduction Rate    | 98.0            | 98.7 (BEST)       | 96.3         | 96.7         |
| Security Awareness       | 90.9            | 95.5              | 96.1 (BEST)  | 95.3         |
| Instruction Following    | 98.0            | 100.0 (BEST)      | 100.0 (BEST) | 100.0 (BEST) |
| Code Quality             | 97.0            | 97.7 (BEST)       | 93.7         | 97.7 (BEST)  |
| Performance & Efficiency | 94.3            | 95.3 (BEST)       | 90.3         | 85.0         |
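The page does not explain how the Composite score is calculated. The listed values are consistent with an unweighted mean of the eleven category scores above, which the following Python sketch checks. This is an inference from the published numbers, not a documented formula, and `composite` and `category_scores` are hypothetical names.

```python
from statistics import mean

# Category order: Token Efficiency, Long Reasoning, Coding Tasks, Bug Fixes,
# Feature Implementation, Code Thoroughness, Bug Introduction Rate,
# Security Awareness, Instruction Following, Code Quality,
# Performance & Efficiency.
category_scores = {
    "Claude Opus 4.8": [100.0, 70.3, 99.0, 96.0, 98.0, 91.4, 98.0, 90.9, 98.0, 97.0, 94.3],
    "Claude Sonnet 4.6": [66.0, 70.7, 99.7, 96.5, 96.7, 95.8, 98.7, 95.5, 100.0, 97.7, 95.3],
    "GPT-5.5": [48.4, 63.4, 98.3, 92.3, 99.0, 92.3, 96.3, 96.1, 100.0, 93.7, 90.3],
    "Grok": [27.8, 68.4, 97.7, 96.1, 96.7, 95.5, 96.7, 95.3, 100.0, 97.7, 85.0],
}


def composite(scores):
    """Unweighted mean of category scores, rounded to one decimal (assumed formula)."""
    return round(mean(scores), 1)


composites = {model: composite(scores) for model, scores in category_scores.items()}

for model, value in composites.items():
    print(f"{model}: {value}")
```

With equal weights the sketch yields 93.9, 92.1, 88.2, and 87.0, matching the Composite values listed for the four models. Because Token Efficiency counts as one of the eleven categories, the gap in token usage accounts for much of the spread between composites.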