Category Weight: 1.0x

Bug Introduction Rate

Measures how often the model introduces new bugs while writing or modifying code. A lower bug rate is better; the rate is inverted for scoring, so a higher category score means fewer introduced bugs.
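
The page does not state how the bug rate is inverted. As a minimal sketch, assuming the raw metric is a bug rate between 0 and 1 and scores are reported on a 0 to 100 scale, a linear inversion could look like this. The function name `inverted_score` and the linear mapping are assumptions for illustration, not the benchmark's documented formula.

```python
def inverted_score(bug_rate: float) -> float:
    """Map a bug rate in [0, 1] to a 0-100 score where higher is better.

    Hypothetical linear inversion; the real scoring formula is not stated
    on the page.
    """
    if not 0.0 <= bug_rate <= 1.0:
        raise ValueError("bug_rate must be between 0 and 1")
    # A rate of 0 (no bugs introduced) maps to 100; a rate of 1 maps to 0.
    return round(100.0 * (1.0 - bug_rate), 1)
```

Under this assumption, a model that introduces a bug in a quarter of its tasks would score `inverted_score(0.25)`, i.e. 75.0.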

Best Score

0.0

Avg Score

0.0

Tests

Performance Over Time — All Models

Model Rankings

Rank	Model	Score	Tokens	vs. Best
1	Claude Sonnet 4.6	98.7	22.2k	BEST
2	Claude Opus 4.8	98.0	15.3k	-0.7 pts
3	Grok	96.7	86.4k	-2.0 pts
4	GPT-5.5	96.3	42.9k	-2.4 pts
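
The "vs. Best" column appears to be each model's score minus the top score. A short sketch that reproduces the column from the scores above; the helper name `vs_best` is hypothetical, and the rounding to one decimal place is an assumption based on how the values are displayed.

```python
# Category scores as listed in the rankings.
scores = {
    "Claude Sonnet 4.6": 98.7,
    "Claude Opus 4.8": 98.0,
    "Grok": 96.7,
    "GPT-5.5": 96.3,
}


def vs_best(scores: dict[str, float]) -> dict[str, str]:
    """Label each model with its gap to the highest score."""
    best = max(scores.values())
    labels = {}
    for model, score in scores.items():
        # Round to one decimal to avoid floating-point noise in the gap.
        delta = round(score - best, 1)
        labels[model] = "BEST" if delta == 0 else f"{delta:+.1f} pts"
    return labels
```

Calling `vs_best(scores)` yields `BEST`, `-0.7 pts`, `-2.0 pts`, and `-2.4 pts` for the four models, matching the listed values.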

Test Breakdown

Refactor Without Regression

Refactor a function without introducing new failures in existing tests

Model	Score
Claude Sonnet 4.6	98.7
Claude Opus 4.8	98.0
Grok	96.7
GPT-5.5	96.3
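
One way to make "without introducing new failures" concrete is to compare test outcomes before and after the refactor. The sketch below assumes results are available as a mapping from test name to pass/fail. The function `new_failures` and this result format are hypothetical, not the benchmark's actual harness.

```python
def new_failures(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Return tests that passed before the change but fail or are missing after it."""
    return {
        name
        for name, passed in before.items()
        # A test that disappears after the refactor is treated as a regression.
        if passed and not after.get(name, False)
    }


# Example: test_format regresses; test_legacy was already failing, so it is not counted.
before = {"test_parse": True, "test_format": True, "test_legacy": False}
after = {"test_parse": True, "test_format": False, "test_legacy": False}
regressions = new_failures(before, after)
```

Here `regressions` contains only `test_format`. Pre-existing failures are excluded so that the check measures bugs the change introduced rather than the overall state of the suite.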

Merge Conflict Resolution

Resolve merge conflicts without introducing semantic errors

Model	Score
Claude Sonnet 4.6	98.7
Claude Opus 4.8	98.0
Grok	96.7
GPT-5.5	96.3

Dependency Upgrade Safety

Upgrade a dependency and adapt code without breaking changes

Model	Score
Claude Sonnet 4.6	98.7
Claude Opus 4.8	98.0
Grok	96.7
GPT-5.5	96.3