Category Weight: 1.0x

Bug Introduction Rate

Measures how often the model introduces new bugs while writing or modifying code. A lower raw rate is better; the rate is inverted for scoring, so a higher score means fewer introduced bugs.
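
The page does not state how the rate is inverted. As a minimal sketch, assuming the raw metric is a fraction between 0 and 1 and the score is a linear inversion onto a 0 to 100 scale, the mapping could look like this. The function name `inverted_score` and its parameters are hypothetical, not taken from the dashboard.

```python
def inverted_score(bug_rate: float, scale: float = 100.0) -> float:
    """Map a lower-is-better rate in [0, 1] to a higher-is-better score.

    Assumption: a linear inversion. The dashboard's actual formula is not
    given on the page, so this is illustrative only.
    """
    if not 0.0 <= bug_rate <= 1.0:
        raise ValueError("bug_rate must be between 0 and 1")
    # A rate of 0 (no new bugs) maps to the top of the scale.
    return round(scale * (1.0 - bug_rate), 1)
```

Under this assumed mapping, a run that introduces bugs in a quarter of its tasks (`bug_rate=0.25`) would score 75.0, and a run with no new bugs would score 100.0.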

Best Score: 0.0
Avg Score: 0.0
Tests: 3

Performance Over Time — All Models

Model Rankings

Rank  Model              Category score  vs. best  Tokens  Total
1     Claude Sonnet 4.6  98.7            BEST      22.2k   22.2k
2     Claude Opus 4.8    98.0            -0.7 pts  15.3k   15.3k
3     Grok               96.7            -2.0 pts  86.4k   86.4k
4     GPT-5.5            96.3            -2.4 pts  42.9k   42.9k

Test Breakdown

Refactor Without Regression

Refactor a function without introducing new failures in existing tests

Claude Sonnet 4.6: 98.7
Claude Opus 4.8: 98.0
Grok: 96.7
GPT-5.5: 96.3
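
For the Refactor Without Regression test, "new failures in existing tests" can be read as tests that passed before the refactor and no longer pass afterwards. The sketch below shows one way to compute that set from two result maps. It is an assumption about how such a check might work, not the dashboard's actual harness, and the name `new_failures` and the result-map shape are hypothetical.

```python
from __future__ import annotations


def new_failures(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Return the names of tests that passed before a change but not after.

    Each map goes from test name to True (passed) or False (failed).
    Assumption: a test missing from `after` counts as a failure, since the
    refactor may have broken or removed it.
    """
    return {
        name
        for name, passed in before.items()
        if passed and not after.get(name, False)
    }


# Hypothetical results: test_c regresses, test_b was already failing.
before = {"test_a": True, "test_b": False, "test_c": True}
after = {"test_a": True, "test_b": False, "test_c": False}
regressions = new_failures(before, after)
```

Tests that were already failing before the change are deliberately excluded, so only regressions attributable to the refactor are counted.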

Merge Conflict Resolution

Resolve merge conflicts without introducing semantic errors

Claude Sonnet 4.6: 98.7
Claude Opus 4.8: 98.0
Grok: 96.7
GPT-5.5: 96.3

Dependency Upgrade Safety

Upgrade a dependency and adapt code without breaking changes

Claude Sonnet 4.6: 98.7
Claude Opus 4.8: 98.0
Grok: 96.7
GPT-5.5: 96.3