Category weight: 1.0x
Bug Introduction Rate
Measures how often the model introduces new bugs while writing or modifying code. A lower rate is better; the rate is inverted for scoring, so a higher score means fewer introduced bugs.
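The page does not say how the raw rate is inverted into a score. One plausible mapping, assumed here purely for illustration, is score = 100 × (1 − bugs introduced / changes attempted), so a model that never introduces a bug scores 100. The function `inverted_score` and its parameters are hypothetical and not taken from the dashboard:

```python
def inverted_score(bugs_introduced: int, changes_attempted: int) -> float:
    """Map a bug-introduction rate (lower is better) to a 0-100 score (higher is better).

    Hypothetical formula for illustration; the dashboard's real scoring rule is not published.
    """
    if changes_attempted <= 0:
        raise ValueError("changes_attempted must be positive")
    rate = bugs_introduced / changes_attempted  # bugs per attempted change
    rate = min(max(rate, 0.0), 1.0)  # clamp: one change may introduce several bugs
    return round(100.0 * (1.0 - rate), 1)


# A model that introduced 3 bugs across 60 attempted changes:
print(inverted_score(3, 60))
```

Under this assumed mapping the example scores 95.0; clamping keeps the score within 0 to 100 even when the bug count exceeds the number of changes.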
Best score: 0.0
Average score: 0.0
Tests: 3
[Chart: Performance Over Time, All Models]
Test Breakdown
Refactor Without Regression
Refactor a function without introducing new failures in existing tests
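One way to check "no new failures" is to compare per-test pass/fail results from runs before and after the refactor, counting only tests that went from passing to failing so that pre-existing failures are not blamed on the model. This is a minimal sketch assuming results are available as a mapping from test name to a pass flag; `new_failures` is a hypothetical helper, not part of the benchmark:

```python
def new_failures(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Return tests that passed before the change but fail (or are missing) after it."""
    return {
        name
        for name, passed_before in before.items()
        # A test absent from `after` counts as failing: deleting a test can hide a regression.
        if passed_before and not after.get(name, False)
    }


before = {"test_parse": True, "test_format": True, "test_edge": False}
after = {"test_parse": True, "test_format": False, "test_edge": False}
print(new_failures(before, after))  # test_edge failed before, so it is not counted
```

Treating a vanished test as a failure is a design choice of this sketch; whether the benchmark does the same is not stated.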
Claude Sonnet 4.6: 98.7
Claude Opus 4.8: 98.0
Grok: 96.7
GPT-5.5: 96.3

Merge Conflict Resolution
Resolve merge conflicts without introducing semantic errors
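A first, purely syntactic check on a resolved file is that no Git conflict markers remain. This catches unfinished resolutions but not semantic errors, which still require building and running tests. The sketch is a heuristic scan; `unresolved_markers` is a hypothetical helper, and it assumes Git's default seven-character markers (a line consisting of exactly seven `=` characters for another reason would be a false positive):

```python
import re

# Git conflict hunks start lines with 7 '<', '=', '>' (and '|' with the diff3/zdiff3 styles),
# followed by a space and a label, or by the end of the line.
_MARKER = re.compile(r"^(<{7}|={7}|>{7}|\|{7})(?: |\r?$)", re.MULTILINE)


def unresolved_markers(text: str) -> list[int]:
    """Return 1-based line numbers that still look like conflict markers."""
    return [text.count("\n", 0, m.start()) + 1 for m in _MARKER.finditer(text)]


sample = "a = 1\n<<<<<<< HEAD\nb = 2\n=======\nb = 3\n>>>>>>> feature\n"
print(unresolved_markers(sample))
```

An empty list is necessary but not sufficient for a correct resolution.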
Claude Sonnet 4.6: 98.7
Claude Opus 4.8: 98.0
Grok: 96.7
GPT-5.5: 96.3

Dependency Upgrade Safety
Upgrade a dependency and adapt code without breaking changes
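For dependencies that follow Semantic Versioning, a major-version bump signals incompatible API changes, and within major version 0 anything may change. The sketch applies that rule as a quick risk flag before adapting code; it assumes plain `MAJOR.MINOR.PATCH` strings, ignores pre-release ordering, and the names `parse_semver` and `may_break_api` are hypothetical. Packages that do not follow SemVer can break on any release, so this does not replace running the test suite after the upgrade:

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', dropping build metadata (+...) and any pre-release tag (-...)."""
    core = version.split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def may_break_api(old: str, new: str) -> bool:
    """Flag upgrades that SemVer permits to contain incompatible API changes."""
    old_v, new_v = parse_semver(old), parse_semver(new)
    if old_v[0] == 0 or new_v[0] == 0:
        # 0.y.z is initial development: SemVer makes no stability promise.
        return old_v != new_v
    return new_v[0] != old_v[0]  # a different major version may break callers


print(may_break_api("1.4.2", "1.5.0"), may_break_api("1.5.0", "2.0.0"))
```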
Claude Sonnet 4.6: 98.7
Claude Opus 4.8: 98.0
Grok: 96.7
GPT-5.5: 96.3