Category Weight: 1.0x

Code Thoroughness

Evaluates completeness of generated code: edge case handling, input validation, error paths, and test coverage.

Best Score: 0.0
Avg Score: 0.0
Tests: 3

Performance Over Time — All Models

Model Rankings

1. Claude Sonnet 4.6
Category score: 95.8 (best)
Tokens: 84.8k
Total: 84.8k

2. Grok
Category score: 95.5 (-0.3 pts)
Tokens: 158.2k
Total: 158.2k

3. GPT-5.5
Category score: 92.3 (-3.5 pts)
Tokens: 97.2k
Total: 97.2k

4. Claude Opus 4.8
Category score: 91.4 (-4.4 pts)
Tokens: 44.6k
Total: 44.6k

Test Breakdown

Edge Case Coverage

Generate code handling null, empty, unicode, and overflow inputs

Claude Sonnet 4.6: 95.8
Grok: 95.5
GPT-5.5: 92.3
Claude Opus 4.8: 91.4

Error Path Completeness

Ensure all failure modes have proper error handling and logging

Claude Sonnet 4.6: 95.8
Grok: 95.5
GPT-5.5: 92.3
Claude Opus 4.8: 91.4

Test Suite Completeness

Generate tests covering happy path, edge cases, and integration

Claude Sonnet 4.6: 95.8
Grok: 95.5
GPT-5.5: 92.3
Claude Opus 4.8: 91.4