Category Weight: 1.0x

Code Thoroughness

Evaluates completeness of generated code: edge case handling, input validation, error paths, and test coverage.

Best Score

95.8

Avg Score

93.8

Tests

Performance Over Time — All Models

Model Rankings

Rank	Model	Score	Tokens	vs. Best
1	Claude Sonnet 4.6	95.8	84.8k	BEST
2	Grok	95.5	158.2k	-0.3 pts
3	GPT-5.5	92.3	97.2k	-3.5 pts
4	Claude Opus 4.8	91.4	44.6k	-4.4 pts
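
The page does not say how the "vs. Best" column is computed. The listed values are consistent with subtracting the top score from each model's score, and the following minimal sketch reproduces them under that assumption. The function name `vs_best` is hypothetical.

```python
# Scores copied from the rankings table.
scores = {
    "Claude Sonnet 4.6": 95.8,
    "Grok": 95.5,
    "GPT-5.5": 92.3,
    "Claude Opus 4.8": 91.4,
}


def vs_best(scores):
    """Return each model's gap to the top score, rounded to one decimal.

    Assumption: the gap is a plain difference in points (score - best).
    """
    best = max(scores.values())
    return {model: round(score - best, 1) for model, score in scores.items()}


gaps = vs_best(scores)
for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f} pts")
```

The top-ranked model has a gap of zero, which the table shows as BEST.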

Test Breakdown

Edge Case Coverage

Generate code handling null, empty, Unicode, and overflow inputs

Model	Score
Claude Sonnet 4.6	95.8
Grok	95.5
GPT-5.5	92.3
Claude Opus 4.8	91.4
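
To make the four input classes named in this test concrete, here is a small illustrative sketch. It is not taken from the benchmark: the function `parse_quantity` and the 32-bit limit are assumptions chosen only to show one guard per class (null, empty, Unicode, overflow).

```python
import unicodedata

INT32_MAX = 2**31 - 1  # assumed upper bound, chosen for illustration only


def parse_quantity(raw):
    """Parse a non-negative integer quantity from untrusted text."""
    # Null input: reject explicitly instead of failing later with a TypeError.
    if raw is None:
        raise ValueError("quantity is missing")
    # Unicode input: NFKC folds compatibility forms such as full-width digits.
    text = unicodedata.normalize("NFKC", raw).strip()
    # Empty input: covers "" and whitespace-only strings.
    if not text:
        raise ValueError("quantity is empty")
    # isdecimal() accepts only decimal digits, which int() can parse.
    if not text.isdecimal():
        raise ValueError(f"quantity is not a non-negative integer: {raw!r}")
    value = int(text)
    # Overflow input: Python ints are unbounded, so the limit is checked by hand.
    if value > INT32_MAX:
        raise ValueError("quantity exceeds the assumed 32-bit limit")
    return value
```

Raising a single exception type keeps the caller's error handling simple; a real implementation might prefer distinct exception classes per failure.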

Error Path Completeness

Ensure all failure modes have proper error handling and logging

Model	Score
Claude Sonnet 4.6	95.8
Grok	95.5
GPT-5.5	92.3
Claude Opus 4.8	91.4
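
As a rough illustration of what "every failure mode handled and logged" can look like, the sketch below loads a JSON file and gives each known failure its own branch and log message. The function `load_config`, the logger name, and the fall-back-to-default policy are all assumptions for this example, not part of the benchmark.

```python
import json
import logging

logger = logging.getLogger("config_loader")  # hypothetical logger name


def load_config(path, default=None):
    """Load a JSON object from path; log and return default on known failures."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except FileNotFoundError:
        logger.warning("config file not found: %s", path)
        return default
    except PermissionError:
        logger.error("config file not readable: %s", path)
        return default
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        logger.error("config file %s is not valid UTF-8 JSON: %s", path, exc)
        return default
    # Valid JSON can still have the wrong shape, so that path is handled too.
    if not isinstance(data, dict):
        logger.error("config file %s must contain a JSON object", path)
        return default
    return data
```

Unexpected exceptions are deliberately left uncaught here so that they surface rather than being silently replaced by the default.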

Test Suite Completeness

Generate tests covering happy path, edge cases, and integration

Model	Score
Claude Sonnet 4.6	95.8
Grok	95.5
GPT-5.5	92.3
Claude Opus 4.8	91.4
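
The three kinds of test named in this description can be sketched with Python's standard `unittest` module. Everything here is hypothetical: `normalize_tags` and `render_tag_line` are made-up functions that exist only so the tests have something to exercise.

```python
import unittest


def normalize_tags(tags):
    """Hypothetical function under test: trim, lowercase, de-duplicate tags."""
    if tags is None:
        return []
    seen, result = set(), []
    for tag in tags:
        cleaned = tag.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def render_tag_line(tags):
    """Hypothetical caller, used for the integration-style test."""
    return ", ".join(normalize_tags(tags))


class NormalizeTagsTests(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(normalize_tags(["Python", "Go"]), ["python", "go"])

    def test_edge_cases(self):
        self.assertEqual(normalize_tags(None), [])
        self.assertEqual(normalize_tags([]), [])
        self.assertEqual(normalize_tags(["  ", "Go", "go "]), ["go"])

    def test_integration_with_renderer(self):
        self.assertEqual(render_tag_line([" Rust", "rust", "OCaml"]), "rust, ocaml")
```

Saved as a module, these tests can be run with `python -m unittest`.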