OpenAI
GPT-5.5
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank
#3
Token Benchmark
48.4
Lower burn, higher score
Total Tokens
471.7k
~15.7k/test
Category Radar
Historical Composite Score
Category Breakdown
| # | Category | Score | Tests | 7-Day Trend | Weight |
|---|---|---|---|---|---|
| 1 | Instruction Following | 100.0 | 3 | | 1.0x |
| 2 | Feature Implementation | 99.0 | 3 | | 1.0x |
| 3 | Coding Tasks | 98.3 | 3 | | 1.0x |
| 4 | Bug Introduction Rate | 96.3 | 3 | | 1.0x |
| 5 | Security Awareness | 96.1 | 3 | | 1.0x |
| 6 | Code Quality | 93.7 | 3 | | 1.0x |
| 7 | Bug Fixes | 92.3 | 3 | | 1.0x |
| 8 | Code Thoroughness | 92.3 | 3 | | 1.0x |
| 9 | Performance & Efficiency | 90.3 | 3 | | 1.0x |
| 10 | Long Reasoning | 63.4 | 3 | | 1.0x |
| 11 | Token Efficiency | 48.4 | 30 | | 1.0x |
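The category scores in this table are consistent with a plain average of the three per-test scores listed under Individual Test Results. The sketch below reproduces two of them. The `weighted_composite` helper is an assumption: the page shows a Weight column but does not state how the composite is aggregated, so a weighted mean is only one plausible reading.

```python
from statistics import fmean

# Per-test scores copied from the Individual Test Results on this page.
long_reasoning = [63.0, 79.9, 47.2]
coding_tasks = [95.0, 100.0, 100.0]


def category_score(test_scores):
    """Unweighted mean of the per-test scores, rounded to one decimal."""
    return round(fmean(test_scores), 1)


def weighted_composite(category_scores, weights):
    """Weighted mean of category scores.

    Hypothetical aggregation: the page does not confirm this formula.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(category_scores, weights)) / total_weight


print(category_score(long_reasoning))  # 63.4
print(category_score(coding_tasks))    # 98.3
```

With every weight at 1.0x, as in the table, the weighted mean reduces to a simple average of the category scores.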
Individual Test Results
Token Efficiency 48.4
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
15.7k
Total Tokens
471.7k
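One way to read "the lowest average token burn receives 100, and heavier token usage is penalized proportionally" is as a ratio against the leanest model. The sketch below assumes that reading. The function name `token_efficiency` and the exact formula are illustrative, since the page does not publish the scoring rule, and the leading model's average is not shown here, so the sample call uses made-up numbers.

```python
def token_efficiency(avg_tokens, best_avg_tokens):
    """Score 100 for the leanest model, scaled down in proportion to extra usage.

    Assumed formula: 100 * best_avg / avg. The page only describes the rule
    in words, so treat this as one plausible interpretation.
    """
    if avg_tokens <= 0 or best_avg_tokens <= 0:
        raise ValueError("token averages must be positive")
    return min(100.0, 100.0 * best_avg_tokens / avg_tokens)


# Average for this run, from the page: 471.7k total tokens over 30 tests.
avg = 471_700 / 30
print(round(avg))  # 15723, matching the ~15.7k/test shown above

# Hypothetical inputs: a model using twice the tokens of the leanest one.
print(token_efficiency(20_000, 10_000))  # 50.0
```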
Long Reasoning 63.4
Legal Reasoning Chain
12.8k tok · 22.5s · 63.0
Mathematical Proof
12.8k tok · 26.9s · 79.9
Multi-step Logic Puzzle
12.7k tok · 30.8s · 47.2
Coding Tasks 98.3
Graph Algorithm Implementation
12.7k tok · 25.9s · 95.0
Concurrent Data Pipeline
12.4k tok · 17.5s · 100.0
REST API Design
12.3k tok · 21.0s · 100.0
Bug Fixes 92.3
Off-by-One Boundary Fix
12.9k tok · 27.4s · 77.0
Race Condition Detection
12.4k tok · 29.9s · 100.0
Memory Leak Fix
12.8k tok · 23.1s · 100.0
Feature Implementation 99.0
OAuth2 Integration
13.0k tok · 32.4s · 97.0
Search Autocomplete
13.0k tok · 31.1s · 100.0
Webhook System
12.7k tok · 32.3s · 100.0
Code Thoroughness 92.3
Test Suite Completeness
13.7k tok · 40.3s · 87.0
Error Path Completeness
48.2k tok · 3m 52s · 98.8
Edge Case Coverage
35.4k tok · 7m 17s · 91.0
Bug Introduction Rate 96.3
Refactor Without Regression
14.7k tok · 56.0s · 94.0
Merge Conflict Resolution
13.5k tok · 32.7s · 97.0
Dependency Upgrade Safety
14.7k tok · 49.6s · 98.0
Security Awareness 96.1
SQL Injection Prevention
13.0k tok · 31.7s · 98.8
XSS Mitigation
18.7k tok · 2m 19s · 100.0
Secret Management
19.3k tok · 2m 21s · 89.6
Instruction Following 100.0
Structured Output Compliance
12.3k tok · 23.6s · 100.0
Multi-step Instruction Chain
12.0k tok · 18.6s · 100.0
Constraint Adherence
12.0k tok · 12.9s · 100.0
Code Quality 93.7
Idiomatic Python
13.8k tok · 42.9s · 96.0
TypeScript Best Practices
14.3k tok · 56.0s · 87.0
Clean Architecture Patterns
23.4k tok · 3m 41s · 98.0
Performance & Efficiency 90.3
Memory-efficient Processing
14.3k tok · 59.3s · 92.0
Query Optimization
13.2k tok · 33.7s · 82.0
Algorithm Complexity
12.8k tok · 26.9s · 97.0
Regression History (2)
Security Awareness (minor)
Score dropped 4.3%, from 92.3 to 88.3
Detected Jun 9, 2026
Long Reasoning (moderate)
Score dropped 5.7%, from 65.0 to 61.3
Detected Jun 7, 2026
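Both regression entries match a drop measured relative to the earlier score rather than in absolute points. A minimal sketch of that arithmetic follows. The helper name `relative_drop_pct` is illustrative, and the thresholds behind the "minor" and "moderate" labels are not stated on the page, so they are not modeled here.

```python
def relative_drop_pct(previous, current):
    """Percentage decrease from `previous` to `current`, rounded to one decimal."""
    if previous <= 0:
        raise ValueError("previous score must be positive")
    return round((previous - current) / previous * 100, 1)


# Score pairs copied from the two regression entries above.
print(relative_drop_pct(92.3, 88.3))  # 4.3 (Security Awareness)
print(relative_drop_pct(65.0, 61.3))  # 5.7 (Long Reasoning)
```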