OpenAI

GPT-5.5

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

#3

Token Benchmark

48.4

Lower burn, higher score

Total Tokens

471.7k

~15.7k/test

Category Radar

Historical Composite Score

Category Breakdown

3 tests, 1.0x weight
3 tests, 1.0x weight
3 tests, 1.0x weight
3 tests, 1.0x weight
3 tests, 1.0x weight
3 tests, 1.0x weight
92.3
3 tests, 1.0x weight
3 tests, 1.0x weight
3 tests, 1.0x weight
3 tests, 1.0x weight
30 tests, 1.0x weight

Individual Test Results

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

15.7k

Total Tokens

471.7k
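
The scoring rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the page does not define "penalized proportionally", so the sketch assumes a simple inverse ratio (lowest average divided by the model's average, times 100). The function name and the model averages are hypothetical; only the 471.7k total and the 30-test count come from this page.

```python
def token_efficiency_scores(avg_tokens_by_model):
    """Score each model on a 0-100 scale from its average tokens per test.

    Assumption: the score is 100 * (lowest average) / (model average), so the
    most frugal model gets 100 and a model burning twice as much gets 50.
    The dashboard's exact formula is not documented on this page.
    """
    best = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * best / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Average tokens per test for this run: total tokens divided by test count.
avg_tokens_per_test = 471_700 / 30  # about 15.7k, matching the figure above

# Hypothetical averages for two made-up models, for illustration only.
scores = token_efficiency_scores({"model_a": 8_000, "model_b": 16_000})
print(scores)  # {'model_a': 100.0, 'model_b': 50.0}
```

Under this assumed formula, a score below 100 means some other model in the comparison finished the same tasks with a lower average token count.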

Legal Reasoning Chain: 12.8k tok, 22.5s, score 63.0
Mathematical Proof: 12.8k tok, 26.9s, score 79.9
Multi-step Logic Puzzle: 12.7k tok, 30.8s, score 47.2
Graph Algorithm Implementation: 12.7k tok, 25.9s, score 95.0
Concurrent Data Pipeline: 12.4k tok, 17.5s, score 100.0
REST API Design: 12.3k tok, 21.0s, score 100.0
Off-by-One Boundary Fix: 12.9k tok, 27.4s, score 77.0
Race Condition Detection: 12.4k tok, 29.9s, score 100.0
Memory Leak Fix: 12.8k tok, 23.1s, score 100.0
OAuth2 Integration: 13.0k tok, 32.4s, score 97.0
Search Autocomplete: 13.0k tok, 31.1s, score 100.0
Webhook System: 12.7k tok, 32.3s, score 100.0
Test Suite Completeness: 13.7k tok, 40.3s, score 87.0
Error Path Completeness: 48.2k tok, 3m 52s, score 98.8
Edge Case Coverage: 35.4k tok, 7m 17s, score 91.0
Refactor Without Regression: 14.7k tok, 56.0s, score 94.0
Merge Conflict Resolution: 13.5k tok, 32.7s, score 97.0
Dependency Upgrade Safety: 14.7k tok, 49.6s, score 98.0
SQL Injection Prevention: 13.0k tok, 31.7s, score 98.8
XSS Mitigation: 18.7k tok, 2m 19s, score 100.0
Secret Management: 19.3k tok, 2m 21s, score 89.6
Structured Output Compliance: 12.3k tok, 23.6s, score 100.0
Multi-step Instruction Chain: 12.0k tok, 18.6s, score 100.0
Constraint Adherence: 12.0k tok, 12.9s, score 100.0
Idiomatic Python: 13.8k tok, 42.9s, score 96.0
TypeScript Best Practices: 14.3k tok, 56.0s, score 87.0
Clean Architecture Patterns: 23.4k tok, 3m 41s, score 98.0
Memory-efficient Processing: 14.3k tok, 59.3s, score 92.0
Query Optimization: 13.2k tok, 33.7s, score 82.0
Algorithm Complexity: 12.8k tok, 26.9s, score 97.0

Regression History

2 regressions

Security Awareness (minor)

Score dropped 4.3% from 92.3 to 88.3

Detected Jun 9, 2026

Long Reasoning (moderate)

Score dropped 5.7% from 65.0 to 61.3

Detected Jun 7, 2026
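
The percentages in the regression entries are consistent with a relative change measured against the earlier score. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the page does not state how the "minor" and "moderate" labels are assigned, so no severity thresholds are modeled here.

```python
def relative_change_pct(before, after):
    """Return the change from `before` to `after` as a percentage of `before`.

    A negative result means the score went down.
    """
    return (after - before) / before * 100.0


# Score pairs taken from the two regression entries above.
security_change = round(relative_change_pct(92.3, 88.3), 1)
reasoning_change = round(relative_change_pct(65.0, 61.3), 1)

print(security_change)   # -4.3
print(reasoning_change)  # -5.7
```

Note that these are relative drops, not point differences: Security Awareness lost 4.0 points and Long Reasoning lost 3.7 points.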