OpenAI

GPT-5.5

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

48.4

Lower burn, higher score

Total Tokens

471.7k

~15.7k/test

Category Radar (chart): per-category values are listed under Category Breakdown.

Historical Composite Score (chart)

Category Breakdown

#	Category	Score	Tests	Weight
1	Instruction Following	100.0	3	1.0x
2	Feature Implementation	99.0	3	1.0x
3	Coding Tasks	98.3	3	1.0x
4	Bug Introduction Rate	96.3	3	1.0x
5	Security Awareness	96.1	3	1.0x
6	Code Quality	93.7	3	1.0x
7	Bug Fixes	92.3	3	1.0x
8	Code Thoroughness	92.3	3	1.0x
9	Performance & Efficiency	90.3	3	1.0x
10	Long Reasoning	63.4	3	1.0x
11	Token Efficiency	48.4	30	1.0x
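The page does not state how the composite score is derived from the categories. As a minimal sketch, assuming it is a weight-averaged mean of the category scores in the table above (an assumption, not something the page confirms), the calculation would look like this. The `weighted_mean` helper is hypothetical.

```python
# Category rows copied from the table above: (name, score, weight).
categories = [
    ("Instruction Following", 100.0, 1.0),
    ("Feature Implementation", 99.0, 1.0),
    ("Coding Tasks", 98.3, 1.0),
    ("Bug Introduction Rate", 96.3, 1.0),
    ("Security Awareness", 96.1, 1.0),
    ("Code Quality", 93.7, 1.0),
    ("Bug Fixes", 92.3, 1.0),
    ("Code Thoroughness", 92.3, 1.0),
    ("Performance & Efficiency", 90.3, 1.0),
    ("Long Reasoning", 63.4, 1.0),
    ("Token Efficiency", 48.4, 1.0),
]


def weighted_mean(rows):
    """Assumed aggregation rule: sum(score * weight) / sum(weight)."""
    total_weight = sum(weight for _, _, weight in rows)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(score * weight for _, score, weight in rows) / total_weight


composite = weighted_mean(categories)
print(f"Weighted mean of category scores: {composite:.1f}")
```

Because every category carries a 1.0x weight, this reduces to a plain average; the weights would only matter if the benchmark assigned them unevenly.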

Individual Test Results

Token Efficiency: 48.4

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

15.7k

Total Tokens

471.7k
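The description above says the lowest average token burn scores 100 and heavier usage is penalized proportionally, but it does not give the formula. The sketch below assumes one plausible reading, an inverse-proportional ratio against the most frugal model; `token_efficiency` and its parameters are hypothetical names, and the real benchmark may use a different rule. The average tokens per test, by contrast, follows directly from the totals on this page.

```python
def token_efficiency(avg_tokens: float, best_avg_tokens: float) -> float:
    """Assumed rule: the lowest average burn maps to 100, and the score
    falls in inverse proportion as a model uses more tokens."""
    if avg_tokens <= 0 or best_avg_tokens <= 0:
        raise ValueError("token averages must be positive")
    return round(100.0 * min(1.0, best_avg_tokens / avg_tokens), 1)


# Figures reported on this page.
total_tokens = 471_700
num_tests = 30

avg_tokens = total_tokens / num_tests
print(f"Average: {avg_tokens / 1000:.1f}k tokens/test")

# Illustrative inputs only, not measured values from the benchmark.
print(token_efficiency(10_000, 10_000))  # the most frugal model itself
print(token_efficiency(20_000, 10_000))  # a model using twice as many tokens
```

Under this assumed rule, a model that burns twice the tokens of the leader would score half as many points.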

Long Reasoning: 63.4

Test	Tokens	Time	Score
Legal Reasoning Chain	12.8k	22.5s	63.0
Mathematical Proof	12.8k	26.9s	79.9
Multi-step Logic Puzzle	12.7k	30.8s	47.2

Coding Tasks: 98.3

Test	Tokens	Time	Score
Graph Algorithm Implementation	12.7k	25.9s	95.0
Concurrent Data Pipeline	12.4k	17.5s	100.0
REST API Design	12.3k	21.0s	100.0

Bug Fixes: 92.3

Test	Tokens	Time	Score
Off-by-One Boundary Fix	12.9k	27.4s	77.0
Race Condition Detection	12.4k	29.9s	100.0
Memory Leak Fix	12.8k	23.1s	100.0

Feature Implementation: 99.0

Test	Tokens	Time	Score
OAuth2 Integration	13.0k	32.4s	97.0
Search Autocomplete	13.0k	31.1s	100.0
Webhook System	12.7k	32.3s	100.0

Code Thoroughness: 92.3

Test	Tokens	Time	Score
Test Suite Completeness	13.7k	40.3s	87.0
Error Path Completeness	48.2k	3m 52s	98.8
Edge Case Coverage	35.4k	7m 17s	91.0

Bug Introduction Rate: 96.3

Test	Tokens	Time	Score
Refactor Without Regression	14.7k	56.0s	94.0
Merge Conflict Resolution	13.5k	32.7s	97.0
Dependency Upgrade Safety	14.7k	49.6s	98.0

Security Awareness: 96.1

Test	Tokens	Time	Score
SQL Injection Prevention	13.0k	31.7s	98.8
XSS Mitigation	18.7k	2m 19s	100.0
Secret Management	19.3k	2m 21s	89.6

Instruction Following: 100.0

Test	Tokens	Time	Score
Structured Output Compliance	12.3k	23.6s	100.0
Multi-step Instruction Chain	12.0k	18.6s	100.0
Constraint Adherence	12.0k	12.9s	100.0

Code Quality: 93.7

Test	Tokens	Time	Score
Idiomatic Python	13.8k	42.9s	96.0
TypeScript Best Practices	14.3k	56.0s	87.0
Clean Architecture Patterns	23.4k	3m 41s	98.0

Performance & Efficiency: 90.3

Test	Tokens	Time	Score
Memory-efficient Processing	14.3k	59.3s	92.0
Query Optimization	13.2k	33.7s	82.0
Algorithm Complexity	12.8k	26.9s	97.0
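The page does not say how a category score is built from its individual tests, but the reported numbers are consistent with a plain average of the three test scores, rounded to one decimal. The sketch below checks that reading against the figures listed above; the averaging rule is an inference from the data, not a documented formula.

```python
from statistics import mean

# Per-test scores and reported category scores, copied from this page.
test_scores = {
    "Long Reasoning": [63.0, 79.9, 47.2],
    "Coding Tasks": [95.0, 100.0, 100.0],
    "Bug Fixes": [77.0, 100.0, 100.0],
    "Feature Implementation": [97.0, 100.0, 100.0],
    "Code Thoroughness": [87.0, 98.8, 91.0],
    "Bug Introduction Rate": [94.0, 97.0, 98.0],
    "Security Awareness": [98.8, 100.0, 89.6],
    "Instruction Following": [100.0, 100.0, 100.0],
    "Code Quality": [96.0, 87.0, 98.0],
    "Performance & Efficiency": [92.0, 82.0, 97.0],
}
reported = {
    "Long Reasoning": 63.4,
    "Coding Tasks": 98.3,
    "Bug Fixes": 92.3,
    "Feature Implementation": 99.0,
    "Code Thoroughness": 92.3,
    "Bug Introduction Rate": 96.3,
    "Security Awareness": 96.1,
    "Instruction Following": 100.0,
    "Code Quality": 93.7,
    "Performance & Efficiency": 90.3,
}

# Assumed rule: category score = mean of its test scores, to one decimal.
computed = {name: round(mean(scores), 1) for name, scores in test_scores.items()}
mismatches = [name for name in reported if computed[name] != reported[name]]
print("Categories that differ from a plain average:", mismatches)
```

An empty mismatch list supports the averaging reading for these ten categories; Token Efficiency is excluded because it is computed across all 30 tests rather than from three scored tests.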

Regression History

Security Awareness (minor)

Score dropped 4.3%, from 92.3 to 88.3

Detected Jun 9, 2026

Long Reasoning (moderate)

Score dropped 5.7%, from 65.0 to 61.3

Detected Jun 7, 2026
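The percentages in the regression entries above match a relative change measured against the earlier score. A minimal sketch of that arithmetic follows; `percent_change` is a hypothetical helper, and the thresholds that separate "minor" from "moderate" are not given on the page, so they are not modeled here.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from the previous score, in percent.
    Negative values indicate a regression."""
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (current - previous) / previous * 100.0


# Score pairs taken from the regression entries above.
security = percent_change(92.3, 88.3)
reasoning = percent_change(65.0, 61.3)

print(f"Security Awareness: {security:.1f}%")
print(f"Long Reasoning: {reasoning:.1f}%")
```

Note that the change is relative, not in absolute points: Security Awareness lost 4.0 points, which is 4.3% of its earlier score of 92.3.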