xAI

Grok

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

27.8

Lower burn, higher score

Total Tokens

820.9k

~27.4k/test

Category Radar

Historical Composite Score

Category Breakdown

#	Category	Score	Tests	Weight
1	Instruction Following	100.0	3	1.0x
2	Coding Tasks	97.7	3	1.0x
3	Code Quality	97.7	3	1.0x
4	Feature Implementation	96.7	3	1.0x
5	Bug Introduction Rate	96.7	3	1.0x
6	Bug Fixes	96.1	3	1.0x
7	Code Thoroughness	95.5	3	1.0x
8	Security Awareness	95.3	3	1.0x
9	Performance & Efficiency	85.0	3	1.0x
10	Long Reasoning	68.4	3	1.0x
11	Token Efficiency	27.8	30	1.0x
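The page does not say how the composite score is derived from the category scores. One plausible reading, given that every category carries a 1.0x weight, is a weighted arithmetic mean. The sketch below assumes that formula (it is an assumption, not a documented rule) and uses the scores and weights from the table above; the names `categories` and `weighted_mean` are illustrative.

```python
# Category scores and weights copied from the table above: name -> (score, weight).
categories = {
    "Instruction Following": (100.0, 1.0),
    "Coding Tasks": (97.7, 1.0),
    "Code Quality": (97.7, 1.0),
    "Feature Implementation": (96.7, 1.0),
    "Bug Introduction Rate": (96.7, 1.0),
    "Bug Fixes": (96.1, 1.0),
    "Code Thoroughness": (95.5, 1.0),
    "Security Awareness": (95.3, 1.0),
    "Performance & Efficiency": (85.0, 1.0),
    "Long Reasoning": (68.4, 1.0),
    "Token Efficiency": (27.8, 1.0),
}


def weighted_mean(scores):
    """Weighted arithmetic mean of (score, weight) pairs.

    ASSUMPTION: the benchmark may combine categories differently.
    """
    total_weight = sum(weight for _, weight in scores.values())
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for score, weight in scores.values()) / total_weight


composite = weighted_mean(categories)
print(f"{composite:.1f}/100")
```

Under this assumption the low Token Efficiency score pulls the mean down by several points, since it counts as much as any other category.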

Individual Test Results

Token Efficiency: 27.8

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

27.4k

Total Tokens

820.9k
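The description above gives the rule only in words. A minimal sketch of one way to read it, assuming the score scales inversely with average tokens per test (score = 100 × lowest average / this model's average), follows. The exact penalty curve is not stated on the page, so `token_efficiency` is a hypothetical helper, and the lowest average it implies is back-calculated from this page's numbers rather than reported anywhere.

```python
def token_efficiency(avg_tokens, lowest_avg_tokens):
    """ASSUMED rule: the lowest average burn scores 100; others scale inversely."""
    if avg_tokens <= 0 or lowest_avg_tokens <= 0:
        raise ValueError("token averages must be positive")
    return min(100.0, 100.0 * lowest_avg_tokens / avg_tokens)


# Figures from this page (in thousands of tokens).
total_tokens_k = 820.9
tests = 30
reported_score = 27.8

avg_k = total_tokens_k / tests  # average burn per test

# Under the assumed rule, the best model's average can be back-calculated.
implied_lowest_k = reported_score * avg_k / 100.0

print(f"avg per test: {avg_k:.1f}k")
print(f"implied lowest average: {implied_lowest_k:.1f}k")
print(f"score check: {token_efficiency(avg_k, implied_lowest_k):.1f}")
```

If the benchmark uses a different curve (for example a linear penalty above a threshold), the implied lowest average would differ, so treat that figure as illustrative only.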

Long Reasoning: 68.4

Test	Tokens	Time	Score
Mathematical Proof	14.1k	24.9s	79.3
Legal Reasoning Chain	18.1k	50.0s	73.0
Multi-step Logic Puzzle	35.7k	3m 10s	53.0
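The category score here is consistent with an unweighted mean of the three test scores. That is an observation from the numbers on this page, not a documented rule; the short check below uses the Long Reasoning results, and the same arithmetic reproduces the other category scores to one decimal place.

```python
from statistics import mean

# Per-test scores for the Long Reasoning category, as listed above.
long_reasoning = {
    "Mathematical Proof": 79.3,
    "Legal Reasoning Chain": 73.0,
    "Multi-step Logic Puzzle": 53.0,
}

# ASSUMPTION: category score = simple average of its test scores.
category_score = mean(long_reasoning.values())
print(round(category_score, 1))  # 68.4
```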

Coding Tasks: 97.7

Test	Tokens	Time	Score
Graph Algorithm Implementation	17.6k	32.8s	95.0
REST API Design	19.2k	49.6s	100.0
Concurrent Data Pipeline	16.8k	24.3s	98.0

Bug Fixes: 96.1

Test	Tokens	Time	Score
Off-by-One Boundary Fix	18.2k	48.6s	88.2
Memory Leak Fix	19.4k	1m 7s	100.0
Race Condition Detection	18.2k	48.8s	100.0

Feature Implementation: 96.7

Test	Tokens	Time	Score
OAuth2 Integration	18.7k	46.6s	90.0
Webhook System	15.3k	28.8s	100.0
Search Autocomplete	20.0k	26.6s	100.0

Code Thoroughness: 95.5

Test	Tokens	Time	Score
Edge Case Coverage	45.0k	1m 37s	97.0
Test Suite Completeness	45.3k	1m 51s	92.0
Error Path Completeness	67.9k	2m 57s	97.6

Bug Introduction Rate: 96.7

Test	Tokens	Time	Score
Refactor Without Regression	31.8k	1m 22s	97.0
Merge Conflict Resolution	29.6k	1m 5s	97.0
Dependency Upgrade Safety	24.9k	1m 27s	96.0

Security Awareness: 95.3

Test	Tokens	Time	Score
SQL Injection Prevention	23.4k	57.0s	98.8
XSS Mitigation	19.8k	1m 12s	100.0
Secret Management	35.2k	1m 38s	87.2

Instruction Following: 100.0

Test	Tokens	Time	Score
Structured Output Compliance	29.8k	31.5s	100.0
Multi-step Instruction Chain	14.0k	19.1s	100.0
Constraint Adherence	13.4k	18.3s	100.0

Code Quality: 97.7

Test	Tokens	Time	Score
Clean Architecture Patterns	37.3k	1m 17s	100.0
TypeScript Best Practices	56.2k	3m 11s	97.0
Idiomatic Python	30.9k	1m 33s	96.0

Performance & Efficiency: 85.0

Test	Tokens	Time	Score
Memory-efficient Processing	36.5k	2m 13s	96.0
Algorithm Complexity	18.4k	35.5s	97.0
Query Optimization	30.1k	2m 38s	62.0

Regression History

Token Efficiency (major)

Score dropped 78.5%, from 61.5 to 13.2

Detected Jun 6, 2026
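The percentage in the regression entry matches a standard relative-change calculation on the two scores. A small sketch of that arithmetic follows; `relative_change_pct` is an illustrative helper, and the page does not state what threshold makes a regression "major".

```python
def relative_change_pct(old, new):
    """Relative change from old to new, as a percentage of old."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# Scores from the regression entry above.
change = relative_change_pct(61.5, 13.2)
print(f"{change:.1f}%")  # -78.5%
```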