xAI
Grok
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank
#4
Token Benchmark
27.8
Lower burn, higher score
Total Tokens
820.9k
~27.4k/test
Category Radar
(Radar chart not reproduced; the Category Breakdown below lists the same values.)
Historical Composite Score
(Chart not reproduced.)
Category Breakdown
| # | Category | Score | Tests | 7-Day Trend | Weight |
|---|---|---|---|---|---|
| 1 | Instruction Following | 100.0 | 3 | | 1.0x |
| 2 | Coding Tasks | 97.7 | 3 | | 1.0x |
| 3 | Code Quality | 97.7 | 3 | | 1.0x |
| 4 | Feature Implementation | 96.7 | 3 | | 1.0x |
| 5 | Bug Introduction Rate | 96.7 | 3 | | 1.0x |
| 6 | Bug Fixes | 96.1 | 3 | | 1.0x |
| 7 | Code Thoroughness | 95.5 | 3 | | 1.0x |
| 8 | Security Awareness | 95.3 | 3 | | 1.0x |
| 9 | Performance & Efficiency | 85.0 | 3 | | 1.0x |
| 10 | Long Reasoning | 68.4 | 3 | | 1.0x |
| 11 | Token Efficiency | 27.8 | 30 | | 1.0x |
Individual Test Results
Token Efficiency: 27.8
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
27.4k
Total Tokens
820.9k
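The Token Efficiency description above can be sketched in a few lines. This is only an illustration: the dashboard does not publish its formula, so the inverse-proportional rule `100 * lowest_avg / model_avg` is an assumption, and the model names and token counts in the example are hypothetical.

```python
def avg_tokens_per_test(total_tokens: float, test_count: int) -> float:
    """Average token burn per test for one model."""
    if test_count <= 0:
        raise ValueError("test_count must be positive")
    return total_tokens / test_count


def token_efficiency_scores(avg_tokens_by_model: dict[str, float]) -> dict[str, float]:
    """Score each model relative to the lowest average token burn.

    ASSUMPTION: "penalized proportionally" is read as
    score = 100 * lowest_avg / model_avg. The real formula may differ.
    """
    if not avg_tokens_by_model:
        raise ValueError("at least one model is required")
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")
    lowest = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * lowest / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Figures from this report: 820.9k total tokens over 30 tests.
grok_avg = avg_tokens_per_test(820_900, 30)
grok_avg_k = round(grok_avg / 1000, 1)  # matches the reported ~27.4k/test

# Hypothetical two-model comparison (names and numbers are made up).
example_scores = token_efficiency_scores({"model_a": 10_000, "model_b": 25_000})
```

Under this reading, the leanest model always scores 100 and a model that burns 2.5 times as many tokens scores 40.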
Long Reasoning: 68.4

| Test | Tokens | Time | Score |
|---|---|---|---|
| Mathematical Proof | 14.1k | 24.9s | 79.3 |
| Legal Reasoning Chain | 18.1k | 50.0s | 73.0 |
| Multi-step Logic Puzzle | 35.7k | 3m 10s | 53.0 |
Coding Tasks: 97.7

| Test | Tokens | Time | Score |
|---|---|---|---|
| Graph Algorithm Implementation | 17.6k | 32.8s | 95.0 |
| REST API Design | 19.2k | 49.6s | 100.0 |
| Concurrent Data Pipeline | 16.8k | 24.3s | 98.0 |
Bug Fixes: 96.1

| Test | Tokens | Time | Score |
|---|---|---|---|
| Off-by-One Boundary Fix | 18.2k | 48.6s | 88.2 |
| Memory Leak Fix | 19.4k | 1m 7s | 100.0 |
| Race Condition Detection | 18.2k | 48.8s | 100.0 |
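Feature Implementation: 96.7

| Test | Tokens | Time | Score |
|---|---|---|---|
| OAuth2 Integration | 18.7k | 46.6s | 90.0 |
| Webhook System | 15.3k | 28.8s | 100.0 |
| Search Autocomplete | 20.0k | 26.6s | 100.0 |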
Code Thoroughness: 95.5

| Test | Tokens | Time | Score |
|---|---|---|---|
| Edge Case Coverage | 45.0k | 1m 37s | 97.0 |
| Test Suite Completeness | 45.3k | 1m 51s | 92.0 |
| Error Path Completeness | 67.9k | 2m 57s | 97.6 |
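Bug Introduction Rate: 96.7

| Test | Tokens | Time | Score |
|---|---|---|---|
| Refactor Without Regression | 31.8k | 1m 22s | 97.0 |
| Merge Conflict Resolution | 29.6k | 1m 5s | 97.0 |
| Dependency Upgrade Safety | 24.9k | 1m 27s | 96.0 |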
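Security Awareness: 95.3

| Test | Tokens | Time | Score |
|---|---|---|---|
| SQL Injection Prevention | 23.4k | 57.0s | 98.8 |
| XSS Mitigation | 19.8k | 1m 12s | 100.0 |
| Secret Management | 35.2k | 1m 38s | 87.2 |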
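Instruction Following: 100.0

| Test | Tokens | Time | Score |
|---|---|---|---|
| Structured Output Compliance | 29.8k | 31.5s | 100.0 |
| Multi-step Instruction Chain | 14.0k | 19.1s | 100.0 |
| Constraint Adherence | 13.4k | 18.3s | 100.0 |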
Code Quality: 97.7

| Test | Tokens | Time | Score |
|---|---|---|---|
| Clean Architecture Patterns | 37.3k | 1m 17s | 100.0 |
| TypeScript Best Practices | 56.2k | 3m 11s | 97.0 |
| Idiomatic Python | 30.9k | 1m 33s | 96.0 |
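Performance & Efficiency: 85.0

| Test | Tokens | Time | Score |
|---|---|---|---|
| Memory-efficient Processing | 36.5k | 2m 13s | 96.0 |
| Algorithm Complexity | 18.4k | 35.5s | 97.0 |
| Query Optimization | 30.1k | 2m 38s | 62.0 |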
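The category scores in this report are consistent with an unweighted mean of the three test scores in each category, rounded to one decimal place. That relationship is inferred from the numbers, not stated by the dashboard, so the check below is a sketch under that assumption. The function name is illustrative.

```python
from statistics import mean


def category_score(test_scores: list[float]) -> float:
    """Unweighted mean of a category's test scores, rounded to one decimal.

    ASSUMPTION: inferred from the reported numbers, not a documented formula.
    """
    if not test_scores:
        raise ValueError("a category needs at least one test score")
    return round(mean(test_scores), 1)


# Test scores copied from the individual results in this report.
long_reasoning = category_score([79.3, 73.0, 53.0])      # reported: 68.4
coding_tasks = category_score([95.0, 100.0, 98.0])       # reported: 97.7
performance = category_score([96.0, 97.0, 62.0])         # reported: 85.0
```

A single low test moves the category a long way with only three tests: Query Optimization at 62.0 accounts for most of the gap between Performance & Efficiency and the other coding categories.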
Regression History
1. Token Efficiency (major)
Score dropped 78.5%, from 61.5 to 13.2
Detected Jun 6, 2026
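The percentage in the regression entry is a plain relative change between the previous and current score. A minimal sketch of that arithmetic follows. The function name is illustrative, and the sketch makes no claim about the threshold the dashboard uses to label a regression "major".

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent (negative = drop)."""
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (current - previous) / previous * 100.0


# Scores from the Token Efficiency regression entry above.
change = percent_change(61.5, 13.2)
formatted = f"{change:.1f}%"  # "-78.5%"
```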