Updated February 26, 2026

AI Model Benchmarks

Compare the latest AI models across coding, reasoning, math, and knowledge benchmarks. Data sourced from official releases and independent evaluations.

10 models compared · 8 benchmarks · 5 categories · Last updated: Feb 2026

Overall Rankings

Averaged across all benchmarks (normalized scores)

1. Gemini 3.1 Pro (Google) · Math, Multimodal, Long context, Reasoning · avg score 81.7
2. Claude Opus 4.6 (Anthropic) · Long context, Coding, Analysis, Safety · avg score 73.6
3. GPT-5.3 Codex (OpenAI) · Agentic tasks, Coding, Tool use · avg score 72.8
4. Claude Opus 4.5 (Anthropic) · Creative writing, Analysis, Reasoning, Safety · avg score 69.7
5. GPT-5.2 (OpenAI) · General purpose, Reasoning, Versatility · avg score 69.2
6. Claude Sonnet 4.6 (Anthropic) · Coding, Speed, Value, Computer use · avg score 69.0
7. DeepSeek V4 (DeepSeek) · Cost efficiency, Open weights, Math · avg score 68.0
8. Grok 4 (xAI) · Real-time data, Reasoning, Uncensored · avg score 67.3
9. Gemini 3 Flash (Google) · Speed, Cost, Multimodal · avg score 59.8
10. Llama 4 405B (Meta) · Open source, Self-hosting, Customization · avg score 58.2
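
The "avg score" figures above are averages over normalized benchmark scores. The exact normalization and weighting are not specified on this page, so the following is a minimal sketch of one common convention (min-max rescaling per benchmark, then an unweighted mean), illustrated with a handful of scores copied from the tables below; it will not reproduce the site's exact figures.

    # Sketch: min-max normalize each benchmark across models, then average.
    # The site's exact normalization/weighting isn't published; this is one
    # common convention, shown with two benchmarks and three models only.
    scores = {
        "Gemini 3.1 Pro":  {"SWE-Bench Verified": 74.2, "ARC-AGI-2": 77.1},
        "Claude Opus 4.6": {"SWE-Bench Verified": 79.2, "ARC-AGI-2": 48.2},
        "Llama 4 405B":    {"SWE-Bench Verified": 65.2, "ARC-AGI-2": 32.8},
    }
    benchmarks = {b for per_model in scores.values() for b in per_model}

    def normalized_average(model: str) -> float:
        """Average of per-benchmark scores rescaled to 0..1 within this model set."""
        normed = []
        for b in benchmarks:
            values = [per_model[b] for per_model in scores.values()]
            lo, hi = min(values), max(values)
            normed.append((scores[model][b] - lo) / (hi - lo) if hi > lo else 1.0)
        return sum(normed) / len(normed)

    for model in scores:
        print(f"{model}: {normalized_average(model):.3f}")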

Coding Benchmarks

Real-world software engineering and code generation tasks

SWE-Bench Verified

Real-world software engineering tasks from GitHub issues

Higher is better
1. Claude Sonnet 4.6 · 82.1%
2. Claude Opus 4.6 · 79.2%
3. GPT-5.3 Codex · 78.5%
4. Claude Opus 4.5 · 74.4%
5. Gemini 3.1 Pro · 74.2%
6. GPT-5.2 · 72.4%
7. Grok 4 · 71.3%
8. DeepSeek V4 · 68.9%
9. Llama 4 405B · 65.2%
10. Gemini 3 Flash · 62.8%

HumanEval

Python code generation with unit test verification

Higher is better
1. GPT-5.3 Codex · 97.4%
2. Claude Opus 4.6 · 96.8%
3. Gemini 3.1 Pro · 95.8%
4. Claude Sonnet 4.6 · 95.2%
5. Claude Opus 4.5 · 94.5%
6. GPT-5.2 · 94.1%
7. Grok 4 · 93.5%
8. DeepSeek V4 · 92.8%
9. Gemini 3 Flash · 89.2%
10. Llama 4 405B · 88.4%
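
HumanEval results like those above are typically reported as pass@1: the fraction of the 164 problems for which a sampled solution passes every unit test. When several samples are drawn per problem, the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021) is the usual way to report results; a minimal sketch follows (whether this page's sources used multiple samples is not stated):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator from the HumanEval paper.

        n: samples generated for one problem
        c: samples that pass all unit tests
        k: samples the metric is allowed to draw
        """
        if n - c < k:
            return 1.0  # any k-subset is guaranteed to contain a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 30 of them passing
    print(pass_at_k(n=200, c=30, k=1))   # 0.15 (equivalent to c / n)
    print(pass_at_k(n=200, c=30, k=10))  # ~0.81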

Knowledge Benchmarks

General knowledge and understanding across domains

MMLU

Massive Multitask Language Understanding across 57 subjects

Higher is better
1. Gemini 3.1 Pro · 94.3%
2. GPT-5.3 Codex · 93.0%
3. Claude Opus 4.6 · 91.3%
4. GPT-5.2 · 90.8%
5. Claude Opus 4.5 · 89.8%
6. Grok 4 · 89.2%
7. Claude Sonnet 4.6 · 88.7%
8. DeepSeek V4 · 88.4%
9. Gemini 3 Flash · 86.5%
10. Llama 4 405B · 86.1%
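
Because MMLU is a multiple-choice test spread over 57 subjects, a headline score is an accuracy, but reporting conventions differ between averaging over all questions (micro) and averaging per-subject accuracies (macro). A minimal sketch of both, with made-up correct/total counts used purely for illustration:

    # Sketch: micro vs. macro accuracy for an MMLU-style evaluation.
    # Subject names follow MMLU conventions; the counts are illustrative only.
    subjects = {
        "college_physics":  {"correct": 88,   "total": 102},
        "professional_law": {"correct": 1020, "total": 1534},
        "world_religions":  {"correct": 150,  "total": 171},
    }

    micro = sum(s["correct"] for s in subjects.values()) / sum(s["total"] for s in subjects.values())
    macro = sum(s["correct"] / s["total"] for s in subjects.values()) / len(subjects)

    print(f"micro-average accuracy: {micro:.3f}")  # weights subjects by question count
    print(f"macro-average accuracy: {macro:.3f}")  # weights every subject equally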

Reasoning Benchmarks

Abstract reasoning and problem-solving capabilities

ARC-AGI-2

Abstract reasoning - designed to be easy for humans, hard for AI

Higher is better
1. Gemini 3.1 Pro · 77.1%
2. GPT-5.2 · 54.0%
3. GPT-5.3 Codex · 51.8%
4. Claude Opus 4.6 · 48.2%
5. Grok 4 · 45.3%
6. Claude Opus 4.5 · 43.5%
7. Claude Sonnet 4.6 · 42.1%
8. DeepSeek V4 · 41.2%
9. Gemini 3 Flash · 38.5%
10. Llama 4 405B · 32.8%

GPQA Diamond

Graduate-level science questions (physics, chemistry, biology)

Higher is better
1. Gemini 3.1 Pro · 78.4%
2. Claude Opus 4.6 · 74.8%
3. Claude Opus 4.5 · 72.1%
4. GPT-5.3 Codex · 71.5%
5. GPT-5.2 · 69.8%
6. Claude Sonnet 4.6 · 68.2%
7. Grok 4 · 67.3%
8. DeepSeek V4 · 65.8%
9. Llama 4 405B · 61.2%
10. Gemini 3 Flash · 58.2%

Math Benchmarks

Mathematical reasoning from basic to competition level

MATH-500

Competition-level mathematics problems

Higher is better
1. Gemini 3.1 Pro · 91.2%
2. DeepSeek V4 · 85.3%
3. Claude Opus 4.6 · 82.4%
4. GPT-5.3 Codex · 79.8%
5. Claude Opus 4.5 · 78.6%
6. Claude Sonnet 4.6 · 78.1%
7. Grok 4 · 77.8%
8. GPT-5.2 · 76.5%
9. Gemini 3 Flash · 72.4%
10. Llama 4 405B · 68.9%

AIME 2024

American Invitational Mathematics Examination problems

Higher is better
1. Gemini 3.1 Pro · 68.5%
2. DeepSeek V4 · 52.8%
3. Claude Opus 4.6 · 45.2%
4. GPT-5.3 Codex · 42.8%
5. Grok 4 · 41.2%
6. Claude Opus 4.5 · 40.8%
7. GPT-5.2 · 40.1%
8. Claude Sonnet 4.6 · 38.4%
9. Gemini 3 Flash · 28.6%
10. Llama 4 405B · 25.4%

Human Preference Benchmarks

Crowdsourced human preference ratings

Chatbot Arena Elo

Crowdsourced human preference ratings from 6M+ votes

Elo Rating
1. Gemini 3.1 Pro · 1348
2. Claude Opus 4.6 · 1342
3. GPT-5.3 Codex · 1335
4. Claude Opus 4.5 · 1328
5. Claude Sonnet 4.6 · 1318
6. GPT-5.2 · 1312
7. Grok 4 · 1305
8. DeepSeek V4 · 1298
9. Gemini 3 Flash · 1285
10. Llama 4 405B · 1275
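
Chatbot Arena ratings are derived from pairwise "which response is better" votes. The classic online Elo update below shows the basic idea; the Arena's production pipeline aggregates votes differently (for example with a Bradley-Terry style fit), so treat this as an illustrative sketch rather than the Arena's actual method:

    # Sketch: one Elo update from a single head-to-head preference vote.
    # The K-factor and starting ratings are illustrative, not the Arena's pipeline.

    def expected_score(r_a: float, r_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
        """Return updated (r_a, r_b) after one vote."""
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

    # Example: the 1348-rated leader loses a vote to the 1342-rated runner-up
    print(elo_update(1348, 1342, a_won=False))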

Model Specifications

Context windows, pricing, and release information

Model             | Company   | Context | Max Output | Input $/1M | Output $/1M | Released
Claude Opus 4.6   | Anthropic | 1.0M    | 128K       | $15.00     | $75.00      | 2026-02
Claude Opus 4.5   | Anthropic | 200K    | 32K        | $15.00     | $75.00      | 2025-10
Claude Sonnet 4.6 | Anthropic | 200K    | 64K        | $3.00      | $15.00      | 2026-02
GPT-5.3 Codex     | OpenAI    | 256K    | 32K        | $5.00      | $20.00      | 2026-02
GPT-5.2           | OpenAI    | 128K    | 16K        | $2.50      | $10.00      | 2025-12
Gemini 3.1 Pro    | Google    | 2.0M    | 65,536     | $1.25      | $5.00       | 2026-02
Gemini 3 Flash    | Google    | 1.0M    | 32,768     | $0.07      | $0.30       | 2026-01
Grok 4            | xAI       | 256K    | 32K        | $3.00      | $15.00      | 2026-02
DeepSeek V4       | DeepSeek  | 128K    | 16K        | $0.14      | $0.28       | 2026-02
Llama 4 405B      | Meta      | 128K    | 16K        | Free       | Free        | 2026-01
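
The pricing columns are dollars per million tokens, so one request costs input_tokens / 1e6 times the input rate plus output_tokens / 1e6 times the output rate. A minimal sketch using a few of the rates listed above; prices change frequently, as the limitations below note, so verify against the providers' current price lists:

    # Sketch: per-request cost from the $/1M-token rates in the table above.
    # Rates are copied from the table; confirm them before relying on this.
    PRICING = {  # model: (input $/1M tokens, output $/1M tokens)
        "Claude Opus 4.6": (15.00, 75.00),
        "GPT-5.3 Codex":   (5.00, 20.00),
        "Gemini 3.1 Pro":  (1.25, 5.00),
        "DeepSeek V4":     (0.14, 0.28),
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Estimated USD cost of one request at the listed rates."""
        in_rate, out_rate = PRICING[model]
        return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

    # Example: a 20,000-token prompt with a 2,000-token completion
    for model in PRICING:
        print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f}")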

Methodology & Sources

Benchmark scores are collected from official model releases, research papers, and independent evaluation platforms. We prioritize verified, reproducible results.

Key Benchmarks Explained

  • SWE-Bench Verified: Real software engineering tasks from GitHub, testing end-to-end coding ability
  • HumanEval: Python code generation with unit test verification (164 problems)
  • MMLU: 57-subject knowledge test covering STEM, humanities, and social sciences
  • ARC-AGI-2: Abstract reasoning designed to be easy for humans, hard for AI
  • GPQA Diamond: Graduate-level science questions requiring deep understanding
  • MATH-500: Competition-level mathematics from AMC to IMO difficulty
  • Chatbot Arena: Elo ratings from 6M+ crowdsourced human preference votes

Limitations

  • Benchmarks don't capture all real-world capabilities
  • Scores can vary based on prompting and evaluation setup
  • Some benchmarks may be saturated by frontier models
  • Pricing and capabilities change frequently

Want to stay updated on AI models?

We update benchmarks as new models are released and tested.
