Updated February 26, 2026

AI Model Benchmarks

Compare the latest AI models across coding, reasoning, math, and knowledge benchmarks. Data sourced from official releases and independent evaluations.

10 models compared · 8 benchmarks · 5 categories · Last updated: Feb 2026

Overall Rankings

Averaged across all benchmarks (normalized scores)

1. Gemini 3.1 Pro (Google) · Math, Multimodal, Long context, Reasoning · avg score 81.7
2. Claude Opus 4.6 (Anthropic) · Long context, Coding, Analysis, Safety · avg score 73.6
3. GPT-5.3 Codex (OpenAI) · Agentic tasks, Coding, Tool use · avg score 72.8
4. Claude Opus 4.5 (Anthropic) · Creative writing, Analysis, Reasoning, Safety · avg score 69.7
5. GPT-5.2 (OpenAI) · General purpose, Reasoning, Versatility · avg score 69.2
6. Claude Sonnet 4.6 (Anthropic) · Coding, Speed, Value, Computer use · avg score 69.0
7. DeepSeek V4 (DeepSeek) · Cost efficiency, Open weights, Math · avg score 68.0
8. Grok 4 (xAI) · Real-time data, Reasoning, Uncensored · avg score 67.3
9. Gemini 3 Flash (Google) · Speed, Cost, Multimodal · avg score 59.8
10. Llama 4 405B (Meta) · Open source, Self-hosting, Customization · avg score 58.2
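
The "avg score" figures above are averages over normalized benchmark scores. The exact normalization and weighting are not specified on this page, so the following is a minimal sketch of one common convention (min-max rescaling per benchmark, then an unweighted mean), illustrated with a handful of scores copied from the tables below; it will not reproduce the site's exact figures.

    # Sketch: min-max normalize each benchmark across models, then average.
    # The site's exact normalization/weighting isn't published; this is one
    # common convention, shown with two benchmarks and three models only.
    scores = {
        "Gemini 3.1 Pro":  {"SWE-Bench Verified": 74.2, "ARC-AGI-2": 77.1},
        "Claude Opus 4.6": {"SWE-Bench Verified": 79.2, "ARC-AGI-2": 48.2},
        "Llama 4 405B":    {"SWE-Bench Verified": 65.2, "ARC-AGI-2": 32.8},
    }
    benchmarks = {b for per_model in scores.values() for b in per_model}

    def normalized_average(model: str) -> float:
        """Average of per-benchmark scores rescaled to 0..1 within this model set."""
        normed = []
        for b in benchmarks:
            values = [per_model[b] for per_model in scores.values()]
            lo, hi = min(values), max(values)
            normed.append((scores[model][b] - lo) / (hi - lo) if hi > lo else 1.0)
        return sum(normed) / len(normed)

    for model in scores:
        print(f"{model}: {normalized_average(model):.3f}")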

Coding Benchmarks

Real-world software engineering and code generation tasks

SWE-Bench Verified

Real-world software engineering tasks from GitHub issues

Higher is better
1. Claude Sonnet 4.6 · 82.1%
2. Claude Opus 4.6 · 79.2%
3. GPT-5.3 Codex · 78.5%
4. Claude Opus 4.5 · 74.4%
5. Gemini 3.1 Pro · 74.2%
6. GPT-5.2 · 72.4%
7. Grok 4 · 71.3%
8. DeepSeek V4 · 68.9%
9. Llama 4 405B · 65.2%
10. Gemini 3 Flash · 62.8%

HumanEval

Python code generation with unit test verification

Higher is better
1. GPT-5.3 Codex · 97.4%
2. Claude Opus 4.6 · 96.8%
3. Gemini 3.1 Pro · 95.8%
4. Claude Sonnet 4.6 · 95.2%
5. Claude Opus 4.5 · 94.5%
6. GPT-5.2 · 94.1%
7. Grok 4 · 93.5%
8. DeepSeek V4 · 92.8%
9. Gemini 3 Flash · 89.2%
10. Llama 4 405B · 88.4%
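
HumanEval results like those above are typically reported as pass@1: the fraction of the 164 problems for which a sampled solution passes every unit test. When several samples are drawn per problem, the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021) is the usual way to report results; a minimal sketch follows (whether this page's sources used multiple samples is not stated):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator from the HumanEval paper.

        n: samples generated for one problem
        c: samples that pass all unit tests
        k: samples the metric is allowed to draw
        """
        if n - c < k:
            return 1.0  # any k-subset is guaranteed to contain a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 30 of them passing
    print(pass_at_k(n=200, c=30, k=1))   # 0.15 (equivalent to c / n)
    print(pass_at_k(n=200, c=30, k=10))  # ~0.81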

Knowledge Benchmarks

General knowledge and understanding across domains

MMLU

Massive Multitask Language Understanding across 57 subjects

Higher is better
1. Gemini 3.1 Pro · 94.3%
2. GPT-5.3 Codex · 93.0%
3. Claude Opus 4.6 · 91.3%
4. GPT-5.2 · 90.8%
5. Claude Opus 4.5 · 89.8%
6. Grok 4 · 89.2%
7. Claude Sonnet 4.6 · 88.7%
8. DeepSeek V4 · 88.4%
9. Gemini 3 Flash · 86.5%
10. Llama 4 405B · 86.1%
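
Because MMLU is a multiple-choice test spread over 57 subjects, a headline score is an accuracy, but reporting conventions differ between averaging over all questions (micro) and averaging per-subject accuracies (macro). A minimal sketch of both, with made-up correct/total counts used purely for illustration:

    # Sketch: micro vs. macro accuracy for an MMLU-style evaluation.
    # Subject names follow MMLU conventions; the counts are illustrative only.
    subjects = {
        "college_physics":  {"correct": 88,   "total": 102},
        "professional_law": {"correct": 1020, "total": 1534},
        "world_religions":  {"correct": 150,  "total": 171},
    }

    micro = sum(s["correct"] for s in subjects.values()) / sum(s["total"] for s in subjects.values())
    macro = sum(s["correct"] / s["total"] for s in subjects.values()) / len(subjects)

    print(f"micro-average accuracy: {micro:.3f}")  # weights subjects by question count
    print(f"macro-average accuracy: {macro:.3f}")  # weights every subject equally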

Reasoning Benchmarks

Abstract reasoning and problem-solving capabilities

ARC-AGI-2

Abstract reasoning - designed to be easy for humans, hard for AI

Higher is better
1. Gemini 3.1 Pro · 77.1%
2. GPT-5.2 · 54.0%
3. GPT-5.3 Codex · 51.8%
4. Claude Opus 4.6 · 48.2%
5. Grok 4 · 45.3%
6. Claude Opus 4.5 · 43.5%
7. Claude Sonnet 4.6 · 42.1%
8. DeepSeek V4 · 41.2%
9. Gemini 3 Flash · 38.5%
10. Llama 4 405B · 32.8%

GPQA Diamond

Graduate-level science questions (physics, chemistry, biology)

Higher is better
1. Gemini 3.1 Pro · 78.4%
2. Claude Opus 4.6 · 74.8%
3. Claude Opus 4.5 · 72.1%
4. GPT-5.3 Codex · 71.5%
5. GPT-5.2 · 69.8%
6. Claude Sonnet 4.6 · 68.2%
7. Grok 4 · 67.3%
8. DeepSeek V4 · 65.8%
9. Llama 4 405B · 61.2%
10. Gemini 3 Flash · 58.2%

Math Benchmarks

Mathematical reasoning from basic to competition level

MATH-500

Competition-level mathematics problems

Higher is better
1. Gemini 3.1 Pro · 91.2%
2. DeepSeek V4 · 85.3%
3. Claude Opus 4.6 · 82.4%
4. GPT-5.3 Codex · 79.8%
5. Claude Opus 4.5 · 78.6%
6. Claude Sonnet 4.6 · 78.1%
7. Grok 4 · 77.8%
8. GPT-5.2 · 76.5%
9. Gemini 3 Flash · 72.4%
10. Llama 4 405B · 68.9%

AIME 2024

American Invitational Mathematics Examination problems

Higher is better
1. Gemini 3.1 Pro · 68.5%
2. DeepSeek V4 · 52.8%
3. Claude Opus 4.6 · 45.2%
4. GPT-5.3 Codex · 42.8%
5. Grok 4 · 41.2%
6. Claude Opus 4.5 · 40.8%
7. GPT-5.2 · 40.1%
8. Claude Sonnet 4.6 · 38.4%
9. Gemini 3 Flash · 28.6%
10. Llama 4 405B · 25.4%

Human Preference Benchmarks

Crowdsourced human preference ratings

Chatbot Arena Elo

Crowdsourced human preference ratings from 6M+ votes

Elo Rating
1. Gemini 3.1 Pro · 1348
2. Claude Opus 4.6 · 1342
3. GPT-5.3 Codex · 1335
4. Claude Opus 4.5 · 1328
5. Claude Sonnet 4.6 · 1318
6. GPT-5.2 · 1312
7. Grok 4 · 1305
8. DeepSeek V4 · 1298
9. Gemini 3 Flash · 1285
10. Llama 4 405B · 1275
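
Chatbot Arena ratings are derived from pairwise "which response is better" votes. The classic online Elo update below shows the basic idea; the Arena's production pipeline aggregates votes differently (for example with a Bradley-Terry style fit), so treat this as an illustrative sketch rather than the Arena's actual method:

    # Sketch: one Elo update from a single head-to-head preference vote.
    # The K-factor and starting ratings are illustrative, not the Arena's pipeline.

    def expected_score(r_a: float, r_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
        """Return updated (r_a, r_b) after one vote."""
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

    # Example: the 1348-rated leader loses a vote to the 1342-rated runner-up
    print(elo_update(1348, 1342, a_won=False))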

Model Specifications

Context windows, pricing, and release information

Model             | Company   | Context | Max Output | Input $/1M | Output $/1M | Released
Claude Opus 4.6   | Anthropic | 1.0M    | 128K       | $15.00     | $75.00      | 2026-02
Claude Opus 4.5   | Anthropic | 200K    | 32K        | $15.00     | $75.00      | 2025-10
Claude Sonnet 4.6 | Anthropic | 200K    | 64K        | $3.00      | $15.00      | 2026-02
GPT-5.3 Codex     | OpenAI    | 256K    | 32K        | $5.00      | $20.00      | 2026-02
GPT-5.2           | OpenAI    | 128K    | 16K        | $2.50      | $10.00      | 2025-12
Gemini 3.1 Pro    | Google    | 2.0M    | 65,536     | $1.25      | $5.00       | 2026-02
Gemini 3 Flash    | Google    | 1.0M    | 32,768     | $0.07      | $0.30       | 2026-01
Grok 4            | xAI       | 256K    | 32K        | $3.00      | $15.00      | 2026-02
DeepSeek V4       | DeepSeek  | 128K    | 16K        | $0.14      | $0.28       | 2026-02
Llama 4 405B      | Meta      | 128K    | 16K        | Free       | Free        | 2026-01
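
The pricing columns are dollars per million tokens, so one request costs input_tokens / 1e6 times the input rate plus output_tokens / 1e6 times the output rate. A minimal sketch using a few of the rates listed above; prices change frequently, as the limitations below note, so verify against the providers' current price lists:

    # Sketch: per-request cost from the $/1M-token rates in the table above.
    # Rates are copied from the table; confirm them before relying on this.
    PRICING = {  # model: (input $/1M tokens, output $/1M tokens)
        "Claude Opus 4.6": (15.00, 75.00),
        "GPT-5.3 Codex":   (5.00, 20.00),
        "Gemini 3.1 Pro":  (1.25, 5.00),
        "DeepSeek V4":     (0.14, 0.28),
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Estimated USD cost of one request at the listed rates."""
        in_rate, out_rate = PRICING[model]
        return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

    # Example: a 20,000-token prompt with a 2,000-token completion
    for model in PRICING:
        print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f}")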

Methodology & Sources

Benchmark scores are collected from official model releases, research papers, and independent evaluation platforms. We prioritize verified, reproducible results.

Key Benchmarks Explained

  • SWE-Bench Verified: Real software engineering tasks from GitHub, testing end-to-end coding ability
  • HumanEval: Python code generation with unit test verification (164 problems)
  • MMLU: 57-subject knowledge test covering STEM, humanities, and social sciences
  • ARC-AGI-2: Abstract reasoning designed to be easy for humans, hard for AI
  • GPQA Diamond: Graduate-level science questions requiring deep understanding
  • MATH-500: Competition-level mathematics from AMC to IMO difficulty
  • Chatbot Arena: Elo ratings from 6M+ crowdsourced human preference votes

Limitations

  • Benchmarks don't capture all real-world capabilities
  • Scores can vary based on prompting and evaluation setup
  • Some benchmarks may be saturated by frontier models
  • Pricing and capabilities change frequently

Want to stay updated on AI models?

We update benchmarks as new models are released and tested.
