
GPT-4o (gpt-4o-2024-08-06)

First GPT-4o snapshot that supports Structured Outputs; the gpt-4o alias currently points to this version.
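Because this snapshot supports Structured Outputs, an API request can constrain the model's reply to a JSON Schema via the `response_format` parameter. Below is a minimal sketch of such a request body; the "invoice" schema and its field names are illustrative assumptions, not taken from this page.

```python
# Sketch of a Structured Outputs request body for gpt-4o-2024-08-06.
# The "invoice" schema below is purely illustrative.
request_body = {
    "model": "gpt-4o-2024-08-06",
    "messages": [
        {"role": "user",
         "content": "Extract the invoice total from: Total due: $41.20"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,  # strict mode enforces the schema exactly
            "schema": {
                "type": "object",
                "properties": {
                    "total": {"type": "number"},
                    "currency": {"type": "string"},
                },
                "required": ["total", "currency"],
                "additionalProperties": False,
            },
        },
    },
}
```

With `strict: True`, the model's output is guaranteed to parse against the schema, which removes a class of retry/validation code from production pipelines.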

Release Date: August 6, 2024

Avg. Accuracy: 62.7%
Latency: 29.39s

Performance by Benchmark

Benchmark       Accuracy   Ranking
FinanceAgent    19.3%      13 / 24
CorpFin         49.3%      31 / 39
CaseLaw         83.3%      17 / 62
ContractLaw     61.7%      57 / 69
TaxEval         75.0%      18 / 49
MortgageTax     75.2%       9 / 29
Math500         75.2%      31 / 45
AIME            14.0%      29 / 39
MGSM            90.6%      19 / 43
LegalBench      79.0%      22 / 67
MedQA           88.2%      20 / 47
MMLU Pro        74.1%      24 / 40
MMMU            65.5%      19 / 26
SWE-bench       27.2%       8 / 10
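As a sanity check, the 62.7% headline figure matches the unweighted mean of the fourteen per-benchmark accuracies listed above:

```python
# Per-benchmark accuracies from the table above (percent).
scores = [19.3, 49.3, 83.3, 61.7, 75.0, 75.2, 75.2,
          14.0, 90.6, 79.0, 88.2, 74.1, 65.5, 27.2]

avg = sum(scores) / len(scores)
print(round(avg, 1))  # 62.7
```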

Note: benchmarks are a mix of academic and proprietary datasets (contact us for access to the proprietary ones).

Cost Analysis

Input Cost: $2.50 / M tokens
Output Cost: $10.00 / M tokens
Input Cost (per char): N/A
Output Cost (per char): $2.35 / M chars

Overview

GPT-4o is OpenAI’s latest flagship model, optimized for multi-step tasks. It represents a sweet spot between performance and efficiency, making it particularly attractive for production deployments that require high intelligence but need to manage costs.

Key Specifications

  • Context Window: 128,000 tokens
  • Output Limit: 16,384 tokens
  • Training Cutoff: October 2023
  • Pricing:
    • Input: $2.50 per million tokens
    • Cached Input: $1.25 per million tokens
    • Output: $10.00 per million tokens
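At these rates, the cost of a single request can be estimated from its token counts. A small helper using the per-million-token prices quoted above (the function name and signature are my own):

```python
# Prices for gpt-4o-2024-08-06, in USD per million tokens (from above).
INPUT_PRICE = 2.50
CACHED_INPUT_PRICE = 1.25
OUTPUT_PRICE = 10.00

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request at gpt-4o rates."""
    uncached = input_tokens - cached_tokens
    return (uncached * INPUT_PRICE
            + cached_tokens * CACHED_INPUT_PRICE
            + output_tokens * OUTPUT_PRICE) / 1_000_000

# e.g. a 10k-token prompt with a 2k-token reply:
print(round(request_cost(10_000, 2_000), 4))  # 0.045
```

Prompt caching matters at scale: a fully cached prompt costs half as much on the input side.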

Performance Highlights

  • Speed: Faster inference than standard GPT-4
  • Cost Efficiency: Up to 4x cheaper than GPT-4 Turbo on input tokens
  • Reasoning: Strong performance on complex logical tasks
  • Consistency: Reliable outputs across different domains

Benchmark Results

Notable results from our benchmark suite:

  • TaxEval: 75.0% accuracy in tax reasoning
  • LegalBench: 79.0% in legal analysis
  • ContractLaw: 61.7% in contract interpretation
  • CaseLaw: 83.3% in case-law understanding

Use Case Recommendations

Best suited for:

  • Production API deployments
  • Complex reasoning tasks
  • Legal document analysis
  • Financial modeling
  • Tasks requiring balance of cost and capability

Limitations

  • Unable to perform the same complex, multi-step reasoning as o1

Comparison with Other Models

  • More powerful than GPT-4o Mini
  • Competitive with Claude 3.5 Sonnet
  • Better performance/cost ratio than most competitors