Updates
Benchmark
06/09/2025
LiveCodeBench: Models Struggle with Hard Competitive Programming Problems
Our results for LiveCodeBench are now live!
Key findings:
- Performance drops dramatically with difficulty: models average 85% on easy problems but only 11% on hard problems (see the aggregation sketch after this list)
- o4 Mini achieved state-of-the-art overall accuracy of 66%, but managed just 33% on hard problems
- Aside from o4 Mini, top performers include o3, Claude Opus 4 (Thinking), and Gemini 2.5 Pro Preview
- High latency is a major concern: Gemini 2.5 Pro Preview averaged over 2 minutes per response, making it impractical for real-time applications
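The headline numbers above are simple per-difficulty accuracy aggregates. Here is a minimal sketch of that aggregation, assuming per-problem results are available as difficulty/solved records; the field names and data are illustrative, not LiveCodeBench's actual schema:

```python
from collections import defaultdict

# Illustrative per-problem results; a real run would have hundreds of entries.
results = [
    {"difficulty": "easy", "solved": True},
    {"difficulty": "easy", "solved": True},
    {"difficulty": "hard", "solved": True},
    {"difficulty": "hard", "solved": False},
]

# difficulty -> [num_solved, num_attempted]
totals = defaultdict(lambda: [0, 0])
for r in results:
    totals[r["difficulty"]][0] += r["solved"]
    totals[r["difficulty"]][1] += 1

for difficulty, (solved, attempted) in sorted(totals.items()):
    print(f"{difficulty}: {solved / attempted:.0%} ({solved}/{attempted})")
```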
Model
05/30/2025
Opus 4 (Nonthinking) evaluated on our benchmarks; Sonnet 4 evaluated on FAB.
We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!
We found:
- Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
- Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24 of 62) and LegalBench (#8 vs. #32 of 67) but scored notably lower on ContractLaw (#16 vs. #2 of 69).
- Opus 4 is expensive, with an output cost of $75.00/M tokens: 5x that of Sonnet 4, and roughly 1.5x o3's input price with nearly 1.9x its output price ($15/$75 vs. $10/$40 per M input/output tokens). A back-of-the-envelope cost comparison follows this list.
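To sanity-check the pricing claims, here is a rough cost sketch using the per-million-token prices quoted in this post. Sonnet 4's input price and the example token mix are assumptions for illustration:

```python
# (input, output) price per million tokens, USD. Opus 4 and o3 figures are
# quoted in this post; Sonnet 4's input price is assumed for illustration.
PRICES = {
    "Claude Opus 4": (15.00, 75.00),
    "Claude Sonnet 4": (3.00, 15.00),  # input price assumed
    "o3": (10.00, 40.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request for a given token mix."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example mix: 10k input tokens and 2k output tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

On this mix, Opus 4 comes out to $0.30 per request, vs. $0.06 for Sonnet 4 and $0.18 for o3.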
We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for these models). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).
Model
05/27/2025
Claude Sonnet 4 (Thinking) evaluated on all benchmarks!
We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!
- Claude Sonnet 4 (Thinking) significantly underperformed its predecessor, Claude 3.7 Sonnet (Thinking), on our proprietary TaxEval and ContractLaw benchmarks.
- Claude Sonnet 4 (Thinking) significantly outperformed Claude Sonnet 4 (Nonthinking) on our reasoning benchmarks. For example, Claude Sonnet 4 (Thinking) scored 76.3% and Claude Sonnet 4 (Nonthinking) scored 38.5% on our AIME benchmark.
- Claude Sonnet 4 (Thinking) is consistently in the top 10 across most of our benchmarks, though it is never the SOTA model.
- Latency is high when reasoning is enabled with a large token budget: on AIME, the model took four minutes per response on average, with some questions taking over ten minutes (a timing sketch follows this list).
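As a rough illustration of how per-question latency can be measured, here is a minimal wall-clock timing sketch; the call_model helper and the questions are hypothetical stand-ins, not our actual evaluation harness:

```python
import statistics
import time

def measure_latency(questions, call_model):
    """Return (mean, max) wall-clock seconds per response."""
    latencies = []
    for q in questions:
        start = time.perf_counter()
        call_model(q)  # hypothetical wrapper around the real API call
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), max(latencies)

# Stand-in model call that just sleeps; replace with a real client call.
mean_s, worst_s = measure_latency(
    ["question 1", "question 2"],
    call_model=lambda q: time.sleep(0.01),
)
print(f"mean {mean_s:.2f}s, worst {worst_s:.2f}s")
```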
The full writeups are linked in the comments. The final verdict on the Claude 4 family's strengths will come with Opus 4, so stay tuned for those results!
Latest Model Releases
- Claude Sonnet 4 (Nonthinking): released 5/22/2025
- Claude Opus 4 (Nonthinking): released 5/22/2025
- Claude Sonnet 4 (Thinking): released 5/22/2025
- Claude Opus 4 (Thinking): released 5/22/2025