Updates
Benchmark
06/13/2025
SWE-bench results released
- Foundation models still fail to solve real-world coding problems despite notable progress, highlighting remaining room for improvement.
- The models’ performance drops significantly on “harder” problems that take >1 hour to complete. Only Claude Sonnet 4 (Nonthinking), o3, and GPT-4.1 pass any of the >4-hour tasks (33% each).
- Claude Sonnet 4 (Nonthinking) leads by a wide margin with 65.0% accuracy, while maintaining both excellent cost efficiency ($1.24 per test) and fast completion times (426.52s).
- Tool usage patterns reveal that models employ distinct strategies: o4 Mini brute-forces problems (~25k searches per task), while Claude Sonnet 4 (Nonthinking) employs a leaner, balanced mix (~9-10k default tool calls with far fewer searches).
Note that we run every model through the same evaluation harness to make direct comparisons between models, so the scores show relative performance, not each model’s best possible accuracy.
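As a rough illustration of that setup (not our actual harness; the model identifiers, task stubs, and pass/fail logic below are hypothetical placeholders), the key property is that every model runs through the identical loop and grader:

```python
# Illustrative sketch only -- not the real evaluation harness.
# Every model is run over the same task list with the same settings and
# the same grader, so scores are comparable across models, even if a
# model might score higher with its own tuned scaffolding.
MODELS = ["model-a", "model-b", "model-c"]      # hypothetical identifiers
TASKS = [{"id": i} for i in range(10)]          # stand-in for real benchmark tasks


def attempt_task(model: str, task: dict) -> bool:
    """Placeholder for: prompt the model, apply its patch, run the repo's tests."""
    # A real harness would call the model API and execute tests here;
    # this stub just returns a deterministic fake pass/fail.
    return (len(model) + task["id"]) % 2 == 0


def evaluate() -> dict[str, float]:
    # Identical tasks, settings, and grading for every model -> relative comparison.
    return {m: sum(attempt_task(m, t) for t in TASKS) / len(TASKS) for m in MODELS}


if __name__ == "__main__":
    for model, score in evaluate().items():
        print(f"{model}: {score:.0%}")
```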
Benchmark
06/09/2025
LiveCodeBench: Models Struggle with Hard Competitive Programming Problems
Our results for LiveCodeBench are now live!
Key findings:
- Performance drops dramatically with difficulty: models average 85% on easy problems but only 11% on hard problems
- o4 Mini achieved state-of-the-art accuracy of 66% overall but struggled on hard problems with just 34% accuracy
- Aside from o4 Mini, top performers include o3, Claude Opus 4 (Thinking), and Gemini 2.5 Pro Preview
- High latency is a major concern: Gemini 2.5 Pro Preview averaged over 2 minutes per response, making it impractical for real-time applications
Model
05/30/2025
Claude Opus 4 (Nonthinking) evaluated on our benchmarks; Claude Sonnet 4 evaluated on the Finance Agent Benchmark (FAB).
We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!
We found:
- Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
- Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24/62) and LegalBench (#8 vs. #32/67) but scored notably lower on ContractLaw (#16 vs. #2/69).
- Opus 4 is expensive, with an output cost of $75.00/M tokens, 5x as much as Sonnet 4 and about 1.5x the price of o3 ($15/$75 vs. $10/$40 input/output per 1M tokens); a quick check of these ratios is sketched below.
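For reference, a minimal sketch of that price comparison, assuming the quoted figures are USD per 1M input/output tokens (the blended ratio for a real workload depends on its input/output mix):

```python
# Quick check of the Opus 4 vs. o3 price gap using the per-1M-token
# prices quoted above (USD); check the providers' price sheets for
# current values.
PRICES = {
    "claude-opus-4": {"input": 15.00, "output": 75.00},
    "o3": {"input": 10.00, "output": 40.00},
}

for kind in ("input", "output"):
    ratio = PRICES["claude-opus-4"][kind] / PRICES["o3"][kind]
    print(f"{kind:>6} tokens: Opus 4 costs {ratio:.2f}x as much as o3")
# -> input tokens: 1.50x, output tokens: 1.88x
```

On these numbers, the gap is 1.5x for input tokens and closer to 1.9x for output tokens, so the effective premium depends on how output-heavy a given workload is.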
We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for these models). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).
Latest Model Releases
- Claude Sonnet 4 (Nonthinking): released 5/22/2025
- Claude Opus 4 (Nonthinking): released 5/22/2025
- Claude Sonnet 4 (Thinking): released 5/22/2025
- Claude Opus 4 (Thinking): released 5/22/2025