Updates
Benchmark
06/09/2025
LiveCodeBench: Models Struggle with Hard Competitive Programming Problems
Our results for LiveCodeBench are now live!
Key findings:
- Performance drops dramatically with difficulty: models average 85% on easy problems but only 11% on hard problems (see the aggregation sketch after this list)
- o4 Mini achieved state-of-the-art overall accuracy of 66%, but managed just 33% on hard problems
- Aside from o4 Mini, top performers include o3, Claude Opus 4 (Thinking), and Gemini 2.5 Pro Preview
- High latency is a major concern: Gemini 2.5 Pro Preview averaged over 2 minutes per response, making it impractical for real-time applications
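The headline numbers above are simple per-difficulty accuracy aggregates. Here is a minimal sketch of that aggregation, assuming per-problem results are available as difficulty/solved records; the field names and data are illustrative, not LiveCodeBench's actual schema:

```python
from collections import defaultdict

# Illustrative per-problem results; a real run would have hundreds of entries.
results = [
    {"difficulty": "easy", "solved": True},
    {"difficulty": "easy", "solved": True},
    {"difficulty": "hard", "solved": True},
    {"difficulty": "hard", "solved": False},
]

# difficulty -> [num_solved, num_attempted]
totals = defaultdict(lambda: [0, 0])
for r in results:
    totals[r["difficulty"]][0] += r["solved"]
    totals[r["difficulty"]][1] += 1

for difficulty, (solved, attempted) in sorted(totals.items()):
    print(f"{difficulty}: {solved / attempted:.0%} ({solved}/{attempted})")
```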
Model
05/30/2025
Opus 4 (Nonthinking) evaluated on our benchmarks; Sonnet 4 evaluated on FAB.
We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!
We found:
- Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
- Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24 of 62) and LegalBench (#8 vs. #32 of 67) but scored notably lower on ContractLaw (#16 vs. #2 of 69).
- Opus 4 is expensive, with an output cost of $75.00/M tokens: 5x that of Sonnet 4, and roughly 1.5x o3's input price with nearly 1.9x its output price ($15/$75 vs. $10/$40 per M input/output tokens). A back-of-the-envelope cost comparison follows this list.
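To sanity-check the pricing claims, here is a rough cost sketch using the per-million-token prices quoted in this post. Sonnet 4's input price and the example token mix are assumptions for illustration:

```python
# (input, output) price per million tokens, USD. Opus 4 and o3 figures are
# quoted in this post; Sonnet 4's input price is assumed for illustration.
PRICES = {
    "Claude Opus 4": (15.00, 75.00),
    "Claude Sonnet 4": (3.00, 15.00),  # input price assumed
    "o3": (10.00, 40.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request for a given token mix."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example mix: 10k input tokens and 2k output tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.4f}")
```

On this mix, Opus 4 comes out to $0.30 per request, vs. $0.06 for Sonnet 4 and $0.18 for o3.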
We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for these models). They performed nearly identically to Claude 3.7 Sonnet (Nonthinking).
Model
05/27/2025
Claude Sonnet 4 (Thinking) evaluated on all benchmarks!
We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!
- Claude Sonnet 4 (Thinking) significantly underperformed its predecessor, Claude 3.7 Sonnet (Thinking), on our proprietary TaxEval and ContractLaw benchmarks.
- Claude Sonnet 4 (Thinking) significantly outperformed Claude Sonnet 4 (Nonthinking) on our reasoning benchmarks. For example, Claude Sonnet 4 (Thinking) scored 76.3% and Claude Sonnet 4 (Nonthinking) scored 38.5% on our AIME benchmark.
- Claude Sonnet 4 (Thinking) is consistently in the top 10 across most of our benchmarks, though it is never the SOTA model.
- Latency is high when reasoning is enabled with a large token budget: on AIME, the model took four minutes per response on average, with some questions taking over ten minutes (a timing sketch follows this list).
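As a rough illustration of how per-question latency can be measured, here is a minimal wall-clock timing sketch; the call_model helper and the questions are hypothetical stand-ins, not our actual evaluation harness:

```python
import statistics
import time

def measure_latency(questions, call_model):
    """Return (mean, max) wall-clock seconds per response."""
    latencies = []
    for q in questions:
        start = time.perf_counter()
        call_model(q)  # hypothetical wrapper around the real API call
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), max(latencies)

# Stand-in model call that just sleeps; replace with a real client call.
mean_s, worst_s = measure_latency(
    ["question 1", "question 2"],
    call_model=lambda q: time.sleep(0.01),
)
print(f"mean {mean_s:.2f}s, worst {worst_s:.2f}s")
```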
The full writeups are linked in the comments. The final verdict on the Claude 4 family's strengths will come with Opus 4, so stay tuned for those results!
Latest Model Releases
- Claude Sonnet 4 (Nonthinking): released 5/22/2025
- Claude Opus 4 (Nonthinking): released 5/22/2025
- Claude Sonnet 4 (Thinking): released 5/22/2025
- Claude Opus 4 (Thinking): released 5/22/2025