Updates
Model
05/30/2025
Claude Opus 4 (Nonthinking) evaluated on our benchmarks; Claude Sonnet 4 evaluated on our Finance Agent benchmark (FAB).
We’ve released our evaluation of Claude Opus 4 across our benchmarks!
We found:
- Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
- Compared to its predecessor, Claude 3 Opus, Opus 4 ranked higher on CaseLaw (#22 vs. #24/62) and LegalBench (#8 vs. #32/67) but scored notably lower on ContractLaw (#16 vs. #2/69).
- Opus 4 is expensive, with an output cost of $75.00 per million tokens, 5x that of Sonnet 4 and about 1.5x the price of o3 ($15 / $75 vs. $10 / $40 per million input / output tokens); see the sketch below for a worked per-request comparison.
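To make the pricing comparison concrete, here is a minimal sketch that estimates per-request cost from the figures quoted above. The request size and the Claude Sonnet 4 input price are illustrative assumptions; only the Opus 4 and o3 list prices come from this update.

# Rough per-request cost comparison, assuming the list prices quoted above
# (USD per million input / output tokens) and a hypothetical request of
# 2,000 input tokens and 1,000 output tokens.
PRICES = {
    "Claude Opus 4": (15.00, 75.00),   # input / output prices from this update
    "o3": (10.00, 40.00),              # input / output prices from this update
    "Claude Sonnet 4": (3.00, 15.00),  # assumed; the update only says Opus output costs 5x Sonnet's
}

INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 1_000  # hypothetical workload mix

for model, (input_price, output_price) in PRICES.items():
    cost = (INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price) / 1_000_000
    print(f"{model}: ${cost:.4f} per request")

At this assumed mix, Opus 4 works out to roughly 1.75x the per-request cost of o3; the exact ratio depends on the input/output split of your workload.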
We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark (the last remaining benchmark for these models). They performed nearly identically to Claude 3.7 Sonnet (Thinking).
Model
05/27/2025
Claude Sonnet 4 (Thinking) evaluated on all benchmarks!
We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!
- Claude Sonnet 4 (Thinking) noticeably underperforms its predecessor, Claude 3.7 Sonnet (Thinking), on our proprietary TaxEval and ContractLaw benchmarks.
- Claude Sonnet 4 (Thinking) significantly outperformed Claude Sonnet 4 (Nonthinking) on our reasoning benchmarks. For example, Claude Sonnet 4 (Thinking) scored 76.3% and Claude Sonnet 4 (Nonthinking) scored 38.5% on our AIME benchmark.
- Claude Sonnet 4 (Thinking) is consistently in the top 10 across most of our benchmarks, though it is never the SOTA model.
- Latency is high when reasoning is enabled with a large token budget. On AIME, the model took four minutes on average to respond, with some questions taking over ten minutes.
The full writeups are linked in the comments. The final verdict on the Claude 4 family’s strengths will come from Opus 4, so stay tuned for the results!
Model
05/25/2025
Claude Sonnet 4 (Nonthinking) evaluated on all benchmarks!
We just evaluated Claude Sonnet 4 (Nonthinking) on all benchmarks!
- Claude Sonnet 4 (Nonthinking) achieves 76.9% accuracy on average, a 7.1% improvement over Anthropic’s previous flagship model, Claude 3.7 Sonnet. The newer model is also nearly twice as fast at the same price.
- Claude Sonnet 4 (Nonthinking) excels on the MGSM benchmark, edging out Claude 3.7 Sonnet (Thinking) by a tenth of a percentage point.
- Claude Sonnet 4 (Nonthinking) also achieves strong performance on our proprietary CaseLaw benchmark, outperforming all previous Anthropic models.
- Interestingly, Claude Sonnet 4 (Nonthinking) performs worse than its predecessor, Claude 3.7 Sonnet, by six percentage points on the MortgageTax benchmark. It even performs worse than the older Claude 3.5 Sonnet on both the MortgageTax and CorpFin benchmarks!
Stay tuned for evaluations of Sonnet 4’s thinking variant, as well as Opus 4!
Latest Model Releases
Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025
Claude Opus 4 (Nonthinking)
Release date: 5/22/2025
Claude Sonnet 4 (Thinking)
Release date: 5/22/2025
Mistral Medium 3.1 (05/2025)
Release date: 5/7/2025