New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Model

05/30/2025

Opus 4 (Nonthinking) evaluated on our benchmarks; Sonnet 4 evaluated on FAB.

We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!

We found:

  • Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance on most other benchmarks.
  • Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24 of 62) and LegalBench (#8 vs. #32 of 67) but notably lower on ContractLaw (#16 vs. #2 of 69).
  • Opus 4 is expensive: its output cost of $75.00 per million tokens is 5x Sonnet 4’s, and its pricing is 1.5x o3’s on input and about 1.9x on output ($15 / $75 vs. $10 / $40 per million input / output tokens).
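The price comparison in the last bullet can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses just the per-million-token prices quoted above, reads each pair as (input, output), and the names `OPUS_4`, `O3`, and `price_ratios` are made up for this example.

```python
# Prices in USD per million tokens, as quoted in the bullet above.
# Assumption: each pair is (input price, output price).
OPUS_4 = (15.00, 75.00)
O3 = (10.00, 40.00)

# The post states that Opus 4's output price is 5x Sonnet 4's.
OPUS_TO_SONNET_OUTPUT_RATIO = 5


def price_ratios(model, baseline):
    """Return (input_ratio, output_ratio) of model prices over baseline prices."""
    return (model[0] / baseline[0], model[1] / baseline[1])


input_ratio, output_ratio = price_ratios(OPUS_4, O3)

# Sonnet 4 output price implied by the stated 5x ratio.
implied_sonnet_4_output = OPUS_4[1] / OPUS_TO_SONNET_OUTPUT_RATIO

print(f"Opus 4 vs o3, input:  {input_ratio}x")
print(f"Opus 4 vs o3, output: {output_ratio}x")
print(f"Implied Sonnet 4 output price: ${implied_sonnet_4_output} /M tokens")
```

With these numbers, the 1.5x figure matches the input-price ratio, while the output-price ratio works out to 1.875x.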

We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark, the last remaining benchmark for these models. Both performed nearly identically to Claude Sonnet 3.7 (Thinking).

View Model Page

Model

05/27/2025

Claude Sonnet 4 (Thinking) evaluated on all benchmarks!

We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!

The full writeups are linked in the comments. Opus 4 will be the final measure of the Claude 4 family’s strengths, so stay tuned for the results!

View Model Page

Model

05/25/2025

Claude Sonnet 4 (Nonthinking) evaluated on all benchmarks!

We just evaluated Claude Sonnet 4 (Nonthinking) on all benchmarks!

Stay tuned for evaluations of Sonnet 4’s thinking variant, as well as Opus 4!

View Models Page

Latest Benchmarks

View All Benchmarks

Latest Model Releases

View All Models

Claude Sonnet 4 (Nonthinking)

Release date : 5/22/2025

View Model
Claude Opus 4 (Nonthinking)

Release date : 5/22/2025

View Model
Claude Sonnet 4 (Thinking)

Release date : 5/22/2025

View Model
Mistral Medium 3.1 (05/2025)

Release date : 5/7/2025

View Model