New Finance Agent Benchmark Released

Public Enterprise LLM Benchmarks

Model benchmarks are seriously lacking. With Vals AI, we report how language models perform on the industry-specific tasks where they will be used.

Benchmark

06/09/2025

LiveCodeBench: Models Struggle with Hard Competitive Programming Problems

Our results for LiveCodeBench are now live!

Model

05/30/2025

Claude Opus 4 (Nonthinking) evaluated on our benchmarks; Claude Sonnet 4 evaluated on the Finance Agent benchmark (FAB).

We’ve released our evaluation of Claude Opus 4 (Nonthinking) across our benchmarks!

We found:

  • Opus 4 ranks #1 on both MMLU Pro and MGSM, narrowly setting new state-of-the-art scores. However, it achieves middle-of-the-road performance across most other benchmarks.
  • Compared to its predecessor (Opus 3), Opus 4 ranked higher on CaseLaw (#22 vs. #24/62) and LegalBench (#8 vs. #32/67) but notably lower on ContractLaw (#16 vs. #2/69).
  • Opus 4 is expensive, with an output cost of $75.00 per million tokens: 5x as much as Sonnet 4. Against o3 it costs 1.5x as much for input and nearly 1.9x as much for output ($15 / $75 vs. $10 / $40 per million input / output tokens).
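The price comparison in the last bullet can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each quoted pair is ordered input / output in USD per million tokens (the bullet only confirms that $75 is the Opus 4 output price), and the names `opus_4`, `o3`, and `price_ratio` are hypothetical.

```python
# Prices in USD per million tokens, taken from the bullet above.
# Assumption: each pair is ordered (input, output).
opus_4 = {"input": 15.00, "output": 75.00}
o3 = {"input": 10.00, "output": 40.00}


def price_ratio(a, b):
    """Return how many times more expensive `a` is than `b`, per token type."""
    return {kind: a[kind] / b[kind] for kind in a}


ratios = price_ratio(opus_4, o3)

# Sonnet 4's output price is not quoted directly; it is implied by the
# stated "5x" ratio against the Opus 4 output price.
implied_sonnet_4_output = opus_4["output"] / 5

print(ratios)
print(implied_sonnet_4_output)
```

Under that ordering assumption, the "about 1.5x" figure matches the input-price ratio, while the output-price ratio is somewhat higher.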

We also benchmarked Claude Sonnet 4 (Thinking) and Claude Sonnet 4 (Nonthinking) on our Finance Agent benchmark, the last remaining benchmark for these models. Both performed nearly identically to Claude 3.7 Sonnet (Nonthinking).

Model

05/27/2025

Claude Sonnet 4 (Thinking) evaluated on all benchmarks!

We’ve released our evaluation of Claude Sonnet 4 (Thinking) across all of our benchmarks!

The full writeups are linked in the comments. The final verdict on the Claude 4 family's strengths will come from Opus 4, so stay tuned for the results!

Latest Model Releases

Claude Sonnet 4 (Nonthinking)
Release date: 5/22/2025

Claude Opus 4 (Nonthinking)
Release date: 5/22/2025

Claude Sonnet 4 (Thinking)
Release date: 5/22/2025

Claude Opus 4 (Thinking)
Release date: 5/22/2025