Claude Opus 4.7 Is on the Way to 72% on the Finance Agent Benchmark. Here's Why That Matters.
64.4% today, 72% with Mythos Preview on the horizon — a 20% improvement across two generations. Anthropic's finance ambitions are reshaping what AI can do for private capital teams.
Most AI model releases are incremental. A few points on a coding benchmark here, a marginal improvement on reasoning there. The releases that matter for finance teams are different — they're the ones that cross a threshold, where something that was almost reliable becomes reliable enough to trust.
Claude Opus 4.7, released this week by Anthropic, is one of those releases.
On the Finance Agent evaluation (a benchmark that tests whether an AI model can actually perform multi-step financial analysis, not just answer questions about it), Opus 4.7 scored 64.4%, up from roughly 60% on Opus 4.6. On the General Finance module specifically, it reached 81.3%, up from 76.7%. And Anthropic has already signaled that Claude Mythos Preview, a more powerful model currently available in limited access, will push the overall score past 72%, a 20% improvement from where the previous generation started.
That trajectory, 60% to 64.4% to 72% across two releases, is among the steepest improvement curves published on any finance-specific AI benchmark. But the numbers, while striking, are the least interesting part of the story. What matters is why the scores improved, because those improvements map directly onto the workflows that consume the most time in private capital finance operations.
- 64.4%: Finance Agent benchmark score today, up from ~60% on Opus 4.6
- 72%: projected with Mythos Preview, a 20% leap from the previous generation
- 3.75 MP: maximum image resolution, 3x more visual detail than prior models
What "Finance Agent" Actually Means
Before unpacking the improvements, it's worth understanding what the Finance Agent evaluation tests. This is not a multiple-choice exam about financial concepts. It's a simulation of actual financial work: the model receives source documents, financial data, and a task — build a model, identify anomalies, reconcile figures across sources — and is evaluated on whether it produces the correct output through the correct reasoning chain.
The tasks include multi-step calculations with intermediate validation, cross-referencing data from different formats and sources, and applying domain-specific logic (covenant calculations, waterfall mechanics, portfolio attribution). A model that gets the final number right through an incorrect method doesn't pass.
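To make the flavor of these tasks concrete, here is a minimal sketch of a covenant-style calculation with the kind of intermediate validation the benchmark rewards. It illustrates the task shape only; the figures, the threshold, and the net_leverage_ratio helper are all invented, not an actual benchmark item.

```python
# Illustrative sketch only: a covenant-style calculation with intermediate
# validation. All figures and the covenant threshold are invented.

def net_leverage_ratio(debt_items: dict[str, float], cash: float, ebitda: float) -> float:
    """Net leverage = (total debt - cash) / EBITDA, with a validation step."""
    if ebitda <= 0:
        # Intermediate validation: the ratio is meaningless on non-positive EBITDA.
        raise ValueError("EBITDA must be positive to compute a leverage ratio")
    total_debt = sum(debt_items.values())
    return (total_debt - cash) / ebitda

debt = {"term_loan_a": 120.0, "revolver_drawn": 15.0, "sub_notes": 40.0}  # $mm, invented
ratio = net_leverage_ratio(debt, cash=22.0, ebitda=48.0)
covenant_max = 3.5  # hypothetical threshold from a hypothetical credit agreement

print(f"Net leverage: {ratio:.2f}x vs covenant max {covenant_max:.2f}x")
print("Compliant" if ratio <= covenant_max else "Technical default")
```

A model that reaches the right ratio by summing the wrong debt components, or by skipping the validation the task requires, does not pass; the reasoning chain is graded, not just the number.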
The jump from ~60% to 64.4% overall means the model is successfully completing tasks that the prior version failed — tasks that involve exactly the kind of structured, multi-step reasoning that private capital finance teams perform daily. And 81.3% on General Finance specifically — up from 76.7% — shows that the gains are concentrated in the modules most relevant to fund operations.
Opus 4.7 also sets a new state of the art on GDPval-AA, a third-party evaluation of economically valuable knowledge work spanning finance, legal, and other professional domains. This isn't a lab benchmark; it measures whether the model can do the kind of work that professionals actually get paid to do. Separately, on BigLaw Bench (a legal-domain evaluation by Harvey), Opus 4.7 scored 90.9%, and on Databricks OfficeQA Pro, which tests document comprehension across enterprise formats, it produces 21% fewer errors than Opus 4.6. The improvements aren't finance-specific; they reflect a broad leap in professional-grade reasoning.
Instruction Following: The Improvement That Changes Everything
The single most impactful improvement in Opus 4.7 for finance applications is substantially better instruction following. This sounds mundane until you think about what it means in practice.
Financial workflows are defined by precision. When a CFO says "calculate the preferred return at the LP level before applying the GP catch-up," every word in that sentence is load-bearing. "At the LP level" means something different from "at the fund level." "Before applying" specifies sequence. "Preferred return" has a specific contractual definition that varies by fund agreement.
Previous models interpreted instructions loosely — close enough for a draft, not close enough for a deliverable. If you asked for a waterfall calculation that applied the European whole-fund model, the model might produce something that looked right but applied deal-by-deal logic in the carry tier. If you asked for ILPA-compliant capital account statements, it might skip the since-inception contribution reconciliation.
Opus 4.7 takes instructions literally. When the prompt says "apply the hurdle rate on a compounded quarterly basis using actual/360 day count," the model does exactly that — not annual compounding, not 30/360, not an approximation. This is the difference between a tool that produces a first draft requiring human correction and a tool that produces a work product requiring human review. The distinction might seem subtle. It is not. The first costs you an analyst's afternoon. The second saves it.
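To see how much those words move the number, here is a minimal sketch with invented inputs: a $10mm contribution accruing an 8% hurdle for one year, once compounded quarterly on actual/360 and once compounded annually on 30/360. The quarterly_actual_360 helper is hypothetical, written only to contrast the two conventions.

```python
# Minimal sketch with invented dates and amounts: the same "8% preferred
# return" instruction produces different accruals under different
# compounding and day-count conventions.
from datetime import date

def quarterly_actual_360(principal: float, rate: float, quarter_ends: list[date]) -> float:
    """Compound quarterly; each quarter accrues rate * actual_days / 360."""
    balance = principal
    for start, end in zip(quarter_ends, quarter_ends[1:]):
        days = (end - start).days
        balance *= 1 + rate * days / 360
    return balance

dates = [date(2024, 1, 1), date(2024, 4, 1), date(2024, 7, 1),
         date(2024, 10, 1), date(2025, 1, 1)]
principal, rate = 10_000_000, 0.08

compounded = quarterly_actual_360(principal, rate, dates)
annual_30_360 = principal * (1 + rate)  # annual compounding, 30/360: one full year

print(f"Quarterly, actual/360: {compounded:,.0f}")
print(f"Annual, 30/360:        {annual_30_360:,.0f}")
```

On these invented inputs the two conventions diverge by about $38,000 in a single year on a single $10mm contribution. Across a fund's capital base and life, picking the wrong convention is not a rounding error.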
What this means for your team: Prompts and workflows built for earlier models may need adjustment, because Opus 4.7 follows instructions so precisely that loosely written prompts can produce unexpected results. If your existing workflow says "summarize the quarterly financials," Opus 4.7 will produce a summary. If what you actually want is a variance analysis with commentary, you now need to say so. This is a feature, not a limitation.
Reading Financial Documents at Full Resolution
Opus 4.7 accepts images up to 2,576 pixels on the long edge — approximately 3.75 megapixels. That's more than three times the resolution of prior Claude models. For a model that processes text, this might be a footnote. For a model that processes financial documents, it's transformative.
Consider what arrives in a typical portfolio company reporting package: multi-tab Excel files exported as PDFs with 8-point font and dense column structures, bank covenant compliance certificates with ratio calculations in cramped tables, audited financial statements with footnotes that modify the numbers in the body. Prior models could read these documents, but they lost detail: small fonts blurred, rows and columns in dense tables were misaligned, and footnotes were sometimes truncated.
At 3.75 megapixels, Opus 4.7 resolves the detail that matters:
- Covenant compliance certificates: Fine-print ratio calculations, cure period deadlines, and threshold tables — the exact details that determine whether a portfolio company is in technical default
- Fund accounting reconciliations: Dense trial balance exports where a single misread digit in row 147 propagates through every downstream calculation
- LP side letter provisions: Scanned documents with handwritten annotations, rider clauses in small print, and fee arrangement tables that differ by investor
- Board materials: Complex organizational charts, waterfall diagrams, and cap table summaries where the visual layout encodes relationships that text alone cannot capture
This doesn't just improve accuracy — it expands the universe of documents that can be processed without manual OCR pre-processing or re-formatting. The document arrives, the model reads it, and the data extraction begins. No human reformatting step in between.
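If your intake pipeline submits scans programmatically, a pre-resize step keeps large pages within the stated limit rather than relying on whatever downscaling happens server-side. Below is a minimal sketch using Pillow; the 2,576-pixel cap is the figure quoted above, while the helper and file names are invented for illustration.

```python
# Sketch: downscale a scanned page so its long edge fits within the stated
# 2,576-pixel cap before submission. Uses Pillow; the cap comes from the
# article, while the helper and file names are invented.
from PIL import Image

LONG_EDGE_MAX = 2576  # long-edge limit quoted above for Opus 4.7 image input

def fit_to_long_edge(path: str, out_path: str, cap: int = LONG_EDGE_MAX) -> None:
    with Image.open(path) as img:
        long_edge = max(img.size)
        if long_edge > cap:
            scale = cap / long_edge
            new_size = (round(img.width * scale), round(img.height * scale))
            img = img.resize(new_size, Image.Resampling.LANCZOS)
        img.save(out_path)

fit_to_long_edge("covenant_certificate_scan.png", "covenant_certificate_ready.png")
```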
Real-World Finance Work, Not Lab Exercises
Anthropic's internal testing found that Opus 4.7 produces "more rigorous analyses and models, more professional presentations, and tighter integration across tasks" compared to its predecessor. That description maps directly onto three workflows that consume the most senior time in private capital operations:
Financial modeling. The gap between a technically correct model and a professionally defensible model is large. Prior models could build a basic LBO or a cash flow projection. Opus 4.7 produces models with proper error checks, sensitivity tables, assumption documentation, and output formatting that a managing director can present to an investment committee without reformatting. The model doesn't just calculate — it structures the calculation the way a senior analyst would.
LP reporting packages. "More professional presentations" means the gap between AI-generated output and what the IR team would actually send to LPs is narrowing. Cover page formatting, consistent chart styles, proper footnote numbering, ILPA-aligned table structures — the finishing work that previously took a full day of human polish is increasingly handled by the model itself.
Cross-task integration. This is the least obvious improvement and potentially the most valuable. Financial workflows are chains: data extraction feeds reconciliation, reconciliation feeds valuation, valuation feeds reporting, reporting feeds the LP letter. Prior models handled each step well in isolation but lost context across steps — a reconciliation adjustment in step two wouldn't propagate to the commentary in step five. Opus 4.7 maintains tighter integration across these chains, reducing the manual stitching that currently makes multi-step workflows fragile.
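The failure mode is easiest to see in miniature. The sketch below, with invented step names and data, shows the property that matters: every step reads from and writes to one shared context, so an adjustment recorded at step two is still visible when commentary is drafted at step five. This illustrates the integration idea generically, not how Opus 4.7 implements it.

```python
# Generic illustration of chain integration: steps share one context, so an
# upstream reconciliation adjustment propagates to downstream commentary.
# Step names and data are invented.
context: dict = {"adjustments": []}

def reconcile(context: dict) -> None:
    # Step 2: a reconciliation adjustment is recorded in shared context.
    context["adjustments"].append("Q2 revenue restated down $1.2mm at PortCo Y")

def draft_commentary(context: dict) -> str:
    # Step 5: commentary must reflect every upstream adjustment.
    notes = "; ".join(context["adjustments"]) or "no adjustments"
    return f"Quarter commentary (reflects: {notes})"

reconcile(context)
print(draft_commentary(context))
```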
Memory: The Quiet Revolution
The least-discussed improvement may be the most significant for how finance teams actually use AI. Opus 4.7 is better at using file-system-based memory: it retains important context across long, multi-session work and draws on that context to pick up new tasks with less up-front briefing.
Why does this matter for finance? Because financial work is inherently multi-session. A quarterly close doesn't happen in one sitting. The team works on data collection Monday, reconciliation Tuesday through Thursday, report assembly the following week. Each session builds on what came before.
With prior models, each session started cold. The context that was painstakingly established yesterday — that Fund III uses a European waterfall, that LP X has a side letter modifying their fee arrangement, that portfolio company Y restated Q2 revenue — had to be re-established before work could begin. It was like briefing a new contractor every morning on a project they'd been working on all week.
Opus 4.7 carries forward the notes that matter. Fund-specific conventions, LP-specific requirements, previously identified data issues, reconciliation adjustments from earlier in the cycle — the model retains and applies this context without being reminded. The practical effect is that sessions two through ten of a multi-day workflow start faster and produce more consistent output, because the model is building on accumulated context rather than rediscovering it each time.
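As a generic illustration of the pattern (not Anthropic's implementation), file-system memory can be as simple as durable notes that each session appends to and the next session reads back. The file name and note contents below are invented.

```python
# Generic sketch of file-system-based memory: durable notes persist between
# sessions so day two starts with day one's context. Invented file and notes.
import json
from pathlib import Path

MEMORY_FILE = Path("fund_iii_close_notes.json")

def load_notes() -> list[str]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def save_note(note: str) -> None:
    notes = load_notes()
    notes.append(note)
    MEMORY_FILE.write_text(json.dumps(notes, indent=2))

# Session one establishes context; session two reads it back instead of re-briefing.
save_note("Fund III uses a European (whole-fund) waterfall")
save_note("LP X side letter: management fee stepped down to 1.25%")
print("\n".join(load_notes()))
```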
For a quarterly close process that spans two weeks and involves dozens of working sessions, this compounds. Each session that starts with context intact instead of cold saves 15-30 minutes of re-briefing — and eliminates the inconsistencies that arise when context is re-established imprecisely.
The Trajectory That Matters
The jump from ~60% to 64.4% on the Finance Agent benchmark — with 72% in sight via Mythos Preview — is not just a number. It's a signal about where AI capability stands relative to what finance teams actually need, and how fast the gap is closing.
At 60%, a model handles straightforward financial tasks reliably but fails on two in five complex multi-step problems. At that level, human review isn't optional — it's load-bearing. Every output requires a trained eye to catch the cases where the model's reasoning broke down.
At 64.4% today, and 72% soon, the failure rate on complex tasks drops meaningfully. More importantly, the type of failure changes. At 60%, failures were often structural — the model chose the wrong approach entirely. At 64% and above, failures are more likely to be edge cases — unusual fund structures, ambiguous contract language, data quality issues that would challenge a human analyst too. These are reviewable failures, not rebuild-from-scratch failures.
This matters because it changes the economics of AI-assisted finance work. When the model's output requires a full rebuild 40% of the time, the time savings over doing it manually are modest. When the model's output requires targeted corrections on a shrinking share of edge cases — corrections that a senior analyst can spot in minutes — the time savings are substantial.
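A back-of-envelope calculation makes the point, using the article's failure rates and invented per-task time costs. The hour figures below are assumptions for illustration, not measurements.

```python
# Back-of-envelope sketch of the review economics. Failure rates come from
# the article; all hour figures are invented assumptions.
manual_hours = 4.0          # assumed: doing the task fully by hand
review_hours = 0.5          # assumed: reviewing a correct AI output
rebuild_hours = 4.5         # assumed: review plus a full rebuild on failure
correction_hours = 1.0      # assumed: review plus a targeted edge-case fix

def expected_hours(fail_rate: float, fail_cost: float) -> float:
    return (1 - fail_rate) * review_hours + fail_rate * fail_cost

at_60 = expected_hours(0.40, rebuild_hours)     # structural failures: rebuild
at_72 = expected_hours(0.28, correction_hours)  # edge cases: targeted fix

print(f"Manual: {manual_hours:.1f}h | at 60%: {at_60:.2f}h | at 72%: {at_72:.2f}h")
```

With those assumed costs, AI assistance at a 40% rebuild rate roughly halves the manual effort, while at a 28% targeted-correction rate it cuts it by more than 80%. The exact figures depend entirely on the assumed hours; the shape of the curve does not.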
Anthropic is not the only company building AI models. But it is the only one that publishes a finance-specific agent benchmark, invests in finance-domain capability as a named priority, and is tracking toward a 20% improvement across two generations. That trajectory, and the willingness to measure it publicly, tells you something about where the company is going.
See Opus 4.7 in Action on Your Data
Equiforte runs on Claude's latest models. Book a demo to see how Opus 4.7 handles your firm's actual reporting workflows — not a generic simulation.