Benchmark
FinSpread-Bench
Introduction
We chose financial spreading as our first benchmark task because, until now, it has evaded traditional approaches to automation. It is complex, requires both ad-hoc calculations and human judgement, and therefore serves as a yardstick for AI capabilities in wider financial services decisioning.
Financial spreading turns messy documents into standardized financial models. Borrowers send PDFs of annual accounts, tax filings, and management reports. Then, an analyst aligns time periods, extracts line items, normalizes definitions, and computes the metrics that drive a credit decision. Spreading is also where friction and inconsistency enter the process: it is time-consuming and error-prone.
Automated spreading promises more than efficiency gains. It would allow institutions to underwrite complex loans in near real-time and unlock entirely new customer experiences. It could also minimize human error and inconsistency. We are introducing FinSpread-Bench to track progress on automating this important financial services task.
Task description
Financial spreading goes beyond extracting numbers from documents. The agent must (a) convert financial statements across different formats and structures into a standardized output schema, (b) extract key metrics tailored to the customer’s specific spreading logic and business rules, and (c) pull relevant qualitative information from the reports, such as risks from debt information, customer concentration, and audit notes. Finally, the agent must flag documents that cannot be reliably spread, based on preconfigured rules.
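To make the output concrete, a standardized spread might be represented as below. This is an illustrative sketch, not the benchmark's actual schema; all field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SpreadResult:
    """Illustrative standardized spreading output (field names are assumptions)."""
    periods: dict            # e.g. {"FY2024": {"revenue": ..., "ebitda": ...}}
    qualitative_notes: list  # extracted risks, customer concentration, audit notes
    flags: list              # preconfigured reasons the document cannot be reliably spread
```

Under this sketch, a document that trips a preconfigured rule would carry a non-empty `flags` list and could be routed to a human reviewer instead of being spread automatically.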
Matching the rigor of an excellent human underwriter requires handling complex ambiguities. For example, the agent must detect unaligned periods across statements and sometimes fall back to calculating cash flow metrics from the balance sheet. It must identify duplicate interest expenses that appear both within operating expenses and in an EBITDA bridge to avoid double-counting. It must parse notes in the financial statements to split long-term debt into current and non-current portions based on maturity schedules. And it must apply customer-specific logic to tailor interest expense definitions, determining whether inter-company advances, operating lease interest, or other non-standard categories should be included.
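As one example of the maturity-schedule logic above, splitting long-term debt into current and non-current portions can be sketched as follows. This is a minimal illustration with hypothetical inputs, not the benchmark's implementation; it uses a naive 12-month cutoff and ignores leap-day edge cases:

```python
from datetime import date

def split_debt_by_maturity(instruments, reporting_date):
    """Split debt into current (maturing within 12 months of the reporting
    date) and non-current portions.

    `instruments` is a list of (amount, maturity_date) tuples taken from a
    maturity schedule in the notes to the financial statements.
    """
    # Naive 12-month cutoff; a production rule set would follow the
    # institution's own conventions.
    cutoff = date(reporting_date.year + 1, reporting_date.month, reporting_date.day)
    current = sum(amount for amount, maturity in instruments if maturity <= cutoff)
    non_current = sum(amount for amount, maturity in instruments if maturity > cutoff)
    return current, non_current
```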
Test dataset composition
The test dataset spans 1,312 fields across 84 documents based on real underwriting cases. It includes a wide variety of source documents that reflect real-world diversity: high-level condensed annual reports, detailed account statements with granular line items, separated financial statements covering different periods within the same document set, and documents with varying structures, formats, and levels of detail.
Evaluation methodology
We focus on metrics that are straightforward to explain and close to the business objective. Overall field match rate is the primary metric because it aligns with how underwriters operationalize correctness: for each field, did the agent produce the value an underwriter would arrive at by precisely following the institution’s spreading logic? Human baseline performance on this dataset is around 89% field match rate: human underwriters do not perfectly adhere to underwriting guidelines, and typical errors include inconsistently applying ambiguous guidelines, ignoring notes in financial statements, and typos.
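The primary metric can be sketched as a strict field-level comparison. A minimal illustration, assuming exact-match scoring over a flat dict of fields (the benchmark's actual normalization and tolerance rules may differ):

```python
def field_match_rate(expected, produced):
    """Fraction of expected fields for which the agent produced the exact
    expected value; missing fields count as mismatches."""
    matches = sum(1 for name, value in expected.items()
                  if produced.get(name) == value)
    return matches / len(expected)
```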
The agent is parameterized by three configuration dimensions. Each dimension defines a set of testable alternatives:
Reasoning model
- GPT-5.2
- GPT-5
- Claude Opus 4.6
- Claude Sonnet 4.5
- Claude Haiku 4.5
- Gemini 3.1 Pro
- Gemini 2.5 Pro
Doc extraction model
- Gemini 3.1 Pro
- Gemini 3 Flash
- Gemini 2.5 Pro
- Gemini 2.5 Flash
Tooling
- All tools
- No calculator tool
A full factorial design over these dimensions would yield 7 × 4 × 2 = 56 configurations. For simplicity, we instead start from the best configuration and perturb one dimension at a time, holding the other two constant.
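The one-dimension-at-a-time design can be sketched as follows; the choice of base configuration here is purely illustrative, not a statement of which configuration won:

```python
DIMENSIONS = {
    "reasoning_model": ["GPT-5.2", "GPT-5", "Claude Opus 4.6", "Claude Sonnet 4.5",
                        "Claude Haiku 4.5", "Gemini 3.1 Pro", "Gemini 2.5 Pro"],
    "doc_extraction_model": ["Gemini 3.1 Pro", "Gemini 3 Flash",
                             "Gemini 2.5 Pro", "Gemini 2.5 Flash"],
    "tooling": ["All tools", "No calculator tool"],
}

def one_at_a_time(best):
    """Yield the base configuration plus every configuration that differs
    from it in exactly one dimension."""
    configs = [dict(best)]
    for dim, options in DIMENSIONS.items():
        for option in options:
            if option != best[dim]:
                configs.append({**best, dim: option})
    return configs
```

With this design, a single base configuration yields 11 runs (the base plus 6 + 3 + 1 single-dimension perturbations) instead of the 56 a full factorial would require.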
Results
We found that the major frontier models exceed human performance in financial spreading: GPT-5.2, Gemini 3.1 Pro, and Opus 4.6 all comfortably land above 90% match rate for this task. However, the latest model generation is necessary to achieve this. Smaller and older models such as Haiku are not sufficient for this complex task. This finding supports the conclusion that AI crossed a meaningful performance boundary in late 2025, which now makes it possible to automate harder, more ambiguous tasks than before.
Across the test matrix, we see meaningful variance in accuracy across configurations:
Accuracy is strongly dependent on the reasoning model. Moving away from the latest frontier models leads to rapid performance degradation (e.g., comparing the results of Gemini 3.1 Pro vs. Gemini 2.5 Pro as the reasoning model). This indicates that failures mostly stem from interpretation and judgement rather than incorrect extractions from the source documents. In practice, this means frontier reasoning models are currently required to reach the match rate levels reported here, and this level of performance has only become feasible very recently as these models have improved.
We also see a clear model size effect within reasoning models: smaller models (e.g., Claude Haiku vs. Sonnet) underperform in large part due to poor schema reliability rather than incorrect values. In our runs, 59.5% of Haiku outputs and 38.1% of Sonnet outputs failed to produce a complete schema. Conditional on producing a complete schema, match rates were 84.3% (Haiku) and 88.2% (Sonnet). These conditional figures should be treated cautiously: schema failures concentrate in the most complex cases, so if those cases had produced complete schemas, the conditional match rates would likely be lower.
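One way to combine the two figures: if every schema failure is scored as a complete mismatch, the effective match rate is the product of the completion rate and the conditional match rate. A quick sketch of this arithmetic (a scoring assumption for illustration, not how the benchmark reports results):

```python
def effective_match_rate(schema_failure_rate, conditional_match_rate):
    """Overall match rate when a schema failure scores zero on every field."""
    return (1 - schema_failure_rate) * conditional_match_rate
```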
On the document extraction side, Gemini 2.5 Pro already seems to deliver comparable results to Gemini 3.1 Pro. Both Flash models appear more inconsistent in extraction accuracy today, implying a tradeoff between cost/latency and accuracy. Notably, the remaining extraction inaccuracies are usually not simple “number typos”, but failures to correctly interpret which period a table view refers to, leading to values being allocated to the wrong period. We expect the faster models to keep improving, and we hope they eventually become strong enough to reduce this tradeoff so we can default to the faster and cheaper variants more often.
To better understand performance drivers, we separate the evaluation into two sub-tasks: multi-period spreading, which extracts and harmonizes financial data across multiple periods (e.g., fiscal year 2023 and fiscal year 2024) to enable cross-period comparison, and year-to-date (YTD) spreading, which focuses on spreading and harmonizing year-to-date financials for in-period monitoring and decision-making.
YTD spreading is the harder task. Unlike annual financial statements that typically report complete periods, YTD packages often contain partial months or a mix of monthly and cumulative views, and the “correct” YTD total may need to be inferred rather than read directly: the agent must (i) correctly interpret which time window each table represents, (ii) reconcile inconsistent period cutoffs across statements, and (iii) compute missing totals by summing the right months while avoiding double counting when both monthly and YTD figures are present (see complex example in the introduction). These steps introduce more implicit logic and more opportunities for period misalignment, which is why YTD accuracy is more sensitive to reasoning and tooling support.
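Steps (i)–(iii) can be illustrated with a small sketch that infers a YTD total from a mix of monthly and cumulative columns, trusting the latest cumulative view and adding only the monthly values that follow it. This is a simplification: real YTD packages also require reconciling inconsistent period cutoffs across statements.

```python
def infer_ytd_total(columns, through_month):
    """Infer a year-to-date total from a mix of monthly and cumulative columns.

    `columns` is a list of dicts like
    {"month": 3, "kind": "monthly" | "ytd", "value": ...}.
    We take the latest cumulative ("ytd") view at or before `through_month`
    and add the monthly values that come after it, avoiding double counting.
    """
    ytd_cols = [c for c in columns if c["kind"] == "ytd" and c["month"] <= through_month]
    base_month, total = 0, 0.0
    if ytd_cols:
        latest = max(ytd_cols, key=lambda c: c["month"])
        base_month, total = latest["month"], latest["value"]
    for c in columns:
        if c["kind"] == "monthly" and base_month < c["month"] <= through_month:
            total += c["value"]
    return total
```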
With respect to tooling, the largest remaining failure mode is numerical aggregation, especially when values must be combined across periods. Without reliable math support, accuracy degrades substantially for complex cases. This effect is much stronger in the YTD documents than in the multi-period documents, since YTD spreading is the most computation-heavy, whereas in multi-period (annual financial statement) spreading many fields can be extracted as-is or only require simple additions and subtractions.
Observed accuracy differences typically decompose into three types of error: (i) extraction errors where the source numbers are captured incorrectly, (ii) reasoning or mapping errors where the correct number is present but assigned to the wrong field or treated with the wrong convention, and (iii) computation errors where values must be calculated or aggregated across periods. When using the most powerful configuration in our test matrix, the remaining mistakes are dominated by (ii), mapping and convention errors. In particular, debt classification (e.g., short-term vs. long-term debt and current vs. non-current portions) requires nuanced judgement, often relying on institution-specific rules about how to map specific debt instruments and disclosure patterns into standardized fields. As a result, achieving near-perfect adherence typically requires iteration on these mapping nuances as edge cases are discovered.
We do not discuss cost and latency in detail here since both metrics are orders of magnitude below the human baseline. Completing the task currently takes 5–15 minutes and we expect this to drop significantly over the next six months as new models become available.
Conclusion
FinSpread-Bench provides a repeatable evaluation framework for automated financial spreading, and the first results point to a few clear conclusions. Frontier reasoning models are crucial: accuracy degrades rapidly without them, while extraction quality is largely model-agnostic at this point in time. Most notably, the agent crossed the human performance threshold in late 2025, a milestone made possible by models released in December. Future work will focus on robustness and consistency at the margin, as well as further improvements in cost and latency. We will update results as new models are released.