PolicyBench

Benchmarking no-tool tax-and-benefit estimation in frontier language models

Max Ghenis

Abstract

PolicyBench evaluates whether frontier language models can estimate household tax and benefit outputs from household facts without tools. The benchmark covers the United States and the United Kingdom, uses sampled US Enhanced Current Population Survey (CPS)-derived scenarios and a public UK calibrated transfer dataset, and reports three complementary scores: a within-1% hit rate as the headline ranking metric, an exact-match rate as the deployability bar (a prediction counts only if it matches the PolicyEngine reference to the dollar or to the eligibility flag), and a continuous bounded score that awards partial credit for close answers. The headline metric is within-1% rather than exact match because the panels are zero-inflated: 71% of UK reference outputs and 82% of US reference outputs are exact zeros. Zero inflation compresses top-of-leaderboard exact-match rates toward the share of zero cases that a model hedging to zero can capture, which makes within-1% the more discriminating ranking on the positive-reference cases that drive quality differences. The manuscript snapshot is a 100-household-per-country public preview, not a protected held-out leaderboard: the current public scenario explorer exposes prompts and reference outputs, so open-set leakage is a central limitation of any public ranking. In the frozen snapshot used for this manuscript (2026-05-20), GPT 5.5 is top-scoring on within-1% in both country tables, at 79.4 in the US and 59.0 in the UK. In both countries, multi-step tax quantities and positive-dollar benefit cases are materially harder than zero cases. The live benchmark is available at https://policybench.org.

Introduction

Household tax-and-benefit estimation sits between arithmetic and policy discussion. Each case has numeric labels, but generating those labels requires filing status, household composition, income concepts, program thresholds, and jurisdiction-specific rules. PolicyBench measures whether models can map household records to those outputs.

Most current language-model evaluations do not test this mapping directly. General math benchmarks emphasize symbolic manipulation and exact final answers (Cobbe et al. 2021; Hendrycks et al. 2021), while generic question-answering benchmarks emphasize recall or instruction following. PolicyBench instead asks whether models can transform household facts into policy outputs without access to tools or simulators.

This matters for public-facing calculators, analyst workflows, and tax-preparation or screening systems. The benchmark is intended to separate verbal policy fluency from household-level quantitative prediction.

The most relevant prior work is tax and statutory reasoning. SARA frames statutory interpretation, including tax-relevant rule application, as a language-understanding problem (Holzenberger et al. 2021). LegalBench broadens this view and includes tax-oriented numeric tasks such as sara_numeric (Guha et al. 2023). RuleArena evaluates rule-guided reasoning in regulation-style settings (Zhou et al. 2025), and TaxCalcBench evaluates frontier models on tax calculation from structured return-like inputs (Bock et al. 2025). Shanahan et al. (2025) report a similar split on the Volunteer Income Tax Assistance (VITA) test: models do better on tax knowledge questions than on open-ended calculation tasks.

PolicyBench also sits within a broader literature on quantitative reasoning benchmarks. Canonical math evaluations such as GSM8K and MATH established exact final-answer scoring as the norm for numeric tasks (Cobbe et al. 2021; Hendrycks et al. 2021). Recent work has questioned pure exact-match scoring in numeric QA and temporal reasoning, arguing for error-aware metrics instead (Abbood et al. 2025). PolicyBench treats this as a measurement design choice rather than a forced pick: it leads with a within-1% hit rate as the headline ranking metric, retains the exact-match rate as the deployability bar (a tax filer or benefits caseworker can’t ship “close”), and reports a bounded continuous score that awards relative-error partial credit as a secondary tracking diagnostic. The headline choice is driven by zero inflation in the benchmark panels, particularly in the UK, where 71% of reference outputs are exact zeros. Zero inflation compresses exact-match leaderboards toward the share of zero cases that a model hedging to zero can capture, and it gives within-1% more ranking power on the positive-reference cases.

Finance-domain benchmarks provide another relevant precedent. FinQA and TAT-QA both show that realistic quantitative reasoning often requires combining structured numbers with domain knowledge rather than solving stylized school-math problems (Chen et al. 2021; Zhu et al. 2021). PolicyBench differs in that its reference outputs are generated by executable tax-benefit microsimulation rather than annotated document questions, but the underlying motivation is similar: domain-specific quantitative reasoning deserves dedicated evaluation.

Finally, PolicyBench depends on two infrastructure literatures that are not themselves large language model (LLM) benchmarks. One is structured-output reliability, since the benchmark relies on multi-output JSON responses and parse coverage rather than free-form prose (Shorten et al. 2024). The other is tax-benefit microsimulation, where systems such as EUROMOD provide the methodological precedent for evaluating policy rules over household microdata (Sutherland and Figari 2013). There is also operational work on applying and evaluating AI systems in public-benefits settings, including caseworker-assist and Supplemental Nutrition Assistance Program (SNAP)-focused evaluations (Nava Labs 2026, 2025; ZenML LLMOps Database 2025). PolicyBench combines these strands into a public cross-model benchmark over household-level tax and benefit outputs.

Benchmark design

PolicyBench asks models to predict all benchmark outputs for a household in a single structured response. One response per household reduces repeated prompt cost and keeps the task at the household-return level rather than turning it into a sequence of unrelated one-output calls.

Primary metric: within 1%

The primary headline metric is the within-1% hit rate: the share of requested outputs the model gets within one percent of the PolicyEngine reference value. For currency amounts, “within 1%” means |pred − ref| ≤ 0.01 × |ref| when the reference is nonzero and |pred| ≤ 1 (the same one-currency-unit absolute tolerance used by exact match) when the reference is zero. Binary outputs are requested as integer 0/1 eligibility flags and scored identically to the exact-match metric.
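The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code; the function names `within_pct_hit` and `binary_hit` and their parameters are hypothetical.

```python
def within_pct_hit(pred: float, ref: float, pct: float = 0.01, zero_tol: float = 1.0) -> bool:
    """Hit test for a currency amount at a relative threshold (default 1%)."""
    if ref == 0:
        # Zero reference: fall back to the one-currency-unit absolute tolerance.
        return abs(pred) <= zero_tol
    # Nonzero reference: relative error is measured against |ref|.
    return abs(pred - ref) <= pct * abs(ref)


def binary_hit(pred: int, ref: int) -> bool:
    """Eligibility flags must be 0 or 1 and must equal the reference flag."""
    return pred in (0, 1) and pred == ref


print(within_pct_hit(1009.0, 1000.0))  # True: off by 9, tolerance is 10
print(within_pct_hit(1020.0, 1000.0))  # False: off by 20
print(within_pct_hit(0.5, 0.0))        # True: within one currency unit of zero
```

Passing `pct=0.05` or `pct=0.10` would give the looser threshold diagnostics under the same zero-reference rule.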

Within-1% is the headline because the benchmark panels are zero-inflated. In the frozen manuscript snapshot, 71% of UK reference outputs and 82% of US reference outputs are exact zeros: most sampled households are not eligible for any given program, owe no capital gains tax, do not phase into a refundable credit, and so on. A model that hedges to zero on every requested output therefore earns full exact credit on every zero case, which compresses top-of-leaderboard exact-match rates toward the share of zero cases the panel contains. In the UK panel this compression is severe: the top seven models in the manuscript snapshot are tied within 0.6 percentage points on exact match but span 4.6 percentage points on within-1%, because within-1% credits close-but-not-exact predictions on the 29% of positive-reference cases where rule comprehension is actually tested. Within-1% inherits the same one-currency-unit tolerance on zeros, so a hedging baseline gets no extra credit from the relaxed threshold; the discriminating signal comes from the credit for near-exact answers on positive-reference cases.
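A toy calculation makes the hedging argument concrete. The sketch below uses an unweighted, made-up panel with 70% zero references (not benchmark data) and hypothetical helper names. The benchmark's actual scores apply output and household weights, so its baseline figures differ.

```python
def exact_hit(pred, ref, tol=1.0):
    """Exact match: within one currency unit of the reference."""
    return abs(pred - ref) <= tol


def within_1pct_hit(pred, ref, zero_tol=1.0):
    """Within 1% for nonzero references; one-unit tolerance for zero references."""
    if ref == 0:
        return abs(pred) <= zero_tol
    return abs(pred - ref) <= 0.01 * abs(ref)


def hit_rate(preds, refs, hit):
    """Unweighted share of outputs that count as hits."""
    return sum(hit(p, r) for p, r in zip(preds, refs)) / len(refs)


# Illustrative panel: seven zero references and three positive references.
refs = [0, 0, 0, 0, 0, 0, 0, 1200, 350, 4800]
always_zero = [0] * len(refs)
close_model = [0, 0, 0, 0, 0, 0, 0, 1195, 352, 4900]

zero_exact = hit_rate(always_zero, refs, exact_hit)         # 0.7
zero_within = hit_rate(always_zero, refs, within_1pct_hit)  # 0.7
close_exact = hit_rate(close_model, refs, exact_hit)        # 0.7
close_within = hit_rate(close_model, refs, within_1pct_hit) # 0.9
print(zero_exact, zero_within, close_exact, close_within)
```

In this toy panel the always-zero responder and the close model tie on exact match, while only the close model gains under within-1%.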

The within-1% rate is aggregated to a household score and then to a country score, with the same population-derived output weights that the secondary bounded score uses. Equal household weight within each country preserves comparability across scenarios and slices. We report US and UK leaderboards separately rather than collapsing them into a combined cross-country rank.

Deployability bar: exact match

PolicyBench retains the exact-match rate as the deployability bar: the share of requested outputs the model gets right to the dollar (or, for eligibility flags, to the boolean). A tax filer, a benefit estimator, or a caseworker cannot ship “close” — a prediction off by $50 is no more usable than one off by $500, because neither matches the bottom line a downstream system or a household needs. For currency amounts, “exact” means within 1 currency unit of the reference value after numeric parsing. For binary outputs, “exact” means the parsed value is exactly the same 0/1 flag as the reference.
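As a minimal sketch of this definition (hypothetical function name and arguments, not the benchmark's scoring code), the exact-match test treats a $50 miss and a $500 miss identically:

```python
def exact_hit(pred: float, ref: float, is_binary: bool = False, tol: float = 1.0) -> bool:
    """Exact-match test: one-currency-unit tolerance for amounts, equality for flags."""
    if is_binary:
        # Flags must be a valid 0/1 value and equal the reference flag.
        return pred in (0, 1) and pred == ref
    return abs(pred - ref) <= tol


reference = 3200.0
for prediction in (3200.4, 3250.0, 3700.0):
    # Only the first prediction is within one currency unit of the reference.
    print(prediction, exact_hit(prediction, reference))
```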

The exact-match rate uses the same per-output weighting as the primary within-1% metric. It is reported alongside the headline in every leaderboard table and remains the bar a production-facing system must clear. We report it as the deployability bar rather than the headline ranking because in zero-inflated panels it is dominated by zero-case behavior and discriminates less well between models that understand the policy rules and models that simply hedge to zero.

Secondary metric: bounded continuous score

PolicyBench also reports a bounded 0-100 score that awards smooth partial credit for close-but-not-exact amount answers. This is informative for tracking conceptual progress year over year while exact rates remain low, and it captures the qualitative gap between a model that is consistently close and one that is consistently far from the reference value. For amount variables, the row-level bounded score is max(0, 1 - |pred - ref| / |ref|) when the reference value is nonzero and 1{pred = 0} when the reference value is zero. For binary outputs, it uses exact 0/1 matching. The exact, within-1%, within-5%, and within-10% hit rates are reported separately as threshold diagnostics; they are not averaged to create the bounded score.
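The row-level formula for amount variables can be written as a short function. This is an illustrative sketch with a hypothetical name; it returns a value in [0, 1], which would be multiplied by 100 for the 0-100 scale.

```python
def bounded_row_score(pred: float, ref: float) -> float:
    """Row-level bounded score for an amount output, in [0, 1]."""
    if ref == 0:
        # Zero reference: credit only for a prediction of exactly zero.
        return 1.0 if pred == 0 else 0.0
    # Nonzero reference: one minus relative error, floored at zero.
    return max(0.0, 1.0 - abs(pred - ref) / abs(ref))


# Relative errors of 0%, 10%, 50%, and 150% against a reference of 1000.
scores = [bounded_row_score(p, 1000.0) for p in (1000.0, 900.0, 1500.0, 2500.0)]
print(scores)
```

The floor at zero means a prediction that is off by more than the reference itself earns no credit, however large the error.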

PolicyBench also tracks mean absolute error and related error metrics. These are secondary diagnostics, subordinate to the primary within-1% metric, the exact-match deployability bar, and the bounded score. The three-metric design follows recent numeric-evaluation literature that questions pure exact-match scoring (Abbood et al. 2025) without abandoning exactness as the production bar.

All requested US outputs are annual amounts or annual eligibility indicators for tax year 2026; UK outputs are annual amounts or annual eligibility indicators for fiscal year 2026-27. For currency amounts, the “exact” hit rate means within 1 currency unit of the reference value after numeric parsing. Percentage-threshold hit rates use relative error when the reference value is nonzero and the same 1 currency-unit absolute tolerance when the reference value is zero. Binary outputs are parsed as numeric 0/1 eligibility flags and scored by exact classification accuracy. Missing, unparseable, or non-0/1 answers receive zero score for that requested output.

Aggregation proceeds in three steps. First, each household-output prediction receives a 0-100 score. Second, each household receives one model score: requested output rows are weighted by output-group weights constructed from the full weighting population, then renormalized within the household over outputs that are actually requested for that household. Person-level coverage flags are scored at the person row; their output-group weight is split across the relevant people in the household and is based on PolicyEngine value proxies where available. Third, country scores average those household scores with equal household weight. The paper reports US and UK leaderboards separately rather than averaging them into a combined cross-country rank. Equal-output-group scores and other alternative views are reported as sensitivity checks.
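The three steps can be sketched as follows. The function names and data layout are hypothetical, and the sketch assumes an even split of a person-level output group's weight across people; the benchmark bases that split on PolicyEngine value proxies where available.

```python
from collections import Counter


def household_score(rows, group_weights):
    """Weighted household score from (output_group, row_score) pairs, scores in [0, 100].

    A person-level output group appears once per relevant person in `rows`.
    """
    rows_per_group = Counter(group for group, _ in rows)
    # Assumption: split each group's weight evenly across its rows (e.g. people).
    raw = [group_weights[group] / rows_per_group[group] for group, _ in rows]
    total = sum(raw)
    if total <= 0:
        raise ValueError("household has no positively weighted requested outputs")
    # Dividing by the total renormalizes over the outputs actually requested.
    return sum(w * score for w, (_, score) in zip(raw, rows)) / total


def country_score(household_scores):
    """Equal household weight within a country."""
    return sum(household_scores) / len(household_scores)


# Illustrative weights and row scores only.
weights = {"income_tax": 0.6, "snap": 0.3, "medicaid": 0.1}
h1 = household_score([("income_tax", 100.0), ("snap", 0.0)], weights)
h2 = household_score([("income_tax", 0.0), ("medicaid", 100.0), ("medicaid", 0.0)], weights)
print(h1, h2, country_score([h1, h2]))
```

Because weights are renormalized within each household, a household that is asked for fewer outputs still contributes one full household score to the country average.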

The benchmark requires each model response to include numeric answers and one explanation per requested output. The within-1% rate, the exact-match rate, and the bounded score all use only the numeric answers. Explanations are retained for scenario exploration and qualitative error analysis; they should not be interpreted as faithful traces of model reasoning.

Because explanations are required, the canonical task measures policy estimation under a public-facing structured-response contract, not isolated arithmetic accuracy. Prompt fairness is part of the benchmark contract. The current release uses one prompt template per country, with no model-specific tuning. Models receive the same household facts, the same requested outputs, and no web or tool access. The prompt states that unlisted numeric inputs are 0, unlisted boolean or status facts are false, and household characteristics are constant over the tax-benefit year. Provider-specific differences are limited to the response transport needed to obtain structured output.

Frozen snapshot and open-set status

Leaderboard positions are version-sensitive, so manuscript claims refer to the frozen source-run exports in Table 1 rather than to the live site. The committed source-run exports are the manuscript artifacts. The public site exposes the current prompts, predictions, explanations, and reference outputs for transparency. This makes the public leaderboard open-set: models, model providers, or benchmark users could learn from the released cases before later runs. Protected leaderboard claims would require a separate held-out or rotating set.
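A reader who retrieves the frozen artifacts can compare them against the SHA-256 prefixes recorded in Table 1. The sketch below assumes that each recorded prefix is the first 12 hexadecimal characters of the SHA-256 digest of the raw file bytes; that convention, and the helper names, are assumptions rather than documented behavior.

```python
import hashlib


def sha256_prefix_of_bytes(data: bytes, length: int = 12) -> str:
    """First `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


def sha256_prefix_of_file(path: str, length: int = 12, chunk_size: int = 1 << 20) -> str:
    """Same prefix, computed by streaming a file so large exports fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def matches_recorded_prefix(path: str, recorded_prefix: str) -> bool:
    """Compare a local file against a recorded hash prefix (case-insensitive)."""
    return sha256_prefix_of_file(path, len(recorded_prefix)) == recorded_prefix.lower()


print(sha256_prefix_of_bytes(b"abc"))  # ba7816bf8f01
```

A prefix match is a quick integrity check, not a cryptographic guarantee; the full digest is the stronger comparison where it is recorded.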

Table 1: Frozen manuscript snapshot.
| Item | Value |
| --- | --- |
| Snapshot date | 2026-05-20 |
| Model response date | 2026-05-13 to 2026-05-20 |
| Policy period | US tax year 2026; UK fiscal year 2026-27 |
| Frozen export | paper/snapshot/20260501/runs/us_full_run_20260513_policyengine_4_4_4_nested_outputs/data.json; paper/snapshot/20260501/runs/uk_full_run_20260513_policyengine_4_4_4_nested_outputs/data.json |
| Frozen export SHA-256 prefixes | US 1016061c8f4a; UK 9dc661d1ee7e |
| Snapshot manifest | paper/snapshot/20260501/manifest.json |
| Response retry artifacts | paper/snapshot/20260501/response_retries |
| Row-repair artifacts | paper/snapshot/20260501/row_repairs |
| Deviation annotations | annotations/full_run_20260513_policyengine_4_4_4_nested_outputs |
| US scenarios SHA-256 prefix | b05091225c06 |
| US reference outputs SHA-256 prefix | 04febaadd091 |
| UK scenarios SHA-256 prefix | eee79fb1db36 |
| UK reference outputs SHA-256 prefix | 6286a238e54a |
| Benchmark spec SHA-256 prefix | ce233f8cbb05 |
| Frozen prompt payload SHA-256 prefixes | US 15e3c512a846; UK 2f64b11a99aa |
| Scoring code SHA-256 prefix | 42795fb19a48 |
| US run label | us_full_run_20260513_policyengine_4_4_4_nested_outputs |
| UK run label | uk_full_run_20260513_policyengine_4_4_4_nested_outputs |
| Household sample seed | 42 |
| PolicyEngine.py | 4.4.4 |
| PolicyEngine-US | policyengine-us 1.691.10 |
| PolicyEngine-UK | policyengine-uk 2.88.16 |
| US data bundle | policyengine-us-data 1.113.1 |
| UK runtime bundle | policyengine-uk-data 1.55.5 |
| UK scenario source | enhanced_cps_2025.h5 |
| UK scenario-source repository | PolicyEngine/policyengine-uk-data |
| UK scenario-source pinned commit | 9514dfb7ec607897c9f7122a2e073b922c9fd8b6 |
| UK scenario-source SHA-256 | 199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866 |
| Households | 100 US and 100 UK |
| Models | 13 shared models |
| Output groups | 18 US and 7 UK |
| Condition | No tools, no web access, one structured response per household |
| Response contract | Numeric answer and non-empty explanation for every requested output |

Table 2: Model run configuration in the frozen manuscript snapshot.
| Model | PolicyBench ID | Provider ID | US parsed | UK parsed | Structured output | Decoding |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | claude-haiku-4.5 | claude-haiku-4-5-20251001 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Claude Opus 4.7 | claude-opus-4.7 | claude-opus-4-7 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Claude Sonnet 4.6 | claude-sonnet-4.6 | claude-sonnet-4-6 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3 Flash Preview | gemini-3-flash-preview | gemini/gemini-3-flash-preview | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3.1 Flash Lite Preview | gemini-3.1-flash-lite-preview | gemini/gemini-3.1-flash-lite-preview | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3.1 Pro Preview | gemini-3.1-pro-preview | gemini/gemini-3.1-pro-preview | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3.5 Flash | gemini-3.5-flash | gemini/gemini-3.5-flash | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| GPT 5.4 mini | gpt-5.4-mini | gpt-5.4-mini | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| GPT 5.4 nano | gpt-5.4-nano | gpt-5.4-nano | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| GPT 5.5 | gpt-5.5 | gpt-5.5 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Grok 4.1 Fast | grok-4.1-fast | xai/grok-4-1-fast-non-reasoning | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Grok 4.20 | grok-4.20 | xai/grok-4.20-reasoning | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Grok 4.3 | grok-4.3 | xai/grok-4.3 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |

Data and scenario construction

United States

The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit, single-family, single-Supplemental Poverty Measure (SPM)-unit structure with at least one adult and a supported filing status. The 2024 Enhanced CPS source contains 41,314 households; 30,173 (73.0%) pass the filter and form the eligible draw. The 27.0% excluded by the filter include multi-tax-unit households (e.g., adult roommates), multi-family households, multi-SPM-unit households, and households whose head reports a filing status outside the supported set. These excluded compositions are exactly the kind of cases where federal/state credit allocations and benefit-unit rules become hardest, so the eligible draw is a tractable subset rather than the full distribution of US households. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives. Filing status is not stated in the prompt; the reference computation infers it from tax-unit role flags. Models therefore see the same household facts that drive the reference filing-status assignment, but they do not receive that assignment as a label.

The current US release requests 18 scored output groups spanning federal income tax, refundable credits, payroll and self-employment tax, state and local income tax, Supplemental Nutrition Assistance Program (SNAP), Supplemental Security Income (SSI), Temporary Assistance for Needy Families (TANF), school-meal eligibility, and person-level coverage eligibility for the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), Medicaid, the Children’s Health Insurance Program (CHIP), Medicare, Head Start, and Early Head Start. The source run also requested the ACA Premium Tax Credit (PTC), but explanation audits showed that the prompt could be misleading when households lacked plan-specific Marketplace information. We therefore preserve those raw responses but exclude PTC from the canonical scored leaderboard until the prompt contract is revised. Local income tax is retained as a displayed requested output, but it receives zero default population-impact weight in this snapshot because the full Enhanced CPS source used for weighting has no positive modeled local-income-tax records.

The output scope is intentionally narrower than the full PolicyEngine model. Table 3 summarizes the inclusion rule. The benchmark asks for WIC eligibility rather than a WIC dollar amount; WIC dollar values are used only as impact-weight proxies for coverage flags, not as requested model outputs.

Table 3: Output-selection rationale.
| Scope decision | Rationale |
| --- | --- |
| Included | Direct tax, credit, benefit, health-support, and coverage outputs that a household-facing model could plausibly be asked to estimate from household facts. |
| Excluded | Intermediate tax bases, payroll subcomponents, and outputs that mainly require unavailable history, restricted local market data, restricted program-administration data, or take-up assignment rather than rule calculation. |
| Binary coverage outputs | Requested as 0/1 eligibility flags and scored as classification tasks; their dollar values are used only as impact-weight proxies, not as requested model outputs. |
| WIC | The benchmark asks for person-level WIC eligibility. It does not ask models to estimate a WIC dollar amount. |

United Kingdom

The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. The resulting enhanced_cps_2025.h5 artifact is checked in to the public PolicyEngine/policyengine-uk-data GitHub repository; the manuscript pins commit 9514dfb7ec607897c9f7122a2e073b922c9fd8b6 so that a third party can retrieve the exact file used here. The artifact contains 28,532 households; 28,502 (99.9%) pass the eligibility filter that retains households with one benefit unit and one or two adults. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.

The UK data path is more synthetic than the enhanced FRS pipeline and inherits limitations from cross-country transfer, calibration choices, and the subset of variables that can be made prompt-visible. It supports the current public cross-country benchmark, but it is not equivalent to an enhanced-FRS-based benchmark and should not be used to make population-representative claims about UK households (Sutherland and Figari 2013).

Reference-output credibility

PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10’s data science team adapted PolicyEngine’s open-source microsimulation model for experimental policy simulation, with validation against external projections before use (Woodruff 2026; Ghenis 2026). In the US, PolicyEngine (2024) reports matching the National Bureau of Economic Research (NBER) TAXSIM-35 model (Feenberg and Coutts 1993) to the cent on the vast majority of cases for the 2021 tax year across hundreds of thousands of tax units per state, with state-specific differences documented in the integration tests. We do not restate that comparison as a single percentage because the published source uses qualitative phrasing rather than a headline accuracy number. PolicyEngine has also signed a memorandum of understanding (Ghenis and Makarchuk 2025) with the Federal Reserve Bank of Atlanta for future validation work against its Policy Rules Database (2026). The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.

This does not make PolicyEngine infallible. During benchmark development, we performed a manual, developer-led discrepancy review, with LLM assistance used to triage and summarize candidate cases surfaced by model explanations. Table 4 summarizes the main discrepancy classes reviewed before the frozen snapshot. In those reviewed classes, discrepancies reflected model mistakes, prompt ambiguity later corrected, or upstream data/model issues fixed before the frozen snapshot; the reviewed discrepancy classes did not identify unresolved PolicyEngine reference-output defects.

After freezing the snapshot and completing response-contract repairs, we annotated every wrong model-output row and every scenario-output case with at least one wrong row. Table 5 reports that all 6,771 rows receiving less than full score have row-level and case-level annotations. Every one of those rows has a final row-level source of llm_error; no frozen-snapshot wrong row remains classified as a prompt ambiguity, unresolved reference-model issue, unresolved reference-data issue, parse-contract failure, or needs-review item. The audit is still developer-led rather than independent external validation, but it is exhaustive over scored misses in the repaired frozen snapshot.

Table 4: Development discrepancy review before the frozen snapshot.
| Discrepancy class | Review outcome |
| --- | --- |
| US federal credits | Reviewed cases where explanations omitted EITC or refundable CTC; discrepancies reflected model credit-treatment errors, not identified PolicyEngine errors. |
| US SNAP, SSI, and Medicaid | Reviewed false-zero and false-positive explanations; common model failures used broad asset or income heuristics inconsistent with the supplied facts and the reference rules. |
| US payroll and overtime | Reviewed overtime-related cases during development; upstream data/model fixes and prompt clarifications were applied before the frozen snapshot. |
| UK Child Benefit | Reviewed HICBC-related misses; the benchmark now asks for gross Child Benefit before HICBC, with HICBC included in Income Tax. |
| UK National Insurance | Reviewed cases where models subtracted pension contributions or used wrong thresholds/rates; no reference-output defect was identified in the reviewed cases. |
| UK UC, Pension Credit, and PIP | Reviewed positive-case misses and false positives; explanations usually showed broad capital, age, or award-level heuristics rather than simulator-reference defects. |

Table 5: Exhaustive annotation coverage for scored misses in the frozen snapshot.
| Country | Wrong rows audited | LLM response errors | Unresolved prompt/reference rows | Missing row annotations | Missing case notes |
| --- | --- | --- | --- | --- | --- |
| US | 4039 | 4039 | 0 | 0 | 0 |
| UK | 2732 | 2732 | 0 | 0 | 0 |
| Total | 6771 | 6771 | 0 | 0 | 0 |

Results

United States leaderboard

The US leaderboard for the frozen manuscript snapshot is shown in Table 6. The top three models in that snapshot are GPT 5.5 (79.4 within-1%, 76.9 exact, 93.5 bounded), Grok 4.20 (77.6 within-1%, 76.7 exact, 92.3 bounded), and Gemini 3.1 Pro Preview (76.9 within-1%, 75.7 exact, 91.8 bounded).

Table 6: Top US benchmark models in the frozen manuscript snapshot.
| Model | Within 1% (%) | Exact (%) | Within 10% (%) | Bounded score | Parsed | Total |
| --- | --- | --- | --- | --- | --- | --- |
| GPT 5.5 | 79.4 | 76.9 | 88.9 | 93.5 | 2088 | 2088 |
| Grok 4.20 | 77.6 | 76.7 | 87.4 | 92.3 | 2088 | 2088 |
| Gemini 3.1 Pro Preview | 76.9 | 75.7 | 84.6 | 91.8 | 2088 | 2088 |
| Claude Sonnet 4.6 | 76.8 | 76.6 | 86.6 | 91.5 | 2088 | 2088 |
| Gemini 3 Flash Preview | 76.2 | 75.3 | 83.9 | 90.1 | 2088 | 2088 |

United Kingdom leaderboard

The UK leaderboard for the frozen manuscript snapshot is shown in Table 7. The top three models in that snapshot are GPT 5.5 (59.0 within-1%, 53.1 exact, 90.9 bounded), Gemini 3.1 Pro Preview (57.1 within-1%, 53.0 exact, 91.1 bounded), and Claude Sonnet 4.6 (56.5 within-1%, 52.3 exact, 89.6 bounded).

Table 7: Top UK benchmark models in the frozen manuscript snapshot.
| Model | Within 1% (%) | Exact (%) | Within 10% (%) | Bounded score | Parsed | Total |
| --- | --- | --- | --- | --- | --- | --- |
| GPT 5.5 | 59.0 | 53.1 | 82.4 | 90.9 | 700 | 700 |
| Gemini 3.1 Pro Preview | 57.1 | 53.0 | 82.2 | 91.1 | 700 | 700 |
| Claude Sonnet 4.6 | 56.5 | 52.3 | 78.2 | 89.6 | 700 | 700 |
| Gemini 3.5 Flash | 56.2 | 52.4 | 77.8 | 89.2 | 700 | 700 |
| Grok 4.20 | 55.7 | 52.8 | 80.9 | 90.0 | 700 | 700 |

Simple baselines

Simple baselines help interpret score levels in a zero-heavy benchmark. Table 8 reports an always-zero response and a median-reference-by-output response on the same frozen household sample. These are not model competitors; they show how much of either metric can be earned without household-specific policy calculation. They also indicate that neither the within-1% headline nor the exact-match deployability bar is trivial to clear beyond the always-zero baseline, because the remaining credit must come from positive-reference cases.

Table 8: Simple baseline scores on the frozen manuscript snapshot.
| Baseline | US score | UK score |
| --- | --- | --- |
| Always zero | 62.3 | 54.0 |
| Median reference by output | 58.6 | 51.1 |
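The two baseline responders can be sketched as below. The function names and the toy reference rows are hypothetical, and the sketch assumes the median is taken per output over all sampled households' reference values, which the text does not spell out.

```python
from statistics import median


def always_zero_baseline(output_names):
    """Predict zero for every requested output."""
    return {name: 0.0 for name in output_names}


def median_by_output_baseline(reference_rows):
    """Predict, for each output, the median of its reference values across households."""
    values_by_output = {}
    for row in reference_rows:
        for name, value in row.items():
            values_by_output.setdefault(name, []).append(value)
    return {name: median(values) for name, values in values_by_output.items()}


# Illustrative reference rows, one dict per household (not benchmark data).
reference_rows = [
    {"income_tax": 1000.0, "snap": 0.0},
    {"income_tax": 3000.0, "snap": 0.0},
    {"income_tax": 5000.0, "snap": 2400.0},
]
zero_prediction = always_zero_baseline(["income_tax", "snap"])
median_prediction = median_by_output_baseline(reference_rows)
print(zero_prediction, median_prediction)
```

For a zero-heavy output such as `snap` in this toy data, the per-output median is itself zero, so the two baselines coincide on that output.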

Sensitivity to benchmark view

The primary leaderboard gives each household equal weight within each country. Table 9 reports several alternative views, including the older equal-output-group score. The top model is stable across several country views, but lower ranks move across sensitivity views. Country scores are therefore best read with the weighting choice visible rather than collapsed into one cross-country ordering.

Table 9: Leaderboard sensitivity to alternative scoring views.
| Country | View | Rank 1 | Rank 2 | Rank 3 |
| --- | --- | --- | --- | --- |
| US | Within-1% headline | GPT 5.5 (79.4) | Grok 4.20 (77.6) | Gemini 3.1 Pro Preview (76.9) |
| US | Bounded score | GPT 5.5 (93.5) | Grok 4.20 (92.3) | Gemini 3.1 Pro Preview (91.8) |
| US | Equal-output-group score | GPT 5.5 (95.1) | Gemini 3.1 Pro Preview (94.6) | Grok 4.20 (94.5) |
| US | Amount outputs only | GPT 5.5 (93.8) | Grok 4.20 (92.0) | Gemini 3.1 Pro Preview (92.0) |
| US | Positive reference cases only | Claude Opus 4.7 (76.7) | GPT 5.5 (76.2) | Gemini 3.5 Flash (75.9) |
| US | Zero reference cases only | Gemini 3.1 Pro Preview (98.9) | Grok 4.20 (98.6) | GPT 5.5 (98.3) |
| UK | Within-1% headline | GPT 5.5 (59.0) | Gemini 3.1 Pro Preview (57.1) | Claude Sonnet 4.6 (56.5) |
| UK | Bounded score | Gemini 3.1 Pro Preview (91.1) | GPT 5.5 (90.9) | Grok 4.20 (90.0) |
| UK | Equal-output-group score | GPT 5.5 (95.4) | Gemini 3.1 Pro Preview (94.5) | Claude Sonnet 4.6 (93.9) |
| UK | Amount outputs only | GPT 5.5 (95.4) | Gemini 3.1 Pro Preview (94.5) | Claude Sonnet 4.6 (93.9) |
| UK | Positive reference cases only | GPT 5.5 (79.2) | Claude Sonnet 4.6 (75.0) | Gemini 3.5 Flash (71.2) |
| UK | Zero reference cases only | Gemini 3 Flash Preview (99.2) | Grok 4.1 Fast (99.2) | GPT 5.5 (98.8) |

The bounded score is a global-output-weighted average of row scores. Amount outputs use max(0, 1 − |pred − ref| / |ref|) when the reference is nonzero and 1{pred = 0} when the reference is zero. Binary outputs use exact 0/1 matching. The exact-match rate uses the same weighting and aggregation but replaces the amount row score with a strict indicator: 1{pred = ref} to the dollar. Each output group’s default household-impact weight is computed from a full weighting population, not from the 100 benchmark households: full PolicyEngine US Enhanced CPS for the US and full PolicyEngine UK enhanced FRS for the UK. This weighting source is separate from the UK benchmark scenario source, which remains the public calibrated transfer dataset. For each source household, the share is |ref_ij| / max(|household_net_income_i|, Σ_k |ref_ik|); shares are averaged with calibrated household weights and renormalized so output weights sum to one. The max(...) denominator anchors per-household shares to net income when net income is the dominant flow and falls back to the gross tax-benefit flow only when programs cancel each other out, so a $1 benefit to a high-earner contributes essentially zero weight and per-household shares never exceed one. Booleans carry weight through PolicyEngine’s paired per-capita value (for example medicaid_value), so eligibility calls are graded as accuracy but weighted by the dollar stake. Table 10 compares the bounded-score ranking against two opt-in alternatives shown for transparency: equal weighting (every output the same) and budget-weighted (each output’s share of total absolute reference dollars in the same full weighting population).
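The default weighting formula can be sketched as follows. The data layout (dicts with `weight`, `net_income`, and `refs` keys), the function names, and the toy households are hypothetical; the sketch only illustrates the share formula, the weighted averaging, and the final renormalization described above.

```python
def household_shares(household):
    """Per-output shares |ref_ij| / max(|net_income_i|, sum_k |ref_ik|) for one household."""
    gross_flow = sum(abs(v) for v in household["refs"].values())
    denominator = max(abs(household["net_income"]), gross_flow)
    if denominator == 0:
        # No income and no tax-benefit flows: this household contributes nothing.
        return {name: 0.0 for name in household["refs"]}
    return {name: abs(v) / denominator for name, v in household["refs"].items()}


def output_weights(households):
    """Average shares with calibrated household weights, then renormalize to sum to one."""
    weight_sum = sum(h["weight"] for h in households)
    averaged = {}
    for h in households:
        for name, share in household_shares(h).items():
            averaged[name] = averaged.get(name, 0.0) + h["weight"] * share / weight_sum
    total = sum(averaged.values())
    if total == 0:
        raise ValueError("no nonzero reference outputs in the weighting population")
    return {name: value / total for name, value in averaged.items()}


# Illustrative weighting population (not benchmark data).
population = [
    {"weight": 1.0, "net_income": 50000.0, "refs": {"income_tax": 5000.0, "snap": 0.0}},
    {"weight": 1.0, "net_income": 20000.0, "refs": {"income_tax": 0.0, "snap": 2000.0}},
]
weights = output_weights(population)
high_earner = {"weight": 1.0, "net_income": 200000.0, "refs": {"snap": 1.0}}
tiny_share = household_shares(high_earner)["snap"]
print(weights, tiny_share)
```

The `max` in the denominator is what keeps every per-household share at or below one, including households where taxes and benefits largely offset each other.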

Table 10: Top three models per country under the bounded secondary metric across three weightings (household-weighted, equal, and budget-weighted).
Country  Weighting                     Rank 1                         Rank 2                         Rank 3
US       Household-weighted (default)  GPT 5.5 (93.5)                 Grok 4.20 (92.3)               Gemini 3.1 Pro Preview (91.8)
UK       Household-weighted (default)  Gemini 3.1 Pro Preview (91.1)  GPT 5.5 (90.9)                 Grok 4.20 (90.0)
US       Equal weights                 GPT 5.5 (95.1)                 Grok 4.20 (94.5)               Gemini 3.1 Pro Preview (94.3)
UK       Equal weights                 GPT 5.5 (95.4)                 Gemini 3.1 Pro Preview (94.5)  Claude Sonnet 4.6 (93.9)
US       Budget-weighted               GPT 5.5 (91.0)                 Claude Opus 4.7 (87.9)         Claude Sonnet 4.6 (86.2)
UK       Budget-weighted               Gemini 3.1 Pro Preview (89.4)  GPT 5.5 (89.0)                 Claude Opus 4.7 (88.3)
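To make the difference between the household-impact and budget weightings concrete, the sketch below computes both from a toy weighting population. It is an illustration under stated assumptions rather than the benchmark's implementation: the `Household` record and all function names are hypothetical, and the budget weighting is assumed to apply the same calibrated household weights when totalling absolute reference dollars.

```python
from dataclasses import dataclass


@dataclass
class Household:
    weight: float            # calibrated household weight
    net_income: float        # household net income
    refs: dict[str, float]   # reference dollar value per output


def normalize(raw: dict[str, float]) -> dict[str, float]:
    """Rescale weights so they sum to one."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def household_impact_weights(households: list[Household]) -> dict[str, float]:
    """Weighted mean of |ref_ij| / max(|net_income_i|, sum_k |ref_ik|)."""
    outputs = sorted({name for h in households for name in h.refs})
    total_weight = sum(h.weight for h in households)
    raw = {name: 0.0 for name in outputs}
    for h in households:
        gross_flow = sum(abs(v) for v in h.refs.values())
        denom = max(abs(h.net_income), gross_flow)
        for name in outputs:
            share = abs(h.refs.get(name, 0.0)) / denom if denom > 0 else 0.0
            raw[name] += h.weight * share / total_weight
    return normalize(raw)


def budget_weights(households: list[Household]) -> dict[str, float]:
    """Each output's share of total weighted absolute reference dollars."""
    outputs = sorted({name for h in households for name in h.refs})
    raw = {name: sum(h.weight * abs(h.refs.get(name, 0.0)) for h in households)
           for name in outputs}
    return normalize(raw)


def equal_weights(outputs: list[str]) -> dict[str, float]:
    """Every output carries the same weight."""
    return {name: 1.0 / len(outputs) for name in outputs}
```

With one household paying 5,000 of income tax on 50,000 of net income and another receiving 2,000 of SNAP on 20,000 of net income, the household-impact weights come out equal (each flow is 10% of its household's income), while the budget weights favor income tax at 5/7 to 2/7. That gap is the kind of difference the sensitivity rows in Table 10 are meant to expose.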

Hardest benchmark targets

The lowest-scoring US variables are multi-step tax quantities and sparse positive-dollar benefits. As shown in Table 11, state income tax before refundable credits, federal income tax before refundable credits, SNAP, state refundable credits, and federal refundable credits are the hardest US outputs by bounded score in the frozen manuscript snapshot.

Table 11: Hardest US variables by bounded score (the secondary tracking metric, used here because both exact-match and within-1% rates on the hardest variables compress to single-digit percentages and don’t discriminate well).
Variable                                      Score  Within 1%  Exact  Within 10%
state_income_tax_before_refundable_credits    69.5   40.0       40.8   54.7
federal_income_tax_before_refundable_credits  74.7   39.2       37.2   55.0
snap                                          80.4   73.2       72.8   76.6
state_refundable_credits                      82.9   80.7       80.3   81.1
federal_refundable_credits                    85.6   76.1       75.5   79.7

The same pattern appears in the UK run. Income tax and National Insurance are the two hardest UK outputs, while zero-heavy benefits and capital gains tax score higher overall.

Table 12: Hardest UK variables by bounded score (the secondary tracking metric, used here because both exact-match and within-1% rates on the hardest variables compress to single-digit percentages and don’t discriminate well).
Variable            Score  Within 1%  Exact  Within 10%
income_tax          73.6   26.7       21.5   52.3
national_insurance  82.6   44.1       44.5   70.4
universal_credit    86.9   78.0       77.3   82.2
pension_credit      93.7   93.2       93.2   93.2
child_benefit       95.6   65.2       63.1   92.8

Zero and positive cases

Overall hit rates can overstate performance on sparse programs because correct zeros are common. Table 13 and Table 14 therefore report within-10% performance for all cases, positive-reference cases, and zero-reference cases. The gap is large for several outputs: models often identify that a program or tax does not apply, but miss the amount when the reference value is positive.
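The split described above can be sketched as follows. This is a hedged illustration, not the benchmark's code: the function names are hypothetical, zero-reference rows are assumed to count as hits only when the prediction is exactly zero (mirroring the bounded score), and every nonzero reference is treated as a "positive" case.

```python
import math


def within_tolerance(pred: float, ref: float, tol: float = 0.10) -> bool:
    """Hit test for one row; zero references require an exact zero (assumed)."""
    if ref == 0:
        return pred == 0
    return abs(pred - ref) <= tol * abs(ref)


def hit_rate(pairs: list[tuple[float, float]], tol: float = 0.10) -> float:
    """Percent of (pred, ref) pairs within tolerance; NaN for an empty group."""
    if not pairs:
        return math.nan
    hits = sum(within_tolerance(pred, ref, tol) for pred, ref in pairs)
    return 100.0 * hits / len(pairs)


def split_hit_rates(pairs: list[tuple[float, float]],
                    tol: float = 0.10) -> dict[str, float]:
    """Hit rates for all rows, nonzero-reference rows, and zero-reference rows."""
    positive = [(p, r) for p, r in pairs if r != 0]
    zero = [(p, r) for p, r in pairs if r == 0]
    return {
        "all": hit_rate(pairs, tol),
        "positive": hit_rate(positive, tol),
        "zero": hit_rate(zero, tol),
    }
```

In this sketch an output with no nonzero reference values returns NaN for its positive-case rate, which is one plausible reading of a NaN entry in a positive-cases column.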

Table 13: US within-10% performance by zero versus positive reference cases.
Variable                                      All cases  Positive cases  Zero cases
state_income_tax_before_refundable_credits    54.7       27.2            94.2
federal_income_tax_before_refundable_credits  55.0       30.3            93.7
snap                                          76.6       15.4            97.0
federal_refundable_credits                    79.7       24.2            94.4
state_refundable_credits                      81.1       4.9             99.0
payroll_tax                                   85.4       82.2            95.5
person_medicaid_eligible                      89.6       74.9            95.4
person_head_start_eligible                    92.3       84.6            92.5

Table 14: UK within-10% performance by zero versus positive reference cases.
Variable            All cases  Positive cases  Zero cases
income_tax          52.3       40.2            93.0
national_insurance  70.4       48.9            95.7
universal_credit    82.2       23.4            97.9
child_benefit       92.8       80.7            100.0
pension_credit      93.2       1.3             99.1
capital_gains_tax   93.8       24.0            99.8
pip                 99.0       NaN             99.0

Figure 1: Output-level performance on zero-reference and positive-reference cases. (Scatter plot comparing percent correct on zero-reference cases with percent correct on positive-reference cases by output group.)

Failure modes

The benchmark surfaces a few recurring failure patterns.

First, models miss positive tax and benefit quantities more often than zero cases. In the US, state income tax before refundable credits, federal income tax before refundable credits, SNAP, state refundable credits, and federal refundable credits are the lowest-scoring outputs by bounded score. These require the model to choose the right income concepts, exclusions, program thresholds, and sequencing before applying any final subtraction.

Second, the UK benchmark shows the same split between tax calculations and many benefit outputs. Income tax and National Insurance score below the benefit outputs in the frozen manuscript snapshot. Positive Universal Credit and Pension Credit cases remain difficult, so the result should not be read as a general claim that benefits are easy.

Third, joint accuracy across interacting components is lower than marginal accuracy on either component. Table 15 shows within-10% accuracy for federal_refundable_credits, state_refundable_credits, and the conjunction of both within the same household. The joint hit rate is consistently lower than either marginal hit rate, so leaderboard scores that average across outputs overstate how often a model gets a single household’s federal/state credit allocation jointly correct.

Table 15: US within-10% accuracy on federal vs state refundable credits and the household-level joint.
Model                          Federal within 10%  State within 10%  Joint within 10%
GPT 5.5                        94.0                82.0              77.0
Gemini 3.1 Pro Preview         88.0                83.0              76.0
Grok 4.20                      88.0                83.0              76.0
Claude Opus 4.7                87.0                78.0              71.0
Claude Sonnet 4.6              87.0                80.0              71.0
GPT 5.4 nano                   79.0                81.0              71.0
Gemini 3.5 Flash               78.0                81.0              69.0
Gemini 3 Flash Preview         77.0                81.0              66.0
Gemini 3.1 Flash Lite Preview  75.0                81.0              66.0
Grok 4.3                       75.0                81.0              66.0
GPT 5.4 mini                   73.0                81.0              65.0
Claude Haiku 4.5               70.0                81.0              62.0
Grok 4.1 Fast                  65.0                81.0              57.0
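The relationship between the marginal and joint columns in Table 15 can be sketched with per-household hit flags. The function name and input layout here are hypothetical; the sketch only assumes that each household contributes one within-10% flag for federal refundable credits and one for state refundable credits.

```python
def marginal_and_joint(federal_hits: list[bool],
                       state_hits: list[bool]) -> tuple[float, float, float]:
    """Return (federal %, state %, joint %) over the same households."""
    if len(federal_hits) != len(state_hits) or not federal_hits:
        raise ValueError("need one flag per household for both outputs")
    n = len(federal_hits)
    # A household counts toward the joint rate only if both outputs hit.
    joint_hits = [f and s for f, s in zip(federal_hits, state_hits)]
    return (
        100.0 * sum(federal_hits) / n,
        100.0 * sum(state_hits) / n,
        100.0 * sum(joint_hits) / n,
    )
```

By construction the joint rate cannot exceed the smaller marginal rate, and it cannot fall below federal + state − 100. The rows of Table 15 sit inside those bounds; for GPT 5.5, 76 ≤ 77 ≤ 82.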

Fourth, structured-output reliability is part of the benchmark contract. Missing or unparseable numeric values are not dropped. Appendix A documents parser recovery, bounded full-response retries, and row-level repairs that brought the canonical manuscript snapshot to zero missing numeric values or explanations while preserving the failed attempts.

Limitations

PolicyBench is not a substitute for a production tax-and-benefit calculator. Several caveats matter:

The current paper is therefore an evaluation of model performance under a specific structured-output benchmark, not a general certification of tax or benefit competence.

Conclusion

PolicyBench shows a consistent pattern across both countries. Models often identify non-applicability, but positive tax and benefit amounts remain difficult, especially for multi-step income tax, payroll tax, National Insurance, and positive benefit cases. In the frozen manuscript snapshot, GPT 5.5 is the top-scoring model on the headline within-1% metric in both the US and UK.

These results support a narrow conclusion: unaided frontier models still struggle to reproduce selected household-level microsimulation outputs under a structured public benchmark. They do not show that models cannot assist policy analysis, and they do not validate PolicyEngine outputs as administrative truth. They suggest that future evaluations should separate no-tool estimation, tool-using system design, and reference-output validation more explicitly.

Next steps are to expand country coverage, increase frozen sample sizes, improve UK data provenance, and add protected or rotating evaluation sets so that public rankings are less exposed to open-set leakage. The benchmark should also continue reporting sensitivity views, because country scores are useful summaries only when their weighting choices are visible.

Appendix A: Structured-output audit

The benchmark requires one numeric value and one explanation for every requested output. Parse failures are benchmark failures, not missing data to drop from the denominator. Before publishing the manuscript snapshot, we audited missing numeric values and explanations, extended the parser only where explicit variable-keyed value and explanation fields were recoverable, retried broken full responses, and then ran targeted row repairs for the remaining missing rows. Table 16 summarizes that sequence.

Table 16: Structured-output parser audit and repair sequence.
Step                          Finding
Initial parse-contract audit  The source run was audited for rows missing a parsed numeric value or non-empty explanation; those rows were not dropped from the denominator.
Parser repair                 The parser was extended to recover explicit `value` and non-empty `explanation` blocks from nested, escaped, or partially truncated provider JSON without scraping prose numbers.
Full-response retries         Three bounded retry rounds targeted broken country-model-household responses and accepted only fully valid replacement responses.
Row-level repairs             A final repair pass retried only rows still missing a parsed numeric value or non-empty explanation, using the same model, household, and output.
Final parse coverage          The repaired manuscript snapshot has zero missing parsed numeric values and zero missing explanations across all 36,244 canonical model-output rows.
Preservation rule             The snapshot retains response-retry and row-repair targets, attempts, accepted replacements, rejected rows, and merged prediction files.

The main full-response denominator is the country-model-household response: one model answering one household in one country. We first retried broken responses at that level, accepting a retry only when the whole response satisfied the numeric-and-explanation contract. Several outputs in the same response are mechanically related, so this full-response pass avoided mixing answer sets where possible. Table 17 summarizes the preserved retry rounds.

Table 17: Full-response retry rounds preserved with the paper snapshot.
Round    Country  Target responses  Accepted responses  Rejected responses  Estimated cost
Round 1  US       138               32                  106                 $19.03
Round 1  UK       184               36                  148                 $4.94
Round 2  US       9                 0                   9                   $0.70
Round 2  UK       32                0                   32                  $0.39
Round 3  US       106               2                   104                 $17.47
Round 3  UK       148               16                  132                 $4.28
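The all-or-nothing acceptance rule for full-response retries can be sketched as below. This is an illustrative assumption about the shape of a parsed response, not the benchmark's parser: it supposes a dictionary keyed by output name whose entries carry the `value` and `explanation` fields named in Table 16, and the function names are hypothetical.

```python
import math
from typing import Any


def row_is_valid(row: Any) -> bool:
    """A row needs a finite numeric value and a non-empty explanation."""
    if not isinstance(row, dict):
        return False
    value = row.get("value")
    explanation = row.get("explanation")
    has_number = isinstance(value, (int, float)) and math.isfinite(value)
    has_text = isinstance(explanation, str) and explanation.strip() != ""
    return has_number and has_text


def response_is_valid(response: dict[str, Any],
                      requested_outputs: list[str]) -> bool:
    """The whole response passes only if every requested output passes."""
    return all(row_is_valid(response.get(name)) for name in requested_outputs)


def accept_retry(original: dict[str, Any], retry: dict[str, Any],
                 requested_outputs: list[str]) -> dict[str, Any]:
    """Swap in the retry only when it satisfies the full contract."""
    if response_is_valid(retry, requested_outputs):
        return retry
    return original
```

Rejecting a partially valid retry, rather than merging its good rows into the original, keeps mechanically related outputs from the same answer set together, which is the rationale the full-response pass gives for operating at the response level.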

The full-response pass did not eliminate every missing row. The final row-level repair pass retried the same model on the same household-output target and accepted only rows with both a parsed numeric value and non-empty explanation. These repairs are part of the canonical manuscript snapshot, not a separate leaderboard condition. Table 18 reports the row-level repair counts and confirms that the final repaired prediction files have zero missing output rows.

Table 18: Row-level repairs preserved with the paper snapshot.
Round    Country  Target rows  Accepted row repairs  Rejected row repairs  Final missing rows  Estimated cost
Round 1  US       658          658                   0                     0                   $2.82
Round 1  UK       412          412                   0                     0                   $1.15

Competing interests

The author is affiliated with PolicyEngine, which develops the microsimulation software used to produce the benchmark reference outputs.

References

Abbood, Auss, Zaiqiao Meng, and Nigel Collier. 2025. “Time to Revisit Exact Match.” Findings of the Association for Computational Linguistics: EMNLP 2025 (Suzhou, China), 11903–26. https://doi.org/10.18653/v1/2025.findings-emnlp.637.
Bock, Michael R., Kara Molisee, Zachary Ozer, and Sumit Shah. 2025. “TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task.” arXiv Preprint arXiv:2507.16126, ahead of print. https://doi.org/10.48550/arXiv.2507.16126.
Chen, Zhiyu, Wenhu Chen, Charese Smiley, et al. 2021. “FinQA: A Dataset of Numerical Reasoning over Financial Data.” arXiv Preprint arXiv:2109.00122. https://arxiv.org/abs/2109.00122.
Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, et al. 2021. “Training Verifiers to Solve Math Word Problems.” arXiv Preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168.
Federal Reserve Bank of Atlanta. 2026. Policy Rules Database. Federal Reserve Bank of Atlanta. https://www.atlantafed.org/what-we-study/workforce-development/advancing-careers-for-low-income-families/policy-rules-database.
Feenberg, Daniel R., and Elisabeth Coutts. 1993. “An Introduction to the TAXSIM Model.” Journal of Policy Analysis and Management 12 (1): 189–94.
Ghenis, Max. 2026. PolicyEngine Powers Rapid Policy Analysis at No 10 Downing Street. PolicyEngine. https://www.policyengine.org/us/research/policyengine-10-downing-street.
Ghenis, Max, and Pavel Makarchuk. 2025. PolicyEngine and Atlanta Fed Sign Memorandum of Understanding for Policy Rules Database Validation. PolicyEngine. https://www.policyengine.org/us/research/policyengine-atlanta-fed-mou-prd.
Guha, Neel, Julian Nyarko, Daniel E. Ho, et al. 2023. “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models.” arXiv Preprint arXiv:2308.11462. https://arxiv.org/abs/2308.11462.
Hendrycks, Dan, Collin Burns, Saurav Kadavath, et al. 2021. “Measuring Mathematical Problem Solving with the MATH Dataset.” NeurIPS Datasets and Benchmarks Track. https://arxiv.org/abs/2103.03874.
Holzenberger, Nils, Benjamin Van Durme, Sarah Lawsky, and Kyle Richardson. 2021. “Factoring Statutory Reasoning as Language Understanding Challenges.” arXiv Preprint arXiv:2105.07903. https://arxiv.org/abs/2105.07903.
Nava Labs. 2025. Experimenting with AI-Powered Tools in Public Benefits. Case study. https://www.navapbc.com/case-studies/ai-tools-public-benefits.
Nava Labs. 2026. Evaluating a GenAI-Powered Assistive Chatbot for Caseworkers. Case study. https://www.navapbc.com/case-studies/evaluating-ai-assistive-chatbot-caseworkers.
PolicyEngine. 2024. PolicyEngine Launches State Income Tax Modeling Nationwide. PolicyEngine. https://www.policyengine.org/us/research/state-tax-model-beta.
Shanahan, Catherine, Emma McCarthy, Yan Zhao, et al. 2025. “Performance of LLMs on VITA Test: Potential for AI-Assisted Tax Returns for Low Income Taxpayers.” Artificial Intelligence and Law, ahead of print. https://doi.org/10.1007/s10506-025-09465-7.
Shorten, Connor, Charles Pierse, Thomas Benjamin Smith, et al. 2024. “StructuredRAG: JSON Response Formatting with Large Language Models.” arXiv Preprint arXiv:2408.11061, ahead of print. https://doi.org/10.48550/arXiv.2408.11061.
Sutherland, Holly, and Francesco Figari. 2013. “EUROMOD: The European Union Tax-Benefit Microsimulation Model.” International Journal of Microsimulation 6 (1): 4–26. https://doi.org/10.34196/ijm.00075.
Woodruff, Nikhil. 2026. Informing Policy Using Micro-Simulations. No10 Innovation Fellowship. https://fellows.ai.gov.uk/articles/nikhil-woodruff-micro-simulation.
ZenML LLMOps Database. 2025. Building and Automating Comprehensive LLM Evaluation Framework for SNAP Benefits. Industry writeup. https://www.zenml.io/llmops-database/building-and-automating-comprehensive-llm-evaluation-framework-for-snap-benefits.
Zhou, Ruiwen, Wenyue Hua, Liangming Pan, et al. 2025. “RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vienna, Austria), 550–72. https://doi.org/10.18653/v1/2025.acl-long.27.
Zhu, Fengbin, Wenqiang Lei, Youcheng Huang, et al. 2021. “TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance.” arXiv Preprint arXiv:2105.07624. https://arxiv.org/abs/2105.07624.