Benchmarking no-tool tax-and-benefit estimation in frontier language models
Max Ghenis
Abstract
PolicyBench evaluates whether frontier language models can estimate household tax and benefit outputs from household facts without tools. The benchmark covers the United States and the United Kingdom, uses sampled US Enhanced Current Population Survey (CPS)-derived scenarios and a public UK calibrated transfer dataset, and reports three complementary scores: a within-1% hit rate as the headline ranking metric, an exact-match rate as the deployability bar (predictions are credited only if they match the PolicyEngine reference to the dollar or to the eligibility flag), and a continuous bounded score that awards partial credit for close answers. The headline metric is within-1% rather than exact match because the panels are zero-inflated: 71% of UK reference outputs and 82% of US reference outputs are exact zeros. Zero inflation compresses top-of-leaderboard exact-match rates near the share of zero cases a hedging model can predict and makes within-1% the more discriminating ranking on the positive-reference cases that drive quality differences. The manuscript snapshot is a 100-household-per-country public preview, not a protected held-out leaderboard: the current public scenario explorer exposes prompts and reference outputs, so open-set leakage is a central limitation of any public ranking. In the frozen snapshot used for this manuscript (2026-05-20), GPT 5.5 is top-scoring on within-1% in both country tables at 79.4 in the US and 59.0 in the UK. In both countries, multi-step tax quantities and positive-dollar benefit cases are materially harder than zero cases. The live benchmark is available at https://policybench.org.
Introduction
Household tax-and-benefit estimation sits between arithmetic and policy discussion. Each case has numeric labels, but generating those labels requires filing status, household composition, income concepts, program thresholds, and jurisdiction-specific rules. PolicyBench measures whether models can map household records to those outputs.
Most current language-model evaluations do not test this mapping directly. General math benchmarks emphasize symbolic manipulation and exact final answers (Cobbe et al. 2021; Hendrycks et al. 2021), while generic question-answering benchmarks emphasize recall or instruction following. PolicyBench instead asks whether models can transform household facts into policy outputs without access to tools or simulators.
This matters for public-facing calculators, analyst workflows, and tax-preparation or screening systems. The benchmark is intended to separate verbal policy fluency from household-level quantitative prediction.
Related work
The most relevant prior work is tax and statutory reasoning. SARA frames statutory interpretation, including tax-relevant rule application, as a language-understanding problem (Holzenberger et al. 2021). LegalBench broadens this view and includes tax-oriented numeric tasks such as sara_numeric (Guha et al. 2023). RuleArena evaluates rule-guided reasoning in regulation-style settings (Zhou et al. 2025), and TaxCalcBench evaluates frontier models on tax calculation from structured return-like inputs (Bock et al. 2025). Shanahan et al. (2025) report a similar split on the Volunteer Income Tax Assistance (VITA) test: models do better on tax knowledge questions than on open-ended calculation tasks.
PolicyBench also sits within a broader literature on quantitative reasoning benchmarks. Canonical math evaluations such as GSM8K and MATH normalized exact final-answer scoring for numeric tasks (Cobbe et al. 2021; Hendrycks et al. 2021). Recent work has questioned pure exact-match scoring in numeric QA and temporal reasoning, arguing for error-aware metrics instead (Abbood et al. 2025). PolicyBench treats this as a measurement design choice rather than a forced pick: it leads with a within-1% hit rate as the headline ranking metric, retains the exact-match rate as the deployability bar (a tax filer or benefits caseworker can’t ship “close”), and reports a bounded continuous score that awards relative-error partial credit as a secondary tracking diagnostic. The headline choice is driven by zero inflation in the benchmark panels — particularly in the UK, where 71% of reference outputs are exact zeros — which compresses exact-match leaderboards near the share of zero cases a hedging model can predict and gives within-1% more ranking power on the positive-reference cases.
Finance-domain benchmarks provide another relevant precedent. FinQA and TAT-QA both show that realistic quantitative reasoning often requires combining structured numbers with domain knowledge rather than solving stylized school-math problems (Chen et al. 2021; Zhu et al. 2021). PolicyBench differs in that its reference outputs are generated by executable tax-benefit microsimulation rather than annotated document questions, but the underlying motivation is similar: domain-specific quantitative reasoning deserves dedicated evaluation.
Finally, PolicyBench depends on two infrastructure literatures that are not themselves large language model (LLM) benchmarks. One is structured-output reliability, since the benchmark relies on multi-output JSON responses and parse coverage rather than free-form prose (Shorten et al. 2024). The other is tax-benefit microsimulation, where systems such as EUROMOD provide the methodological precedent for evaluating policy rules over household microdata (Sutherland and Figari 2013). There is also operational work on applying and evaluating AI systems in public-benefits settings, including caseworker-assist and Supplemental Nutrition Assistance Program (SNAP)-focused evaluations (Nava Labs 2026, 2025; ZenML LLMOps Database 2025). PolicyBench combines these strands into a public cross-model benchmark over household-level tax and benefit outputs.
Benchmark design
PolicyBench asks models to predict all benchmark outputs for a household in a single structured response. One response per household reduces repeated prompt cost and keeps the task at the household-return level rather than turning it into a sequence of unrelated one-output calls.
Primary metric: within 1%
The primary headline metric is the within-1% hit rate: the share of requested outputs the model gets within one percent of the PolicyEngine reference value. For currency amounts, “within 1%” means |pred − ref| ≤ 0.01 × |ref| when the reference is nonzero and |pred| ≤ 1 (the same one-currency-unit absolute tolerance used by exact match) when the reference is zero. Binary outputs are requested as integer 0/1 eligibility flags and scored identically to the exact-match metric.
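The hit rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's own scoring code; the function names `hit_within` and `hit_binary` and their parameters are hypothetical.

```python
def hit_within(pred: float, ref: float, rel_tol: float = 0.01,
               zero_tol: float = 1.0) -> bool:
    """Threshold hit for a currency amount (hypothetical helper name)."""
    if ref == 0:
        # Zero reference: use the one-currency-unit absolute tolerance.
        return abs(pred) <= zero_tol
    # Nonzero reference: compare absolute error against rel_tol * |ref|.
    return abs(pred - ref) <= rel_tol * abs(ref)


def hit_binary(pred: int, ref: int) -> bool:
    """Eligibility flags score only on exact 0/1 agreement."""
    return pred in (0, 1) and pred == ref


print(hit_within(1005.0, 1000.0))  # inside the 1% band
print(hit_within(1011.0, 1000.0))  # outside the 1% band
print(hit_within(0.5, 0.0))        # zero reference, within one unit
```

Under the same assumptions, passing `rel_tol=0.05` or `rel_tol=0.10` would give the wider threshold diagnostics, since those are described as using the same zero-reference tolerance.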
Within-1% is the headline because the benchmark panels are zero-inflated. In the frozen manuscript snapshot, 71% of UK reference outputs and 82% of US reference outputs are exact zeros — most sampled households are not eligible for any given program, owe no capital gains tax, do not phase into a refundable credit, and so on. A model that hedges to zero on every requested output therefore earns full exact credit on every zero case, which compresses top-of-leaderboard exact-match rates near the share of zero cases the panel contains. In the UK panel this compression is severe: the top seven models in the manuscript snapshot are tied within 0.6 percentage points on exact match but span 4.6 percentage points on within-1%, because within-1% credits close-but-not-exact predictions on the 29% of positive-reference cases where rule comprehension is actually tested. Within-1% inherits the same one-currency-unit tolerance on zeros, so a hedging baseline gets no extra credit from the relaxed threshold; it is the partial credit on positive-reference cases that yields the discriminating signal.
The within-1% rate is aggregated to a household score and then to a country score, with the same population-derived output weights that the secondary bounded score uses. Equal household weight within each country preserves comparability across scenarios and slices. We report US and UK leaderboards separately rather than collapsing them into a combined cross-country rank.
Deployability bar: exact match
PolicyBench retains the exact-match rate as the deployability bar: the share of requested outputs the model gets right to the dollar (or, for eligibility flags, to the boolean). A tax filer, a benefit estimator, or a caseworker cannot ship “close” — a prediction off by $50 is no more usable than one off by $500, because neither matches the bottom line a downstream system or a household needs. For currency amounts, “exact” means within 1 currency unit of the reference value after numeric parsing. For binary outputs, “exact” means the parsed value is exactly the same 0/1 flag as the reference.
The exact-match rate uses the same per-output weighting as the primary within-1% metric. It is reported alongside the headline in every leaderboard table and remains the bar a production-facing system must clear. We report it as the deployability bar rather than the headline ranking because in zero-inflated panels it is dominated by zero-case behavior and discriminates less well between models that understand the policy rules and models that simply hedge to zero.
Secondary metric: bounded continuous score
PolicyBench also reports a bounded 0-100 score that awards smooth partial credit for close-but-not-exact amount answers. This is informative for tracking conceptual progress year over year while exact rates remain low, and it captures the qualitative gap between a model that is consistently close and one that is consistently far from the reference value. For amount variables, the row-level bounded score is max(0, 1 - |pred - ref| / |ref|) when the reference value is nonzero and 1{pred = 0} when the reference value is zero. For binary outputs, it uses exact 0/1 matching. The exact, within-1%, within-5%, and within-10% hit rates are reported separately as threshold diagnostics; they are not averaged to create the bounded score.
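As a worked illustration of the row-level formula, the sketch below implements the amount rule exactly as stated in the paragraph above and rescales it to 0-100. The function name `bounded_row_score` is hypothetical, and the strict `pred == 0` test on zero references follows the text as written.

```python
def bounded_row_score(pred: float, ref: float) -> float:
    """Row-level bounded score on a 0-100 scale for an amount output."""
    if ref == 0:
        # Zero reference: indicator 1{pred = 0}, as stated in the text.
        return 100.0 if pred == 0 else 0.0
    # Nonzero reference: linear partial credit, floored at zero.
    return 100.0 * max(0.0, 1.0 - abs(pred - ref) / abs(ref))


print(bounded_row_score(900.0, 1000.0))   # 10% relative error
print(bounded_row_score(2500.0, 1000.0))  # error larger than the reference
print(bounded_row_score(0.0, 0.0))        # correct zero
```

A prediction that overshoots by more than the reference itself earns no credit, which is what keeps the score bounded below at zero.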
PolicyBench also tracks mean absolute error and related error metrics. These remain secondary diagnostics under the primary within-1% metric, the deployability-bar exact-match rate, and the bounded score. The three-metric design follows recent numeric-evaluation literature that questions pure exact-match scoring (Abbood et al. 2025) without abandoning exactness as the production bar.
All requested US outputs are annual amounts or annual eligibility indicators for tax year 2026; UK outputs are annual amounts or annual eligibility indicators for fiscal year 2026-27. For currency amounts, the “exact” hit rate means within 1 currency unit of the reference value after numeric parsing. Percentage-threshold hit rates use relative error when the reference value is nonzero and the same 1 currency-unit absolute tolerance when the reference value is zero. Binary outputs are parsed as numeric 0/1 eligibility flags and scored by exact classification accuracy. Missing, unparseable, or non-0/1 answers receive zero score for that requested output.
Aggregation proceeds in three steps. First, each household-output prediction receives a 0-100 score. Second, each household receives one model score: requested output rows are weighted by output-group weights constructed from the full weighting population, then renormalized within the household over outputs that are actually requested for that household. Person-level coverage flags are scored at the person row; their output-group weight is split across the relevant people in the household and is based on PolicyEngine value proxies where available. Third, country scores average those household scores with equal household weight. The paper reports US and UK leaderboards separately rather than averaging them into a combined cross-country rank. Equal-output-group scores and other alternative views are reported as sensitivity checks.
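The three aggregation steps can be sketched as follows. This is an illustrative simplification, not the benchmark's implementation: the function names are hypothetical, person-level weights are split equally across person rows (the text says the split uses PolicyEngine value proxies where available), and households whose requested outputs carry zero total weight are skipped.

```python
from collections import Counter


def household_score(rows, group_weights):
    """rows: list of (output_group, row_score_0_100) for one household."""
    # Assumption: a group's weight is shared equally by its rows, so
    # person-level flags split the group weight across people.
    counts = Counter(group for group, _ in rows)
    weighted = [(group_weights.get(g, 0.0) / counts[g], s) for g, s in rows]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return None  # Assumption: no weighted outputs were requested.
    # Renormalize over the outputs actually requested for this household.
    return sum(w * s for w, s in weighted) / total


def country_score(households, group_weights):
    """Equal-household-weight average of household scores."""
    scores = [household_score(rows, group_weights) for rows in households]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


weights = {"income_tax": 0.6, "snap": 0.3, "medicaid": 0.1}
households = [
    [("income_tax", 100.0), ("snap", 0.0)],
    [("income_tax", 50.0), ("medicaid", 100.0), ("medicaid", 0.0)],
]
print(country_score(households, weights))
```

In this toy example the first household is scored only over income tax and SNAP, so its weights are renormalized to sum to one before averaging.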
The benchmark requires each model response to include numeric answers and one explanation per requested output. All three scores (within-1%, exact match, and the bounded score) use only the numeric answers. Explanations are retained for scenario exploration and qualitative error analysis; they should not be interpreted as faithful traces of model reasoning.
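A response-contract check of this kind might look like the sketch below. The JSON layout (a top-level object keyed by output name, each entry holding `value` and `explanation`) and the function name `contract_violations` are assumptions made for illustration; the actual schema is not specified in this section.

```python
import json
import math


def contract_violations(raw: str, requested: list[str]) -> dict[str, str]:
    """Map each requested output that breaks the contract to a reason."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {name: "unparseable response" for name in requested}
    if not isinstance(payload, dict):
        return {name: "unparseable response" for name in requested}
    problems = {}
    for name in requested:
        entry = payload.get(name)
        if not isinstance(entry, dict):
            problems[name] = "missing output"
            continue
        value = entry.get("value")
        # Assumption: JSON booleans and strings do not count as numeric.
        numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not numeric or (isinstance(value, float) and not math.isfinite(value)):
            problems[name] = "non-numeric value"
        elif not str(entry.get("explanation", "")).strip():
            problems[name] = "empty explanation"
    return problems


raw = json.dumps({
    "snap": {"value": 0, "explanation": "Income exceeds the limit."},
    "income_tax": {"value": "1,200", "explanation": "Standard deduction."},
})
print(contract_violations(raw, ["snap", "income_tax", "ssi"]))
```

Outputs flagged here would correspond to the missing or unparseable answers that receive zero score for that requested output.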
Because explanations are required, the canonical task measures policy estimation under a public-facing structured-response contract, not isolated arithmetic accuracy. Prompt fairness is part of the benchmark contract. The current release uses one prompt template per country, with no model-specific tuning. Models receive the same household facts, the same requested outputs, and no web or tool access. The prompt states that unlisted numeric inputs are 0, unlisted boolean or status facts are false, and household characteristics are constant over the tax-benefit year. Provider-specific differences are limited to the response transport needed to obtain structured output.
Frozen snapshot and open-set status
Leaderboard positions are version-sensitive, so manuscript claims refer to the frozen source-run exports in Table 1 rather than to the live site. The committed source-run exports are the manuscript artifacts. The public site exposes the current prompts, predictions, explanations, and reference outputs for transparency. This makes the public leaderboard open-set: models, model providers, or benchmark users could learn from the released cases before later runs. Protected leaderboard claims would require a separate held-out or rotating set.
No tools, no web access, one structured response per household
Response contract: Numeric answer and non-empty explanation for every requested output
Table 2: Model run configuration in the frozen manuscript snapshot.
| Model | PolicyBench ID | Provider ID | US parsed | UK parsed | Structured output | Decoding |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | claude-haiku-4.5 | claude-haiku-4-5-20251001 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Claude Opus 4.7 | claude-opus-4.7 | claude-opus-4-7 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Claude Sonnet 4.6 | claude-sonnet-4.6 | claude-sonnet-4-6 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3 Flash Preview | gemini-3-flash-preview | gemini/gemini-3-flash-preview | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3.1 Flash Lite Preview | gemini-3.1-flash-lite-preview | gemini/gemini-3.1-flash-lite-preview | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3.1 Pro Preview | gemini-3.1-pro-preview | gemini/gemini-3.1-pro-preview | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Gemini 3.5 Flash | gemini-3.5-flash | gemini/gemini-3.5-flash | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| GPT 5.4 mini | gpt-5.4-mini | gpt-5.4-mini | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| GPT 5.4 nano | gpt-5.4-nano | gpt-5.4-nano | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| GPT 5.5 | gpt-5.5 | gpt-5.5 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Grok 4.1 Fast | grok-4.1-fast | xai/grok-4-1-fast-non-reasoning | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Grok 4.20 | grok-4.20 | xai/grok-4.20-reasoning | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
| Grok 4.3 | grok-4.3 | xai/grok-4.3 | 2,088/2,088 | 700/700 | provider structured-output schema; no external tool | provider default; no model-specific prompt tuning |
Data and scenario construction
United States
The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit, single-family, single-Supplemental Poverty Measure (SPM)-unit structure with at least one adult and a supported filing status. The 2024 Enhanced CPS source contains 41,314 households; 30,173 (73.0%) pass the filter and form the eligible draw. The 27.0% excluded by the filter include multi-tax-unit households (e.g., adult roommates), multi-family households, multi-SPM-unit households, and households whose head reports a filing status outside the supported set. These excluded compositions are exactly the kind of cases where federal/state credit allocations and benefit-unit rules become hardest, so the eligible draw is a tractable subset rather than the full distribution of US households. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives. Filing status is not stated in the prompt; the reference computation infers it from tax-unit role flags. Models therefore see the same household facts that drive the reference filing-status assignment, but they do not receive that assignment as a label.
The current US release requests 18 scored output groups spanning federal income tax, refundable credits, payroll and self-employment tax, state and local income tax, Supplemental Nutrition Assistance Program (SNAP), Supplemental Security Income (SSI), Temporary Assistance for Needy Families (TANF), school-meal eligibility, and person-level coverage eligibility for the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), Medicaid, the Children’s Health Insurance Program (CHIP), Medicare, Head Start, and Early Head Start. The source run also requested the ACA Premium Tax Credit (PTC), but explanation audits showed that the prompt could be misleading when households lacked plan-specific Marketplace information. We therefore preserve those raw responses but exclude PTC from the canonical scored leaderboard until the prompt contract is revised. Local income tax is retained as a displayed requested output, but it receives zero default population-impact weight in this snapshot because the full Enhanced CPS source used for weighting has no positive modeled local-income-tax records.
The output scope is intentionally narrower than the full PolicyEngine model. Table 3 summarizes the inclusion rule. The benchmark asks for WIC eligibility rather than a WIC dollar amount; WIC dollar values are used only as impact-weight proxies for coverage flags, not as requested model outputs.
Table 3: Output-selection rationale.
| Scope decision | Rationale |
| --- | --- |
| Included | Direct tax, credit, benefit, health-support, and coverage outputs that a household-facing model could plausibly be asked to estimate from household facts. |
| Excluded | Intermediate tax bases, payroll subcomponents, and outputs that mainly require unavailable history, restricted local market data, restricted program-administration data, or take-up assignment rather than rule calculation. |
| Binary coverage outputs | Requested as 0/1 eligibility flags and scored as classification tasks; their dollar values are used only as impact-weight proxies, not as requested model outputs. |
| WIC | The benchmark asks for person-level WIC eligibility. It does not ask models to estimate a WIC dollar amount. |
United Kingdom
The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. The resulting enhanced_cps_2025.h5 artifact is checked in to the public PolicyEngine/policyengine-uk-data GitHub repository; the manuscript pins commit 9514dfb7ec607897c9f7122a2e073b922c9fd8b6 so that a third party can retrieve the exact file used here. The artifact contains 28,532 households; 28,502 (99.9%) pass the eligibility filter that retains households with one benefit unit and one or two adults. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.
The UK data path is more synthetic than the enhanced FRS pipeline and inherits limitations from cross-country transfer, calibration choices, and the subset of variables that can be made prompt-visible. It supports the current public cross-country benchmark, but it is not equivalent to an enhanced-FRS-based benchmark and should not be used to make population-representative claims about UK households (Sutherland and Figari 2013).
Reference-output credibility
PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10’s data science team adapted PolicyEngine’s open-source microsimulation model for experimental policy simulation, with validation against external projections before use (Woodruff 2026; Ghenis 2026). In the US, PolicyEngine (2024) reports matching the National Bureau of Economic Research (NBER) TAXSIM-35 model (Feenberg and Coutts 1993) to the cent on the vast majority of cases for the 2021 tax year across hundreds of thousands of tax units per state, with state-specific differences documented in the integration tests. We do not restate that comparison as a single percentage because the published source uses qualitative phrasing rather than a headline accuracy number. PolicyEngine has also signed a memorandum of understanding (Ghenis and Makarchuk 2025) with the Federal Reserve Bank of Atlanta for future validation work against its Policy Rules Database (2026). The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.
This does not make PolicyEngine infallible. During benchmark development, we performed a manual, developer-led discrepancy review, with LLM assistance used to triage and summarize candidate cases surfaced by model explanations. Table 4 summarizes the main discrepancy classes reviewed before the frozen snapshot. In those reviewed classes, discrepancies reflected model mistakes, prompt ambiguity later corrected, or upstream data/model issues fixed before the frozen snapshot; the reviewed discrepancy classes did not identify unresolved PolicyEngine reference-output defects.
After freezing the snapshot and completing response-contract repairs, we annotated every wrong model-output row and every scenario-output case with at least one wrong row. Table 5 reports that all 6,771 rows receiving less than full score have row-level and case-level annotations. The final row-level source is llm_error; no frozen-snapshot wrong row remains classified as a prompt ambiguity, unresolved reference-model issue, unresolved reference-data issue, parse-contract failure, or needs-review item. The audit is still developer-led rather than independent external validation, but it is exhaustive over scored misses in the repaired frozen snapshot.
Table 4: Development discrepancy review before the frozen snapshot.
| Discrepancy class | Review outcome |
| --- | --- |
| US federal credits | Reviewed cases where explanations omitted EITC or refundable CTC; discrepancies reflected model credit-treatment errors, not identified PolicyEngine errors. |
| US SNAP, SSI, and Medicaid | Reviewed false-zero and false-positive explanations; common model failures used broad asset or income heuristics inconsistent with the supplied facts and the reference rules. |
| US payroll and overtime | Reviewed overtime-related cases during development; upstream data/model fixes and prompt clarifications were applied before the frozen snapshot. |
| UK Child Benefit | Reviewed HICBC-related misses; the benchmark now asks for gross Child Benefit before HICBC, with HICBC included in Income Tax. |
| UK National Insurance | Reviewed cases where models subtracted pension contributions or used wrong thresholds/rates; no reference-output defect was identified in the reviewed cases. |
| UK UC, Pension Credit, and PIP | Reviewed positive-case misses and false positives; explanations usually showed broad capital, age, or award-level heuristics rather than simulator-reference defects. |
Table 5: Exhaustive annotation coverage for scored misses in the frozen snapshot.
| Country | Wrong rows audited | LLM response errors | Unresolved prompt/reference rows | Missing row annotations | Missing case notes |
| --- | --- | --- | --- | --- | --- |
| US | 4039 | 4039 | 0 | 0 | 0 |
| UK | 2732 | 2732 | 0 | 0 | 0 |
| Total | 6771 | 6771 | 0 | 0 | 0 |
Results
United States leaderboard
The US leaderboard for the frozen manuscript snapshot is shown in Table 6. The top three models in that snapshot are GPT 5.5 (79.4 within-1%, 76.9 exact, 93.5 bounded), Grok 4.20 (77.6 within-1%, 76.7 exact, 92.3 bounded), and Gemini 3.1 Pro Preview (76.9 within-1%, 75.7 exact, 91.8 bounded).
Table 6: Top US benchmark models in the frozen manuscript snapshot.
| Model | Within 1% (%) | Exact (%) | Within 10% (%) | Bounded score | Parsed | Total |
| --- | --- | --- | --- | --- | --- | --- |
| GPT 5.5 | 79.4 | 76.9 | 88.9 | 93.5 | 2088 | 2088 |
| Grok 4.20 | 77.6 | 76.7 | 87.4 | 92.3 | 2088 | 2088 |
| Gemini 3.1 Pro Preview | 76.9 | 75.7 | 84.6 | 91.8 | 2088 | 2088 |
| Claude Sonnet 4.6 | 76.8 | 76.6 | 86.6 | 91.5 | 2088 | 2088 |
| Gemini 3 Flash Preview | 76.2 | 75.3 | 83.9 | 90.1 | 2088 | 2088 |
United Kingdom leaderboard
The UK leaderboard for the frozen manuscript snapshot is shown in Table 7. The top three models in that snapshot are GPT 5.5 (59.0 within-1%, 53.1 exact, 90.9 bounded), Gemini 3.1 Pro Preview (57.1 within-1%, 53.0 exact, 91.1 bounded), and Claude Sonnet 4.6 (56.5 within-1%, 52.3 exact, 89.6 bounded).
Table 7: Top UK benchmark models in the frozen manuscript snapshot.
| Model | Within 1% (%) | Exact (%) | Within 10% (%) | Bounded score | Parsed | Total |
| --- | --- | --- | --- | --- | --- | --- |
| GPT 5.5 | 59.0 | 53.1 | 82.4 | 90.9 | 700 | 700 |
| Gemini 3.1 Pro Preview | 57.1 | 53.0 | 82.2 | 91.1 | 700 | 700 |
| Claude Sonnet 4.6 | 56.5 | 52.3 | 78.2 | 89.6 | 700 | 700 |
| Gemini 3.5 Flash | 56.2 | 52.4 | 77.8 | 89.2 | 700 | 700 |
| Grok 4.20 | 55.7 | 52.8 | 80.9 | 90.0 | 700 | 700 |
Simple baselines
Simple baselines help interpret score levels in a zero-heavy benchmark. Table 8 reports an always-zero response and a median-reference-by-output response on the same frozen household sample. These are not model competitors; they show how much of either metric can be earned without household-specific policy calculation. They also show that neither the within-1% headline nor the exact-match deployability bar is trivial to clear: beating the always-zero baseline requires correct answers on positive-reference cases.
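The two baseline responses can be constructed as in the sketch below. It is a minimal illustration with hypothetical function names and made-up toy values; it only builds the baseline predictions, and any of the benchmark's scoring rules would then be applied to them.

```python
from collections import defaultdict
from statistics import median


def always_zero(reference_rows):
    """Predict 0 for every (output_name, reference_value) row."""
    return [0.0 for _ in reference_rows]


def median_by_output(reference_rows):
    """Predict each output's median reference value across the sample."""
    by_output = defaultdict(list)
    for output, ref in reference_rows:
        by_output[output].append(ref)
    medians = {output: median(refs) for output, refs in by_output.items()}
    return [float(medians[output]) for output, _ in reference_rows]


# Toy reference rows (illustrative values, not benchmark data).
rows = [
    ("snap", 0.0), ("snap", 0.0), ("snap", 3000.0),
    ("income_tax", 1000.0), ("income_tax", 2000.0), ("income_tax", 6000.0),
]
print(always_zero(rows))
print(median_by_output(rows))
```

For outputs where more than half of the references are zero, the median baseline coincides with the always-zero baseline, which is one reason the two baselines can score similarly on zero-heavy panels.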
Table 8: Simple baseline scores on the frozen manuscript snapshot.
| Baseline | US score | UK score |
| --- | --- | --- |
| Always zero | 62.3 | 54.0 |
| Median reference by output | 58.6 | 51.1 |
Sensitivity to benchmark view
The primary leaderboard gives each household equal weight within each country. Table 9 reports several alternative views, including the older equal-output-group score. GPT 5.5 ranks first in 8 of the 12 country views shown; the exceptions are the US positive-reference-only view, the UK bounded score, and the zero-reference-only views in both countries. Lower ranks move more across views. Country scores are therefore best read with the weighting choice visible rather than collapsed into one cross-country ordering.
Table 9: Leaderboard sensitivity to alternative scoring views.
| Country | View | Rank 1 | Rank 2 | Rank 3 |
| --- | --- | --- | --- | --- |
| US | Within-1% headline | GPT 5.5 (79.4) | Grok 4.20 (77.6) | Gemini 3.1 Pro Preview (76.9) |
| US | Bounded score | GPT 5.5 (93.5) | Grok 4.20 (92.3) | Gemini 3.1 Pro Preview (91.8) |
| US | Equal-output-group score | GPT 5.5 (95.1) | Gemini 3.1 Pro Preview (94.6) | Grok 4.20 (94.5) |
| US | Amount outputs only | GPT 5.5 (93.8) | Grok 4.20 (92.0) | Gemini 3.1 Pro Preview (92.0) |
| US | Positive reference cases only | Claude Opus 4.7 (76.7) | GPT 5.5 (76.2) | Gemini 3.5 Flash (75.9) |
| US | Zero reference cases only | Gemini 3.1 Pro Preview (98.9) | Grok 4.20 (98.6) | GPT 5.5 (98.3) |
| UK | Within-1% headline | GPT 5.5 (59.0) | Gemini 3.1 Pro Preview (57.1) | Claude Sonnet 4.6 (56.5) |
| UK | Bounded score | Gemini 3.1 Pro Preview (91.1) | GPT 5.5 (90.9) | Grok 4.20 (90.0) |
| UK | Equal-output-group score | GPT 5.5 (95.4) | Gemini 3.1 Pro Preview (94.5) | Claude Sonnet 4.6 (93.9) |
| UK | Amount outputs only | GPT 5.5 (95.4) | Gemini 3.1 Pro Preview (94.5) | Claude Sonnet 4.6 (93.9) |
| UK | Positive reference cases only | GPT 5.5 (79.2) | Claude Sonnet 4.6 (75.0) | Gemini 3.5 Flash (71.2) |
| UK | Zero reference cases only | Gemini 3 Flash Preview (99.2) | Grok 4.1 Fast (99.2) | GPT 5.5 (98.8) |
The bounded score is a global-output-weighted average of row scores. Amount outputs use max(0, 1 − |pred − ref| / |ref|) when the reference is nonzero and 1{pred = 0} when the reference is zero. Binary outputs use exact 0/1 matching. The exact-match rate uses the same weighting and aggregation but replaces the amount row score with a strict indicator: 1{pred = ref} to the dollar. Each output group’s default household-impact weight is computed from a full weighting population, not from the 100 benchmark households: full PolicyEngine US Enhanced CPS for the US and full PolicyEngine UK enhanced FRS for the UK. This weighting source is separate from the UK benchmark scenario source, which remains the public calibrated transfer dataset. For each source household, the share is |ref_ij| / max(|household_net_income_i|, Σ_k |ref_ik|); shares are averaged with calibrated household weights and renormalized so output weights sum to one. The max(...) denominator anchors per-household shares to net income when net income is the dominant flow and falls back to the gross tax-benefit flow only when programs cancel each other out, so a $1 benefit to a high-earner contributes essentially zero weight and per-household shares never exceed one. Booleans carry weight through PolicyEngine’s paired per-capita value (for example medicaid_value), so eligibility calls are graded as accuracy but weighted by the dollar stake. Table 10 compares the bounded-score ranking against two opt-in alternatives shown for transparency: equal weighting (every output the same) and budget-weighted (each output’s share of total absolute reference dollars in the same full weighting population).
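The household-impact weight construction described above can be sketched as follows. The record layout (`weight`, `net_income`, `refs`) and the function name `output_weights` are hypothetical, and treating a household with a zero denominator as contributing zero share is an assumption the text does not address.

```python
from collections import defaultdict


def output_weights(households):
    """Average per-household output shares, then renormalize to sum to one."""
    totals = defaultdict(float)
    weight_sum = 0.0
    for hh in households:
        gross_flow = sum(abs(v) for v in hh["refs"].values())
        # Anchor to net income unless the gross tax-benefit flow is larger.
        denom = max(abs(hh["net_income"]), gross_flow)
        weight_sum += hh["weight"]
        if denom == 0:
            continue  # Assumption: no income and no flows means zero share.
        for name, value in hh["refs"].items():
            totals[name] += hh["weight"] * abs(value) / denom
    averaged = {name: t / weight_sum for name, t in totals.items()}
    norm = sum(averaged.values())
    if norm == 0:
        return averaged
    return {name: share / norm for name, share in averaged.items()}


# Toy weighting population (illustrative values, not benchmark data).
population = [
    {"weight": 1.0, "net_income": 50000.0,
     "refs": {"income_tax": 5000.0, "snap": 0.0}},
    {"weight": 3.0, "net_income": 10000.0,
     "refs": {"income_tax": 0.0, "snap": 2000.0}},
]
print(output_weights(population))
```

Because the denominator is never smaller than the sum of absolute flows, each household's shares sum to at most one, matching the property stated in the text.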
Table 10: Top country ranks under the bounded secondary metric across three weightings (household-weighted, equal, and budget-weighted).
| Country | Weighting | Rank 1 | Rank 2 | Rank 3 |
| --- | --- | --- | --- | --- |
| US | Household-weighted (default) | GPT 5.5 (93.5) | Grok 4.20 (92.3) | Gemini 3.1 Pro Preview (91.8) |
| UK | Household-weighted (default) | Gemini 3.1 Pro Preview (91.1) | GPT 5.5 (90.9) | Grok 4.20 (90.0) |
| US | Equal weights | GPT 5.5 (95.1) | Grok 4.20 (94.5) | Gemini 3.1 Pro Preview (94.3) |
| UK | Equal weights | GPT 5.5 (95.4) | Gemini 3.1 Pro Preview (94.5) | Claude Sonnet 4.6 (93.9) |
| US | Budget-weighted | GPT 5.5 (91.0) | Claude Opus 4.7 (87.9) | Claude Sonnet 4.6 (86.2) |
| UK | Budget-weighted | Gemini 3.1 Pro Preview (89.4) | GPT 5.5 (89.0) | Claude Opus 4.7 (88.3) |
Hardest benchmark targets
The lowest-scoring US variables are multi-step tax quantities and sparse positive-dollar benefits. As shown in Table 11, state income tax before refundable credits, federal income tax before refundable credits, SNAP, state refundable credits, and federal refundable credits are the hardest US outputs by bounded score in the frozen manuscript snapshot.
Table 11: Hardest US variables by bounded score (the secondary tracking metric, used here because exact-match and within-1% rates on positive-reference cases of the hardest variables compress toward low percentages and do not discriminate well).
| Variable | Score | Within 1% | Exact | Within 10% |
|---|---|---|---|---|
| state_income_tax_before_refundable_credits | 69.5 | 40.0 | 40.8 | 54.7 |
| federal_income_tax_before_refundable_credits | 74.7 | 39.2 | 37.2 | 55.0 |
| snap | 80.4 | 73.2 | 72.8 | 76.6 |
| state_refundable_credits | 82.9 | 80.7 | 80.3 | 81.1 |
| federal_refundable_credits | 85.6 | 76.1 | 75.5 | 79.7 |
The same pattern appears in the UK run. Income tax and National Insurance are the two hardest UK outputs, while zero-heavy benefits and capital gains tax score higher overall.
Table 12: Hardest UK variables by bounded score (the secondary tracking metric, used here because exact-match and within-1% rates on positive-reference cases of the hardest variables compress toward low percentages and do not discriminate well).
| Variable | Score | Within 1% | Exact | Within 10% |
|---|---|---|---|---|
| income_tax | 73.6 | 26.7 | 21.5 | 52.3 |
| national_insurance | 82.6 | 44.1 | 44.5 | 70.4 |
| universal_credit | 86.9 | 78.0 | 77.3 | 82.2 |
| pension_credit | 93.7 | 93.2 | 93.2 | 93.2 |
| child_benefit | 95.6 | 65.2 | 63.1 | 92.8 |
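The four columns in Tables 11 and 12 all derive from per-row comparisons of a prediction with its reference value. The sketch below is illustrative only. The function names are hypothetical, the aggregation is a plain unweighted mean over rows, rounding to the whole currency unit is one assumed reading of "to the dollar", and counting a zero-reference row as a tolerance hit only when the prediction is also zero is an assumption borrowed from the bounded-score rule.

```python
from typing import Dict, Sequence


def bounded_row_score(pred: float, ref: float) -> float:
    """max(0, 1 - |pred - ref| / |ref|) for nonzero refs, 1{pred = 0} for zero refs."""
    if ref == 0:
        return 1.0 if pred == 0 else 0.0
    return max(0.0, 1.0 - abs(pred - ref) / abs(ref))


def exact_row_score(pred: float, ref: float) -> float:
    """Strict indicator; assumes both values are compared after rounding to whole units."""
    return 1.0 if round(pred) == round(ref) else 0.0


def within_tolerance(pred: float, ref: float, tol: float) -> float:
    """1 if the relative error is at most tol; zero refs require a zero prediction (assumed)."""
    if ref == 0:
        return 1.0 if pred == 0 else 0.0
    return 1.0 if abs(pred - ref) <= tol * abs(ref) else 0.0


def summarize_variable(preds: Sequence[float], refs: Sequence[float]) -> Dict[str, float]:
    """Unweighted percentage summaries for one output variable."""
    if len(preds) != len(refs) or not refs:
        raise ValueError("preds and refs must be non-empty and the same length")
    n = len(refs)
    pairs = list(zip(preds, refs))
    return {
        "score": 100.0 * sum(bounded_row_score(p, r) for p, r in pairs) / n,
        "within_1": 100.0 * sum(within_tolerance(p, r, 0.01) for p, r in pairs) / n,
        "exact": 100.0 * sum(exact_row_score(p, r) for p, r in pairs) / n,
        "within_10": 100.0 * sum(within_tolerance(p, r, 0.10) for p, r in pairs) / n,
    }


# Hypothetical rows: a correct zero, a 5% miss, an exact hit, and a 150% overshoot.
summary = summarize_variable([0.0, 95.0, 100.0, 250.0], [0.0, 100.0, 100.0, 100.0])
print(summary)
```

The 5% miss earns 0.95 on the bounded score and counts toward within-10% but not within-1% or exact match, which is why the bounded score can sit well above the hit rates for the same variable.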
Zero and positive cases
Overall hit rates can overstate performance on sparse programs because correct zeros are common. Table 13 and Table 14 therefore report within-10% performance for all cases, positive-reference cases, and zero-reference cases. The gap is large for several outputs: models often identify that a program or tax does not apply, but miss the amount when the reference value is positive.
Table 13: US within-10% performance by zero versus positive reference cases.
| Variable | All cases | Positive cases | Zero cases |
|---|---|---|---|
| state_income_tax_before_refundable_credits | 54.7 | 27.2 | 94.2 |
| federal_income_tax_before_refundable_credits | 55.0 | 30.3 | 93.7 |
| snap | 76.6 | 15.4 | 97.0 |
| federal_refundable_credits | 79.7 | 24.2 | 94.4 |
| state_refundable_credits | 81.1 | 4.9 | 99.0 |
| payroll_tax | 85.4 | 82.2 | 95.5 |
| person_medicaid_eligible | 89.6 | 74.9 | 95.4 |
| person_head_start_eligible | 92.3 | 84.6 | 92.5 |
Table 14: UK within-10% performance by zero versus positive reference cases.
| Variable | All cases | Positive cases | Zero cases |
|---|---|---|---|
| income_tax | 52.3 | 40.2 | 93.0 |
| national_insurance | 70.4 | 48.9 | 95.7 |
| universal_credit | 82.2 | 23.4 | 97.9 |
| child_benefit | 92.8 | 80.7 | 100.0 |
| pension_credit | 93.2 | 1.3 | 99.1 |
| capital_gains_tax | 93.8 | 24.0 | 99.8 |
| pip | 99.0 | n/a | 99.0 |
Figure 1: Output-level performance on zero-reference and positive-reference cases.
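The split reported in Tables 13 and 14 can be reproduced from row-level predictions with a short helper. This is a sketch under assumptions rather than the benchmark code: the names are hypothetical, rows are unweighted, "positive" is taken to mean any nonzero reference, and a group with no rows returns `None` because its rate is undefined.

```python
from typing import Dict, List, Optional, Sequence


def within_tolerance(pred: float, ref: float, tol: float) -> bool:
    """Relative-error hit; a zero reference requires a zero prediction (assumed)."""
    if ref == 0:
        return pred == 0
    return abs(pred - ref) <= tol * abs(ref)


def _rate(hits: List[bool]) -> Optional[float]:
    """Percentage of hits, or None when the group is empty."""
    return 100.0 * sum(hits) / len(hits) if hits else None


def split_hit_rates(
    preds: Sequence[float], refs: Sequence[float], tol: float = 0.10
) -> Dict[str, Optional[float]]:
    """Hit rates for all rows, nonzero-reference rows, and zero-reference rows."""
    all_hits: List[bool] = []
    positive_hits: List[bool] = []
    zero_hits: List[bool] = []
    for pred, ref in zip(preds, refs):
        hit = within_tolerance(pred, ref, tol)
        all_hits.append(hit)
        (zero_hits if ref == 0 else positive_hits).append(hit)
    return {
        "all": _rate(all_hits),
        "positive": _rate(positive_hits),
        "zero": _rate(zero_hits),
    }


# Hypothetical rows: two correct zeros, one large miss, one exact hit, one false zero.
rates = split_hit_rates([0.0, 0.0, 50.0, 100.0, 0.0], [0.0, 0.0, 100.0, 100.0, 100.0])
print(rates)
```

A model that predicts zero on every row would score 100 on the zero-reference group and 0 on the nonzero group, so its all-cases rate would equal the share of zero references. That is the mechanism by which overall hit rates can flatter performance on sparse programs.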
Failure modes
The benchmark surfaces four recurring failure patterns.
First, models miss positive tax and benefit quantities more often than zero cases. In the US, state income tax before refundable credits, federal income tax before refundable credits, SNAP, state refundable credits, and federal refundable credits are the lowest-scoring outputs by bounded score. These require the model to choose the right income concepts, exclusions, program thresholds, and sequencing before applying any final subtraction.
Second, the UK benchmark shows the same split between tax calculations and many benefit outputs. Income tax and National Insurance score below the benefit outputs in the frozen manuscript snapshot. Positive Universal Credit and Pension Credit cases remain difficult, so the result should not be read as a general claim that benefits are easy.
Third, joint accuracy across interacting components is lower than marginal accuracy on either component. Table 15 shows within-10% accuracy for federal_refundable_credits, state_refundable_credits, and the conjunction of both within the same household. The joint hit rate is consistently lower than either marginal hit rate, so leaderboard scores that average across outputs overstate how often a model gets a single household’s federal/state credit allocation jointly correct.
Table 15: US within-10% accuracy on federal vs state refundable credits and the household-level joint.
| Model | Federal within 10% | State within 10% | Joint within 10% |
|---|---|---|---|
| GPT 5.5 | 94.0 | 82.0 | 77.0 |
| Gemini 3.1 Pro Preview | 88.0 | 83.0 | 76.0 |
| Grok 4.20 | 88.0 | 83.0 | 76.0 |
| Claude Opus 4.7 | 87.0 | 78.0 | 71.0 |
| Claude Sonnet 4.6 | 87.0 | 80.0 | 71.0 |
| GPT 5.4 nano | 79.0 | 81.0 | 71.0 |
| Gemini 3.5 Flash | 78.0 | 81.0 | 69.0 |
| Gemini 3 Flash Preview | 77.0 | 81.0 | 66.0 |
| Gemini 3.1 Flash Lite Preview | 75.0 | 81.0 | 66.0 |
| Grok 4.3 | 75.0 | 81.0 | 66.0 |
| GPT 5.4 mini | 73.0 | 81.0 | 65.0 |
| Claude Haiku 4.5 | 70.0 | 81.0 | 62.0 |
| Grok 4.1 Fast | 65.0 | 81.0 | 57.0 |
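The relationship between the marginal and joint columns of Table 15 can be made concrete with a small helper. The sketch assumes one boolean hit flag per household for each output, aligned by position; the function name and the toy flags are hypothetical.

```python
from typing import Dict, Sequence


def marginal_and_joint(
    federal_hits: Sequence[bool], state_hits: Sequence[bool]
) -> Dict[str, float]:
    """Marginal hit rates and the household-level joint hit rate, in percent."""
    if len(federal_hits) != len(state_hits) or not federal_hits:
        raise ValueError("hit lists must be non-empty and aligned by household")
    n = len(federal_hits)
    joint_count = sum(1 for f, s in zip(federal_hits, state_hits) if f and s)
    return {
        "federal": 100.0 * sum(federal_hits) / n,
        "state": 100.0 * sum(state_hits) / n,
        "joint": 100.0 * joint_count / n,
    }


# Four hypothetical households: the two outputs miss on different households.
result = marginal_and_joint(
    [True, True, False, True],
    [True, False, True, True],
)
print(result)
```

By construction the joint rate cannot exceed the smaller marginal rate, and it is strictly lower whenever the two outputs miss on different households, as in the toy example where both marginals are 75% and the joint rate is 50%.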
Fourth, structured-output reliability is part of the benchmark contract. Missing or unparseable numeric values count as failures and are not dropped from the denominator. Appendix A documents parser recovery, bounded full-response retries, and row-level repairs that brought the canonical manuscript snapshot to zero missing numeric values or explanations while preserving the failed attempts.
Limitations
PolicyBench is not a substitute for a production tax-and-benefit calculator. Several caveats matter:
- output-contract reliability required a repair workflow, and the raw failed attempts are preserved separately from the repaired canonical prediction files
- zero-heavy outputs require separate positive-case interpretation
- the choice of primary metric (within-1%), deployability bar (exact match), and secondary tracking metric (bounded score) is one benchmark view; all three are reported, but downstream evaluations should also consider error-magnitude metrics (mean absolute error, mean absolute percentage error) where appropriate
- cross-country comparison is descriptive only; this snapshot reports separate US and UK leaderboards rather than a combined cross-country rank
- the public UK calibrated transfer dataset is not equivalent to enhanced Family Resources Survey quality and is not population-representative UK microdata
- the public scenario explorer exposes the current test set and reference outputs, so open-set leakage is a prominent limitation rather than a minor implementation detail
- the 100-household manuscript snapshot should be treated as a preview until larger frozen runs are published
- the scored-miss audit is exhaustive over this frozen snapshot, but it is developer-led and not an independent validation of every reference value
- benchmark success should not be interpreted as policy-advice readiness
The current paper is therefore an evaluation of model performance under a specific structured-output benchmark, not a general certification of tax or benefit competence.
Conclusion
PolicyBench shows a consistent pattern across both countries. Models often identify non-applicability, but positive tax and benefit amounts remain difficult, especially for multi-step income tax, payroll tax, National Insurance, and positive benefit cases. In the frozen manuscript snapshot, GPT 5.5 is the top-scoring model on the within-1% headline metric in both the US and UK.
These results support a narrow conclusion: unaided frontier models still struggle to reproduce selected household-level microsimulation outputs under a structured public benchmark. They do not show that models cannot assist policy analysis, and they do not validate PolicyEngine outputs as administrative truth. They suggest that future evaluations should separate no-tool estimation, tool-using system design, and reference-output validation more explicitly.
Next steps are to expand country coverage, increase frozen sample sizes, improve UK data provenance, and add protected or rotating evaluation sets so that public rankings are less exposed to open-set leakage. The benchmark should also continue reporting sensitivity views, because country scores are useful summaries only when their weighting choices are visible.
Appendix A: Structured-output audit
The benchmark requires one numeric value and one explanation for every requested output. Parse failures are benchmark failures, not missing data to drop from the denominator. Before publishing the manuscript snapshot, we audited missing numeric values and explanations, extended the parser only where explicit variable-keyed value and explanation fields were recoverable, retried broken full responses, and then ran targeted row repairs for the remaining missing rows. Table 16 summarizes that sequence.
Table 16: Structured-output parser audit and repair sequence.
| Step | Finding |
|---|---|
| Initial parse-contract audit | The source run was audited for rows missing a parsed numeric value or non-empty explanation; those rows were not dropped from the denominator. |
| Parser repair | The parser was extended to recover explicit `value` and non-empty `explanation` blocks from nested, escaped, or partially truncated provider JSON without scraping prose numbers. |
| Full-response retries | Three bounded retry rounds targeted broken country-model-household responses and accepted only fully valid replacement responses. |
| Row-level repairs | A final repair pass retried only rows still missing a parsed numeric value or non-empty explanation, using the same model, household, and output. |
| Final parse coverage | The repaired manuscript snapshot has zero missing parsed numeric values and zero missing explanations across all 36,244 canonical model-output rows. |
| Preservation rule | The snapshot retains response-retry and row-repair targets, attempts, accepted replacements, rejected rows, and merged prediction files. |
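The rule that parse failures stay in the denominator can be illustrated by scoring the same rows two ways. This is a minimal sketch with hypothetical names and toy rows; the `drop_failures` switch exists only to show the contrast and is not described in the paper as a supported scoring mode.

```python
import math
from typing import Optional, Sequence, Tuple

# One row: (parsed value or None, explanation text, reference value).
Row = Tuple[Optional[float], str, float]


def satisfies_contract(value: object, explanation: object) -> bool:
    """A row needs a finite numeric value and a non-empty explanation."""
    numeric = (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )
    return numeric and isinstance(explanation, str) and explanation.strip() != ""


def is_hit(pred: float, ref: float, tol: float) -> bool:
    """Relative-error hit; a zero reference requires a zero prediction (assumed)."""
    if ref == 0:
        return pred == 0
    return abs(pred - ref) <= tol * abs(ref)


def hit_rate(rows: Sequence[Row], tol: float = 0.01, drop_failures: bool = False) -> Optional[float]:
    """Percent of rows that hit; contract failures count as misses unless dropped."""
    hits = 0
    denominator = 0
    for value, explanation, ref in rows:
        if not satisfies_contract(value, explanation):
            if not drop_failures:
                denominator += 1  # failure stays in the denominator as a miss
            continue
        denominator += 1
        hits += int(is_hit(value, ref, tol))
    return 100.0 * hits / denominator if denominator else None


# Hypothetical rows: two correct answers and one unparseable response.
rows = [(100.0, "standard case", 100.0), (None, "", 100.0), (100.0, "same rule", 100.0)]
print(hit_rate(rows))
print(hit_rate(rows, drop_failures=True))
```

Dropping the unparseable row would lift the toy hit rate from about 66.7% to 100%, which is the inflation the contract is designed to rule out.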
The main full-response denominator is the country-model-household response: one model answering one household in one country. We first retried broken responses at that level, accepting a retry only when the whole response satisfied the numeric-and-explanation contract. Several outputs in the same response are mechanically related, so this full-response pass avoided mixing answer sets where possible. Table 17 summarizes the preserved retry rounds.
Table 17: Full-response retry rounds preserved with the paper snapshot.
| Round | Country | Target responses | Accepted responses | Rejected responses | Estimated cost |
|---|---|---|---|---|---|
| Round 1 | US | 138 | 32 | 106 | $19.03 |
| Round 1 | UK | 184 | 36 | 148 | $4.94 |
| Round 2 | US | 9 | 0 | 9 | $0.70 |
| Round 2 | UK | 32 | 0 | 32 | $0.39 |
| Round 3 | US | 106 | 2 | 104 | $17.47 |
| Round 3 | UK | 148 | 16 | 132 | $4.28 |
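The acceptance rule for full-response retries (replace a broken response only when the whole replacement satisfies the contract, and keep a record of every attempt) can be sketched as follows. Everything here is an assumption for illustration: the response layout, the `call_model` callable, and the loop that retargets whatever is still broken in each round. How each round in Table 17 was actually targeted is not specified in the text and is not modeled.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# Hypothetical layout: output name -> {"value": ..., "explanation": ...}.
Response = Dict[str, dict]


def row_is_valid(row: object) -> bool:
    """A row needs a finite numeric value and a non-empty explanation."""
    if not isinstance(row, dict):
        return False
    value, explanation = row.get("value"), row.get("explanation")
    numeric = (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )
    return numeric and isinstance(explanation, str) and explanation.strip() != ""


def response_is_valid(response: object, required_outputs: Sequence[str]) -> bool:
    """The whole response is valid only if every required output row is valid."""
    return isinstance(response, dict) and all(
        row_is_valid(response.get(name)) for name in required_outputs
    )


def retry_broken_responses(
    responses: Dict[Hashable, Response],
    required_outputs: Sequence[str],
    call_model: Callable[[Hashable], Response],
    max_rounds: int = 3,
) -> Tuple[Dict[Hashable, Response], List[dict]]:
    """Retry broken responses for a bounded number of rounds, logging every attempt."""
    merged = dict(responses)
    attempts: List[dict] = []
    for round_no in range(1, max_rounds + 1):
        targets = [k for k, r in merged.items() if not response_is_valid(r, required_outputs)]
        if not targets:
            break
        for key in targets:
            candidate = call_model(key)
            accepted = response_is_valid(candidate, required_outputs)
            # Rejected candidates are logged, never merged.
            attempts.append(
                {"round": round_no, "key": key, "accepted": accepted, "response": candidate}
            )
            if accepted:
                merged[key] = candidate
    return merged, attempts
```

Accepting only whole responses keeps mechanically related outputs from the same answer set together; rows that remain broken after the final round are left for a separate row-level pass.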
The full-response pass did not eliminate every missing row. The final row-level repair pass retried the same model on the same household-output target and accepted only rows with both a parsed numeric value and non-empty explanation. These repairs are part of the canonical manuscript snapshot, not a separate leaderboard condition. Table 18 reports the row-level repair counts and confirms that the final repaired prediction files have zero missing output rows.
Table 18: Row-level repairs preserved with the paper snapshot.
| Round | Country | Target rows | Accepted row repairs | Rejected row repairs | Final missing rows | Estimated cost |
|---|---|---|---|---|---|---|
| Round 1 | US | 658 | 658 | 0 | 0 | $2.82 |
| Round 1 | UK | 412 | 412 | 0 | 0 | $1.15 |
Competing interests
The author is affiliated with PolicyEngine, which develops the microsimulation software used to produce the benchmark reference outputs.
References
Abbood, Auss, Zaiqiao Meng, and Nigel Collier. 2025. “Time to Revisit Exact Match.” Findings of the Association for Computational Linguistics: EMNLP 2025 (Suzhou, China), 11903–26. https://doi.org/10.18653/v1/2025.findings-emnlp.637.
Bock, Michael R., Kara Molisee, Zachary Ozer, and Sumit Shah. 2025. “TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task.” arXiv Preprint arXiv:2507.16126, ahead of print. https://doi.org/10.48550/arXiv.2507.16126.
Chen, Zhiyu, Wenhu Chen, Charese Smiley, et al. 2021. “FinQA: A Dataset of Numerical Reasoning over Financial Data.” arXiv Preprint arXiv:2109.00122. https://arxiv.org/abs/2109.00122.
Cobbe, Karl, Vineet Kosaraju, Mohammad Bavarian, et al. 2021. “Training Verifiers to Solve Math Word Problems.” arXiv Preprint arXiv:2110.14168. https://arxiv.org/abs/2110.14168.
Guha, Neel, Julian Nyarko, Daniel E. Ho, et al. 2023. “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models.” arXiv Preprint arXiv:2308.11462. https://arxiv.org/abs/2308.11462.
Hendrycks, Dan, Collin Burns, Saurav Kadavath, et al. 2021. “Measuring Mathematical Problem Solving with the MATH Dataset.” NeurIPS Datasets and Benchmarks Track. https://arxiv.org/abs/2103.03874.
Holzenberger, Nils, Benjamin Van Durme, Sarah Lawsky, and Kyle Richardson. 2021. “Factoring Statutory Reasoning as Language Understanding Challenges.” arXiv Preprint arXiv:2105.07903. https://arxiv.org/abs/2105.07903.
Shanahan, Catherine, Emma McCarthy, Yan Zhao, et al. 2025. “Performance of LLMs on VITA Test: Potential for AI-Assisted Tax Returns for Low Income Taxpayers.” Artificial Intelligence and Law, ahead of print. https://doi.org/10.1007/s10506-025-09465-7.
Shorten, Connor, Charles Pierse, Thomas Benjamin Smith, et al. 2024. “StructuredRAG: JSON Response Formatting with Large Language Models.” arXiv Preprint arXiv:2408.11061, ahead of print. https://doi.org/10.48550/arXiv.2408.11061.
Sutherland, Holly, and Francesco Figari. 2013. “EUROMOD: The European Union Tax-Benefit Microsimulation Model.” International Journal of Microsimulation 6 (1): 4–26. https://doi.org/10.34196/ijm.00075.
Zhou, Ruiwen, Wenyue Hua, Liangming Pan, et al. 2025. “RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vienna, Austria), 550–72. https://doi.org/10.18653/v1/2025.acl-long.27.
Zhu, Fengbin, Wenqiang Lei, Youcheng Huang, et al. 2021. “TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance.” arXiv Preprint arXiv:2105.07624. https://arxiv.org/abs/2105.07624.