Pre-release: These results are provisional. We plan to rerun PolicyBench with updated data and improved prompts before the final release.
PolicyBench

Testing how accurately language models calculate household taxes and benefits.

13 Models
100 Households
18 Outputs
Snapshot 2026-05-20

Leaderboard

Model rankings

13 models, ranked by within-1% hit rate score (United States)

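| Rank | Model | Within-1% hit rate |
|---|---|---|
| 1 | GPT-5.5 | 79.4% |
| 2 | Grok 4.20 | 77.6% |
| 3 | Gemini 3.1 Pro Preview | 76.9% |
| 4 | Claude Sonnet 4.6 | 76.8% |
| 5 | Gemini 3 Flash Preview | 76.2% |
| 6 | Claude Opus 4.7 | 75.8% |
| 7 | Grok 4.3 | 75.3% |
| 8 | Gemini 3.1 Flash Lite Preview | 74.2% |
| 9 | Gemini 3.5 Flash | 74.1% |
| 10 | GPT-5.4 mini | 72.5% |
| 11 | Claude Haiku 4.5 | 70.9% |
| 12 | Grok 4.1 Fast | 70.9% |
| 13 | GPT-5.4 nano | 60.9% |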
Options · scoring: within 1% · cases: all cases · weighting: household · all 18 programs
Programs

Filter restricts model scoring and the program breakdown table to the selected outputs. The scenario explorer remains unfiltered so each household's full prompt stays visible. Weights rescale to 100% over the active set.
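The rescaling step described above can be sketched as follows. This is an illustrative sketch, not the app's implementation: the function name `rescale_weights` is hypothetical, and the equal-weight fallback for an all-zero selection is an assumption borrowed from the fallback described in the methodology notes. The weights are the Household-column values for SNAP, SSI, TANF, and local income tax.

```python
from typing import Dict, Iterable


def rescale_weights(weights: Dict[str, float], active: Iterable[str]) -> Dict[str, float]:
    """Rescale the weights of the selected outputs so they sum to 1.

    Hypothetical helper: if every selected weight is zero, fall back to
    equal weights (an assumption, not confirmed by the source).
    """
    subset = {name: weights[name] for name in active}
    total = sum(subset.values())
    if total == 0:
        return {name: 1.0 / len(subset) for name in subset}
    return {name: value / total for name, value in subset.items()}


# Household-column weights (in percent) copied from the per-variable weights.
household_weights = {"SNAP": 7.18, "SSI": 2.14, "TANF": 0.43, "Local income tax": 0.00}

# Filtering to SNAP and SSI only: SNAP carries about 77% of the rescaled weight.
rescaled = rescale_weights(household_weights, ["SNAP", "SSI"])
```

Under these assumptions, selecting only SNAP and SSI gives SNAP roughly 77% of the active weight and SSI the remaining 23%.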

Scoring
Percent within 1% of reference.
Reference cases
All reference cells.
Weighting
Population household-impact weights — each output group's share is |ref| / max(|household_net_income|, Σ|ref|) in the full weighting population, averaged with household weights and renormalized before scoring each benchmark household. US weights use the full Enhanced CPS; UK weights use the full enhanced FRS. The UK benchmark scenarios themselves still come from the public calibrated transfer dataset.
Per-variable weights (18 variables, sorted by Household)

| Variable | Household | Aggregate | Equal |
|---|---|---|---|
| Person-level Medicaid eligibility | 30.29% | 7.41% | 5.56% |
| Federal tax before refundable credits | 16.70% | 61.48% | 5.56% |
| Payroll tax | 14.75% | 9.28% | 5.56% |
| Person-level Medicare eligibility | 11.63% | 3.96% | 5.56% |
| SNAP | 7.18% | 1.13% | 5.56% |
| State tax before refundable credits | 5.42% | 11.06% | 5.56% |
| Federal refundable credits | 4.44% | 1.34% | 5.56% |
| Self-employment tax | 2.57% | 1.91% | 5.56% |
| SSI | 2.14% | 0.79% | 5.56% |
| State refundable credits | 1.23% | 0.26% | 5.56% |
| Free school meals eligibility | 1.22% | 0.42% | 5.56% |
| Person-level Early Head Start eligibility | 0.61% | 0.30% | 5.56% |
| Person-level WIC eligibility | 0.48% | 0.15% | 5.56% |
| Person-level Head Start eligibility | 0.44% | 0.21% | 5.56% |
| TANF | 0.43% | 0.11% | 5.56% |
| Person-level CHIP eligibility | 0.34% | 0.14% | 5.56% |
| Reduced-price school meals eligibility | 0.13% | 0.04% | 5.56% |
| Local income tax | 0.00% | 0.00% | 5.56% |
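The household-impact weighting described under Weighting can be sketched roughly as below. This is a minimal sketch under stated assumptions: the function names, the dictionary layout (`weight`, `net_income`, `refs`), and the sample numbers are all hypothetical, and the real pipeline runs over the full Enhanced CPS rather than two toy households.

```python
from typing import Dict, List


def impact_shares(refs: Dict[str, float], net_income: float) -> Dict[str, float]:
    """Per-output share |ref| / max(|household_net_income|, sum of |ref|)."""
    denom = max(abs(net_income), sum(abs(v) for v in refs.values()))
    if denom == 0:
        # Assumption: a household with no income and no flows contributes nothing.
        return {name: 0.0 for name in refs}
    return {name: abs(v) / denom for name, v in refs.items()}


def population_weights(households: List[dict]) -> Dict[str, float]:
    """Average shares using household weights, then renormalize to sum to 1."""
    weight_sum = sum(h["weight"] for h in households)
    totals: Dict[str, float] = {}
    for h in households:
        for name, share in impact_shares(h["refs"], h["net_income"]).items():
            totals[name] = totals.get(name, 0.0) + h["weight"] * share
    averaged = {name: total / weight_sum for name, total in totals.items()}
    norm = sum(averaged.values())
    return {name: (v / norm if norm else 0.0) for name, v in averaged.items()}


# Two hypothetical households with made-up reference values.
toy_population = [
    {"weight": 1.0, "net_income": 100.0, "refs": {"federal_tax": 50.0, "snap": 0.0}},
    {"weight": 1.0, "net_income": 10.0, "refs": {"federal_tax": 20.0, "snap": 20.0}},
]
toy_weights = population_weights(toy_population)
```

In this toy population the first household is dominated by net income and the second by gross program flows, so the two denominators differ; after renormalization `federal_tax` carries two thirds of the weight and `snap` one third.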
By program

Program breakdown

Bounded score by program and model (AI alone, without tools). Dollar targets use continuous relative-error partial credit; binary coverage flags use exact accuracy.

Program filter: All 18 programs

Shared with model scoring. The scenario explorer remains unfiltered so each household's full prompt stays visible. The table shows only selected outputs; model scores rescale selected weights to 100%.

| Program | Claude Opus 4.7 | Claude Sonnet 4.6 | Claude Haiku 4.5 | Grok 4.3 | Grok 4.20 | Grok 4.1 Fast | GPT-5.5 | GPT-5.4 mini | GPT-5.4 nano | Gemini 3.1 Pro Preview | Gemini 3.5 Flash | Gemini 3 Flash Preview | Gemini 3.1 Flash Lite Preview | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Local income tax | 100% | 100% | 95% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 96% | 99% |
| SSI | 99% | 100% | 98% | 98% | 99% | 97% | 100% | 98% | 98% | 99% | 99% | 99% | 99% | 99% |
| Self-employment tax | 99% | 99% | 97% | 98% | 99% | 99% | 100% | 98% | 95% | 99% | 99% | 98% | 97% | 98% |
| TANF | 97% | 99% | 98% | 97% | 98% | 98% | 98% | 97% | 98% | 98% | 98% | 97% | 97% | 98% |
| Reduced-price school meals eligibility | 99% | 99% | 93% | 100% | 99% | 97% | 97% | 98% | 97% | 98% | 97% | 98% | 98% | 98% |
| Free school meals eligibility | 99% | 99% | 92% | 97% | 99% | 97% | 100% | 96% | 92% | 100% | 99% | 100% | 98% | 98% |
| Person-level WIC eligibility | 96% | 99% | 94% | 99% | 99% | 92% | 100% | 92% | 97% | 100% | 100% | 100% | 97% | 97% |
| Person-level Medicare eligibility | 96% | 98% | 95% | 99% | 99% | 97% | 96% | 95% | 92% | 96% | 97% | 97% | 99% | 96% |
| Person-level Early Head Start eligibility | 93% | 95% | 90% | 95% | 98% | 88% | 95% | 93% | 95% | 98% | 100% | 100% | 93% | 95% |
| Person-level CHIP eligibility | 88% | 87% | 89% | 95% | 98% | 92% | 95% | 90% | 99% | 99% | 98% | 98% | 100% | 94% |
| Payroll tax | 98% | 98% | 92% | 98% | 98% | 91% | 98% | 89% | 69% | 97% | 96% | 97% | 98% | 94% |
| Person-level Head Start eligibility | 93% | 95% | 88% | 90% | 95% | 90% | 95% | 81% | 93% | 100% | 95% | 93% | 90% | 92% |
| Person-level Medicaid eligibility | 90% | 94% | 88% | 93% | 96% | 87% | 96% | 85% | 70% | 94% | 93% | 95% | 86% | 90% |
| Federal refundable credits | 93% | 93% | 76% | 81% | 95% | 72% | 97% | 77% | 79% | 93% | 87% | 85% | 84% | 86% |
| State refundable credits | 80% | 83% | 81% | 81% | 86% | 81% | 86% | 81% | 81% | 88% | 85% | 83% | 82% | 83% |
| SNAP | 82% | 82% | 73% | 78% | 86% | 77% | 85% | 80% | 75% | 85% | 83% | 81% | 80% | 80% |
| Federal tax before refundable credits | 88% | 83% | 66% | 72% | 82% | 61% | 90% | 60% | 60% | 83% | 78% | 77% | 71% | 75% |
| State tax before refundable credits | 75% | 80% | 56% | 72% | 78% | 57% | 83% | 54% | 48% | 78% | 77% | 74% | 70% | 70% |
Deep dive

Scenario explorer

Inspect benchmark households, reference outputs, model answers, and the exact prompt sent to every model.

Model answer columns: Opus 4.7, Grok 4.3, GPT-5.5, Pro Preview

| Program | Reference |
|---|---|
| Federal tax before refundable credits | $4,455 |
| Federal refundable credits | $0 |
| Free school meals eligibility | No |
| Head CHIP eligibility | No |
| Head Medicaid eligibility | No |
| Head Medicare eligibility | No |
| Head WIC eligibility | No |
| Local income tax | $0 |
| Payroll tax | $3,060 |
| Reduced-price school meals eligibility | No |
| Self-employment tax | $6,358 |
| SNAP | $0 |
| Spouse CHIP eligibility | No |
| Spouse Medicaid eligibility | No |
| Spouse Medicare eligibility | No |
| Spouse WIC eligibility | No |
| SSI | $0 |
| State tax before refundable credits | $0 |
| State refundable credits | $0 |
| TANF | $0 |
Explanation and audit coverage

260 of 260 model-output rows for this household include explanation text returned by the model. 24 rows include developer audit notes for incorrect predictions, and 24 incorrect rows include case-level notes comparing wrong models on the same household-output target.

LLM: 24
Exact prompt
Full household batch contract for all benchmark outputs
Provider-specific structured-output transport, no external tool
Estimate the requested tax and benefit outputs using only the household facts below. All listed people live together and are in one household group for tax and benefit calculations. All listed facts describe the full tax-benefit year. Treat demographic, work, student, disability, housing, health coverage, and household-composition facts as constant throughout the tax-benefit year, with no within-year income volatility or status changes. Gross wage and salary amounts are annual totals, including any overtime pay; hourly wage is a straight-time rate when listed. Treat any unlisted numeric input as 0 and any other unlisted household fact, boolean, or status input as false. Assume tax filing and program take-up when required. Do not infer unlisted income, expenses, assets, benefit receipt, rent, or health coverage.

Household:
- state: TX
- tax year: 2026

Head:
- age: 63
- has employer-sponsored insurance
- usual weekly hours worked: 45
- other medical expenses: $300
- over-the-counter health expenses: $20
- real estate taxes: $9,500
- self-employment income: $45,000

Spouse:
- age: 61
- gross wages and salaries: $40,000
- bank account assets: $400
- employer sponsored insurance premiums: $18,208
- has employer-sponsored insurance
- health insurance premiums excluding Medicare Part B: $3,000
- hourly wage: $19
- usual weekly hours worked: 40
- other health insurance premiums: $3,000
- other medical expenses: $6,000
- over-the-counter health expenses: $1,200

Household inputs:
- household vehicles value: $40,500

Provide the following policy quantities for this household:
- federal_income_tax_before_refundable_credits: federal individual income tax after nonrefundable credits and before refundable credits. This subtracts nonrefundable credits actually used, including CDCC and the nonrefundable portion of CTC or other credits when applicable; it does not subtract EITC or refundable portions of credits such as refundable CTC
- federal_refundable_credits: total refundable federal income tax credits, including EITC and refundable portions of credits such as refundable CTC when applicable; exclude the ACA Premium Tax Credit
- payroll_tax: annual household employee-side payroll tax: employee Social Security tax, employee Medicare tax, Additional Medicare Tax, and mandatory employee state payroll taxes. Exclude employer payroll taxes, FUTA, employer unemployment-insurance taxes, and self-employment tax
- self_employment_tax: annual self-employment tax liability, excluding employee payroll taxes and Additional Medicare Tax
- state_income_tax_before_refundable_credits: state individual income tax after nonrefundable credits and before refundable credits, excluding local income and payroll taxes
- state_refundable_credits: total refundable state individual income tax credits
- local_income_tax: annual local income, wage, and earnings tax liability in the separate local-income-tax output: NYC income tax, Philadelphia wage tax, Kansas City earnings tax, and St. Louis earnings tax where applicable
- snap: annual SNAP (food stamps) benefit amount
- ssi: annual Supplemental Security Income (SSI) amount
- tanf: annual Temporary Assistance for Needy Families (TANF) benefit amount
- head_wic_eligible: whether Head is eligible for WIC (1 if yes, 0 if no)
- spouse_wic_eligible: whether Spouse is eligible for WIC (1 if yes, 0 if no)
- head_medicaid_eligible: whether Head is eligible for Medicaid under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- spouse_medicaid_eligible: whether Spouse is eligible for Medicaid under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- head_chip_eligible: whether Head is eligible for CHIP under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- spouse_chip_eligible: whether Spouse is eligible for CHIP under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- head_medicare_eligible: whether Head is eligible for Medicare (1 if yes, 0 if no)
- spouse_medicare_eligible: whether Spouse is eligible for Medicare (1 if yes, 0 if no)
- free_school_meals_eligible: whether PolicyEngine returns positive annual free school meal support for the household (1 if yes, 0 if no; reduced-price meals do not count as 1)
- reduced_price_school_meals_eligible: whether PolicyEngine returns positive annual reduced-price school meal support for the household (1 if yes, 0 if no; free meals do not count as 1)

Use the `submit_outputs` function exactly once. Return an `outputs` object with every requested quantity keyed by variable name. Each requested key must map to an object with a numeric `value` and a non-empty, specific, concise `explanation`. Each explanation must support the numeric value submitted for the same variable in `outputs`. If an explanation mentions a final amount, that amount must match the corresponding `outputs` value. Do not write that you will use one value while submitting a different value. Do not include scratch work, abandoned calculations, or corrections. End each explanation with `value = X`, where X exactly matches the numeric `value` field. For 1/0 eligibility outputs, submit 1 only when the explanation says eligible or yes, and submit 0 only when it says not eligible or no. Use the exact variable names as keys inside `outputs`. Include every requested key exactly once in `outputs`, even if the value is 0. Put only numeric values in `value`, with no dollar signs, commas, or explanatory text. Do not rely on plain text for the final answers. If an answer is a currency amount, give the annual amount. If an answer is a rate, give a decimal (e.g. 0.25 for 25%).
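The response contract in the prompt above lends itself to a mechanical check. The sketch below is one way such a check could look, not the benchmark's actual validator: `validate_outputs` and its arguments are hypothetical names, the sample payload is invented for illustration, and the trailing `value = X` comparison is done numerically, which is slightly looser than the exact-match wording in the prompt.

```python
import re
from numbers import Real
from typing import Dict, Iterable, List


def validate_outputs(outputs: Dict[str, dict], requested: Iterable[str],
                     binary_keys: Iterable[str] = ()) -> List[str]:
    """Return a list of contract problems; an empty list means the payload passes."""
    requested = list(requested)
    binary = set(binary_keys)
    problems = [f"{key}: unexpected key" for key in outputs if key not in requested]
    for key in requested:
        entry = outputs.get(key)
        if not isinstance(entry, dict):
            problems.append(f"{key}: missing")
            continue
        value = entry.get("value")
        explanation = entry.get("explanation")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{key}: value is not numeric")
            continue
        if not isinstance(explanation, str) or not explanation.strip():
            problems.append(f"{key}: explanation is empty")
            continue
        match = re.search(r"value = (\S+)$", explanation.strip())
        try:
            stated = float(match.group(1)) if match else None
        except ValueError:
            stated = None
        if stated is None or stated != float(value):
            problems.append(f"{key}: explanation does not end with a matching value")
        if key in binary and value not in (0, 1):
            problems.append(f"{key}: binary output must be 0 or 1")
    return problems


# Invented payload for two of the requested variables.
sample = {
    "snap": {"value": 0, "explanation": "Income is above the limit, not eligible; value = 0"},
    "payroll_tax": {"value": 3060, "explanation": "7.65% of $40,000 wages; value = 3060"},
}
sample_problems = validate_outputs(sample, ["snap", "payroll_tax"])
```

A payload that omits a requested key, submits a non-numeric value, or ends an explanation with a different number than the submitted `value` would come back with one entry per violation.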
Household #000
Failure modes

Where models still break

The hardest part of PolicyBench is not saying when a program is zero. It is getting the positive amount right for the households that actually qualify. The cards below split those cases apart so the benchmark is not flattered by easy zero-answer rows.

How to read these cards

These cards are stricter than the aggregate leaderboard because they separate positive cases from zero cases, but they use within-10% accuracy for dollar-valued programs so positive cases stay interpretable. The positive-amount rate is the harder and more informative number for benefits and refundable credits. For binary coverage flags, the cards compare positive-class and negative-class accuracy.
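The split between positive-amount and zero-amount cases can be illustrated with a short sketch. Everything here is an assumption for illustration: the function names are hypothetical, the one-unit tolerance on zero references is borrowed from the within-1% rule in the methodology, "underpredict share" is read as the share of positive-reference rows where the prediction falls below the reference, and references are assumed to be nonnegative.

```python
from typing import List, Optional, Tuple


def within_tolerance(pred: float, ref: float,
                     rel_tol: float = 0.10, zero_tol: float = 1.0) -> bool:
    """True when pred is within rel_tol of ref (or within zero_tol of a zero ref)."""
    if ref == 0:
        return abs(pred) <= zero_tol
    return abs(pred - ref) <= rel_tol * abs(ref)


def _rate(flags: List[bool]) -> Optional[float]:
    """Share of True flags, or None when there are no rows in the slice."""
    return sum(flags) / len(flags) if flags else None


def split_accuracy(rows: List[Tuple[float, float]], rel_tol: float = 0.10) -> dict:
    """rows holds (prediction, reference) pairs for one dollar-valued program."""
    hits = [within_tolerance(p, r, rel_tol) for p, r in rows]
    positive = [(p, r) for p, r in rows if r != 0]
    zero = [(p, r) for p, r in rows if r == 0]
    return {
        "overall": _rate(hits),
        "positive": _rate([within_tolerance(p, r, rel_tol) for p, r in positive]),
        "zero": _rate([within_tolerance(p, r, rel_tol) for p, r in zero]),
        "underpredict_share": _rate([p < r for p, r in positive]),
    }


# Invented predictions against two zero references and a $6,836 reference.
card = split_accuracy([(0.0, 0.0), (0.0, 6836.0), (6500.0, 6836.0), (0.0, 0.0)])
```

With these four rows the overall rate is 75% even though only half of the positive cases are hits, which is the flattering effect of easy zero rows that the cards are meant to expose.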

Dollar targets

| Program | Overall | Positive-amount cases | Zero-amount cases | With children | Low income | High income | Underpredict share on positives |
|---|---|---|---|---|---|---|---|
| State tax before refundable credits | 54.7% | 27.2% | 94.2% | 47.0% | 69.5% | 55.4% | 51.0% |
| Federal tax before refundable credits | 55.0% | 30.3% | 93.7% | 59.2% | 88.8% | 36.2% | 41.2% |
| SNAP | 76.6% | 15.4% | 97.0% | 68.3% | 30.5% | 100.0% | 79.7% |
| Federal refundable credits | 79.7% | 24.2% | 94.4% | 47.6% | 61.8% | 94.9% | 78.0% |
| State refundable credits | 81.1% | 4.9% | 99.0% | 74.0% | 62.4% | 100.0% | 91.5% |
| Payroll tax | 85.4% | 82.2% | 95.5% | 84.9% | 92.0% | 75.6% | 27.7% |

Household booleans

| Program | Overall | Positive households | Negative households | With children | Low income | High income |
|---|---|---|---|---|---|---|
| Person-level Medicaid eligibility | 89.6% | 74.9% | 95.4% | 88.7% | 80.9% | 96.4% |
| Person-level Head Start eligibility | 92.3% | 84.6% | 92.5% | 92.3% | 93.4% | 97.7% |
| Person-level CHIP eligibility | 94.1% | n/a | 94.1% | 86.8% | 89.7% | 99.5% |
| Person-level Early Head Start eligibility | 94.9% | 88.5% | 95.5% | 94.9% | 96.7% | 93.8% |
What the error reads show

These expanders summarize recurring miss patterns from direct reads of model answers and explanations. They sit here with failure modes because they describe why the low-scoring program slices break.

State tax before refundable credits

This output usually fails when models import rough federal or flat-rate logic into state-specific tax bases.

Avg 70%
Common misses
  • In small-liability cases like scenarios_055 and _060, several models overshoot by a wide margin relative to the reference value.
  • In large cases like scenario_042, the main failure is still the wrong state tax base rather than the final credit step.
Federal tax before refundable credits

This target isolates federal income tax after nonrefundable credits but before refundable credits.

Avg 75%
Common misses
  • It subtracts nonrefundable credits actually used, such as CDCC and the nonrefundable part of CTC when applicable.
  • It leaves EITC and refundable credit portions, such as refundable CTC, for the refundable-credits output.
SNAP

Positive SNAP cases are the main miss; many models zero them out using raw asset or net-worth heuristics.

Avg 80%
Common misses
  • In scenarios_035 and _047, several models return $0 on households with reference values above $11,000.
  • In scenario_092, Gemini Pro cites SNAP asset limits and returns $0 even though the visible prompt inputs do not support that denial.
Federal refundable credits

This target captures the refundable federal credit side of the income-tax calculation.

Avg 86%
Common misses
  • It includes EITC and refundable portions of credits such as refundable CTC when applicable.
  • It keeps refundable income-tax credits separate from the nonrefundable-credit target.
State refundable credits

Most rows are easy zeros; the informative misses are the few positive state credits that models leave at zero.

Avg 83%
Common misses
  • In Colorado scenario_090, several models return $0 against a $6,836 reference value.
  • When models do predict a positive state credit, they often derive it from a rough federal-credit ratio instead of the state program itself.
Payroll tax

This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.

Avg 94%
Common misses
Person-level Medicaid eligibility

Models often overuse Medicare enrollment or visible assets as disqualifiers and miss non-wage eligibility pathways.

Avg 90%
Common misses
  • In scenarios_054 and _076, many models return 0 when the reference flag is 1.
  • The errors are not just arithmetic. They reflect the wrong eligibility pathway being chosen from the household facts.
Person-level Head Start eligibility

This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.

Avg 92%
Common misses
Person-level CHIP eligibility

Models often overuse Medicare enrollment or visible assets as disqualifiers and miss non-wage eligibility pathways.

Avg 94%
Common misses
  • In scenarios_054 and _076, many models return 0 when the reference flag is 1.
  • The errors are not just arithmetic. They reflect the wrong eligibility pathway being chosen from the household facts.
Person-level Early Head Start eligibility

This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.

Avg 95%
Common misses
Methodology

How the United States benchmark works

PolicyBench measures a no-tools task: how well frontier models can estimate person- and household-level tax and benefit outputs from the prompt alone while following a structured response contract. This app shows the current no-tools US benchmark on a fixed test set, with PolicyEngine reference outputs computed by PolicyEngine-US for tax year 2026.

100 Enhanced CPS households
18 Scored variables
2,088 Model-output targets
13 Frontier models
Task
Each model sees the same household description and must return all scored outputs plus a short explanation for each output in one response, with no tool use. The exact provider-specific prompts are visible in the scenario explorer, so you can inspect the contract instead of inferring it.
Open-set status
The public scenario explorer exposes prompts and PolicyEngine reference outputs, so future model releases or fine-tunes could learn from the released cases. Treat this leaderboard as a public preview; protected held-out claims would require a separate rotating evaluation set.
Households
The US benchmark samples households from the Enhanced CPS with a fixed seed. The current set is restricted to households with a single federal tax unit, a single family, and a single benefit-calculation unit. Adult dependents remain in scope when they satisfy those restrictions. Ages, roles, income sources, and other nonzero promptable inputs are carried through into both the prompt and the PolicyEngine-US input; filing status is inferred from household structure.
Reference outputs
PolicyEngine-US computes the PolicyEngine reference output for every household-variable pair in tax year 2026. The displayed variables define the benchmark scope for this snapshot.
Output selection
The benchmark includes direct tax, credit, benefit, health-support, and coverage outputs that can plausibly be estimated from household facts. It excludes intermediate tax bases, payroll subcomponents, and outputs that mainly require unavailable history, restricted local market data, or program take-up assignment. WIC is scored as person-level eligibility, not as a dollar amount. Local income tax is retained as a displayed requested output, but currently receives zero default population-impact weight because the full Enhanced CPS source has no positive modeled local-income-tax records. The source run also requested the ACA Premium Tax Credit, but explanation audits showed the prompt could be misleading when households lacked plan-specific Marketplace information, so it is preserved in raw responses and excluded from the scored leaderboard.
Scoring and weighting
The public leaderboard ranks models by the within-1% hit rate using population household-impact weights. For each household-output row, the within-1% indicator is 1 when a currency answer is within 1% of the PolicyEngine reference value, with a one-currency-unit tolerance when the reference is zero. Binary eligibility flags are requested as integer 0/1 outputs and require exact 0/1 matching. The secondary bounded score uses max(0, 1 − |pred − ref| / |ref|) when the reference is nonzero and exact zero matches when the reference is zero for amount outputs, and the same exact 0/1 rule for binary outputs.
Each full source household's per-output share is |ref| / max(|household_net_income|, Σ |ref|), a value in [0, 1] that's strictly less than one when net income dominates the gross tax-benefit flow and equals one only when programs cancel each other out. Those shares are averaged using calibrated household weights in the full weighting population, then renormalized so the output weights sum to one. US weights use the full Enhanced CPS; UK weights use the full enhanced FRS. This weighting source is separate from the UK benchmark scenarios, which use the public calibrated transfer dataset. The weights are then applied to the fixed benchmark households and renormalized within each household over requested outputs. Person-level eligibility flags like Medicaid carry weight through PolicyEngine's paired per-capita value (e.g. medicaid_value), so the LLM is graded only on the boolean call itself.
Missing or unparseable answers count as misses through the coverage multiplier. The leaderboard reports within-1% as the headline, exact match as the deployability bar, and bounded score, amount accuracy, and participation accuracy as diagnostic companions. Equal-weight and budget-weighted variants are reported alongside for transparency. The leaderboard is a point estimate on this fixed test set.
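The two row-level metrics described above can be written out as a short sketch. This is not the benchmark's scoring code: the function names are hypothetical, a missing or unparseable answer is modeled simply as `None` scoring zero (a simplification of the coverage multiplier), and the per-household weights in the example are made up.

```python
from typing import Callable, Dict, Optional, Tuple

Row = Tuple[Optional[float], float, bool]  # (prediction, reference, is_binary)


def within_1pct_hit(pred: Optional[float], ref: float, is_binary: bool = False) -> float:
    """1.0 when the answer counts as a within-1% hit, else 0.0."""
    if pred is None:
        return 0.0
    if is_binary:
        return 1.0 if pred == ref else 0.0
    if ref == 0:
        # One-currency-unit tolerance when the reference is zero.
        return 1.0 if abs(pred) <= 1.0 else 0.0
    return 1.0 if abs(pred - ref) <= 0.01 * abs(ref) else 0.0


def bounded_score(pred: Optional[float], ref: float, is_binary: bool = False) -> float:
    """max(0, 1 - |pred - ref| / |ref|) for nonzero amounts, exact match otherwise."""
    if pred is None:
        return 0.0
    if is_binary or ref == 0:
        return 1.0 if pred == ref else 0.0
    return max(0.0, 1.0 - abs(pred - ref) / abs(ref))


def household_score(rows: Dict[str, Row], weights: Dict[str, float],
                    metric: Callable[..., float]) -> float:
    """Weighted average of a row-level metric, renormalized over the rows present."""
    total = sum(weights[name] for name in rows)
    return sum(weights[name] * metric(*rows[name]) for name in rows) / total


# A $4,500 answer against a $4,455 reference misses the 1% band by under a dollar.
example_rows = {"federal_tax": (4500.0, 4455.0, False), "snap": (0.0, 0.0, False)}
example_weights = {"federal_tax": 0.75, "snap": 0.25}
headline = household_score(example_rows, example_weights, within_1pct_hit)
diagnostic = household_score(example_rows, example_weights, bounded_score)
```

In this example the within-1% headline is 0.25 because only the zero SNAP row is a hit, while the bounded score stays close to 1 because the federal tax answer is off by about 1%. That gap is why the two metrics are reported side by side.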
Sensitivity checks
The manuscript reports alternative ranking views for equal-output groups, amount-only outputs, binary coverage, positive-reference cases, zero-reference cases, and country-only results. In the equal-output-group view, person-level outputs are grouped by program before the country average. These checks are used to interpret rank stability; they do not replace the public within-1% leaderboard.
Impact weighting
Binary coverage flags have 0/1 labels, but a 0/1 label is not their economic impact. Their leaderboard weights therefore come from PolicyEngine value proxies where available, such as estimated health coverage or nutrition-program value, rather than from the binary label itself. When every reference impact in a household is zero, the household falls back to equal output weights.
Current benchmark scope
The latest United States run in this app evaluates GPT-5.5, Grok 4.20, Gemini 3.1 Pro Preview, Claude Sonnet 4.6, Gemini 3 Flash Preview, Claude Opus 4.7, Grok 4.3, Gemini 3.1 Flash Lite Preview, Gemini 3.5 Flash, GPT-5.4 mini, Claude Haiku 4.5, Grok 4.1 Fast, and GPT-5.4 nano on 2,088 scored outputs.
Fixed test set, no tools, US tax year 2026 / UK fiscal year 2026-27
| Output | Category |
|---|---|
| Federal refundable credits | Federal tax |
| Federal tax before refundable credits | Federal tax |
| Free school meals eligibility | Coverage |
| Local income tax | Local tax |
| Payroll tax | Payroll tax |
| Person-level CHIP eligibility | Coverage |
| Person-level Early Head Start eligibility | Coverage |
| Person-level Head Start eligibility | Coverage |
| Person-level Medicaid eligibility | Coverage |
| Person-level Medicare eligibility | Coverage |
| Person-level WIC eligibility | Coverage |
| Reduced-price school meals eligibility | Coverage |
| Self-employment tax | Payroll tax |
| SNAP | Benefits |
| SSI | Benefits |
| State refundable credits | State tax |
| State tax before refundable credits | State tax |
| TANF | Benefits |