PolicyBench, by PolicyEngine

Leaderboard

Model rankings

Rank	Model	Within 1%
1	GPT-5.5	79.4%
2	Grok 4.20	77.6%
3	Gemini 3.1 Pro Preview	76.9%
4	Claude Sonnet 4.6	76.8%
5	Gemini 3 Flash Preview	76.2%
6	Claude Opus 4.7	75.8%
7	Grok 4.3	75.3%
8	Gemini 3.1 Flash Lite Preview	74.2%
9	Gemini 3.5 Flash	74.1%
10	GPT-5.4 mini	72.5%
11	Claude Haiku 4.5	70.9%
12	Grok 4.1 Fast	70.9%
13	GPT-5.4 nano	60.9%

Options: scoring: within 1% · cases: all cases · weighting: household · all 18 programs

Per-variable weights: 18 variables, sorted by Household

Variable	Household	Aggregate	Equal
Person-level Medicaid eligibility	30.29%	7.41%	5.56%
Federal tax before refundable credits	16.70%	61.48%	5.56%
Payroll tax	14.75%	9.28%	5.56%
Person-level Medicare eligibility	11.63%	3.96%	5.56%
SNAP	7.18%	1.13%	5.56%
State tax before refundable credits	5.42%	11.06%	5.56%
Federal refundable credits	4.44%	1.34%	5.56%
Self-employment tax	2.57%	1.91%	5.56%
SSI	2.14%	0.79%	5.56%
State refundable credits	1.23%	0.26%	5.56%
Free school meals eligibility	1.22%	0.42%	5.56%
Person-level Early Head Start eligibility	0.61%	0.30%	5.56%
Person-level WIC eligibility	0.48%	0.15%	5.56%
Person-level Head Start eligibility	0.44%	0.21%	5.56%
TANF	0.43%	0.11%	5.56%
Person-level CHIP eligibility	0.34%	0.14%	5.56%
Reduced-price school meals eligibility	0.13%	0.04%	5.56%
Local income tax	0.00%	0.00%	5.56%

By program

Program breakdown

Bounded score by program and model (AI alone, without tools). Dollar targets use continuous relative-error partial credit; binary coverage flags use exact accuracy.

Program filter: All 18 programs

Program	Claude Opus 4.7	Claude Sonnet 4.6	Claude Haiku 4.5	Grok 4.3	Grok 4.20	Grok 4.1 Fast	GPT-5.5	GPT-5.4 mini	GPT-5.4 nano	Gemini 3.1 Pro Preview	Gemini 3.5 Flash	Gemini 3 Flash Preview	Gemini 3.1 Flash Lite Preview	Avg
Local income tax	100%	100%	95%	100%	100%	100%	100%	100%	100%	100%	99%	100%	96%	99%
SSI	99%	100%	98%	98%	99%	97%	100%	98%	98%	99%	99%	99%	99%	99%
Self-employment tax	99%	99%	97%	98%	99%	99%	100%	98%	95%	99%	99%	98%	97%	98%
TANF	97%	99%	98%	97%	98%	98%	98%	97%	98%	98%	98%	97%	97%	98%
Reduced-price school meals eligibility	99%	99%	93%	100%	99%	97%	97%	98%	97%	98%	97%	98%	98%	98%
Free school meals eligibility	99%	99%	92%	97%	99%	97%	100%	96%	92%	100%	99%	100%	98%	98%
Person-level WIC eligibility	96%	99%	94%	99%	99%	92%	100%	92%	97%	100%	100%	100%	97%	97%
Person-level Medicare eligibility	96%	98%	95%	99%	99%	97%	96%	95%	92%	96%	97%	97%	99%	96%
Person-level Early Head Start eligibility	93%	95%	90%	95%	98%	88%	95%	93%	95%	98%	100%	100%	93%	95%
Person-level CHIP eligibility	88%	87%	89%	95%	98%	92%	95%	90%	99%	99%	98%	98%	100%	94%
Payroll tax	98%	98%	92%	98%	98%	91%	98%	89%	69%	97%	96%	97%	98%	94%
Person-level Head Start eligibility	93%	95%	88%	90%	95%	90%	95%	81%	93%	100%	95%	93%	90%	92%
Person-level Medicaid eligibility	90%	94%	88%	93%	96%	87%	96%	85%	70%	94%	93%	95%	86%	90%
Federal refundable credits	93%	93%	76%	81%	95%	72%	97%	77%	79%	93%	87%	85%	84%	86%
State refundable credits	80%	83%	81%	81%	86%	81%	86%	81%	81%	88%	85%	83%	82%	83%
SNAP	82%	82%	73%	78%	86%	77%	85%	80%	75%	85%	83%	81%	80%	80%
Federal tax before refundable credits	88%	83%	66%	72%	82%	61%	90%	60%	60%	83%	78%	77%	71%	75%
State tax before refundable credits	75%	80%	56%	72%	78%	57%	83%	54%	48%	78%	77%	74%	70%	70%

Color scale: under 50%, 50-59%, 60-69%, 70-79%, 80-89%, 90%+

Deep dive

Scenario explorer

Inspect benchmark households, reference outputs, model answers, and the exact prompt sent to every model.

Program	Reference	Opus 4.7	Grok 4.3	GPT-5.5	Pro Preview
Federal tax before refundable credits	$4,455
Federal refundable credits	$0
Free school meals eligibility	No
Head CHIP eligibility	No
Head Medicaid eligibility	No
Head Medicare eligibility	No
Head WIC eligibility	No
Local income tax	$0
Payroll tax	$3,060
Reduced-price school meals eligibility	No
Self-employment tax	$6,358
SNAP	$0
Spouse CHIP eligibility	No
Spouse Medicaid eligibility	No
Spouse Medicare eligibility	No
Spouse WIC eligibility	No
SSI	$0
State tax before refundable credits	$0
State refundable credits	$0
TANF	$0
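Two of the reference values above can be checked by hand from the household facts listed in the exact prompt, which gives a feel for what the task asks of a model. The sketch below assumes the standard statutory rates (6.2% employee Social Security, 1.45% employee Medicare, and a 15.3% self-employment rate applied to 92.35% of net self-employment earnings) and assumes neither earner reaches the Social Security wage base or the Additional Medicare Tax threshold. It is a sanity check, not PolicyEngine-US code.

```python
# Hand check of two reference values for the Texas household above.
# Rates are standard statutory rates; this is not PolicyEngine-US code.

EMPLOYEE_SS_RATE = 0.062         # employee Social Security
EMPLOYEE_MEDICARE_RATE = 0.0145  # employee Medicare
SE_NET_EARNINGS_FACTOR = 0.9235  # share of SE income subject to SE tax
SE_TAX_RATE = 0.153              # 12.4% Social Security + 2.9% Medicare

spouse_wages = 40_000                 # "gross wages and salaries" for Spouse
head_self_employment_income = 45_000  # "self-employment income" for Head

# Employee-side payroll tax on the spouse's wages only.
payroll_tax = spouse_wages * (EMPLOYEE_SS_RATE + EMPLOYEE_MEDICARE_RATE)

# Self-employment tax on the head's self-employment income.
self_employment_tax = (
    head_self_employment_income * SE_NET_EARNINGS_FACTOR * SE_TAX_RATE
)

print(round(payroll_tax))          # 3060
print(round(self_employment_tax))  # 6358
```

Both results match the reference column ($3,060 and $6,358). The other dollar outputs depend on deductions, credits, and program rules that do not reduce to a single rate.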

Explanation and audit coverage

260 of 260 model-output rows for this household include explanation text returned by the model. 24 rows include developer audit notes for incorrect predictions, and 24 incorrect rows include case-level notes comparing wrong models on the same household-output target.

LLM: 24

Exact prompt

Full household batch contract for all benchmark outputs

Provider-specific structured-output transport, no external tool

Estimate the requested tax and benefit outputs using only the household facts below. All listed people live together and are in one household group for tax and benefit calculations. All listed facts describe the full tax-benefit year. Treat demographic, work, student, disability, housing, health coverage, and household-composition facts as constant throughout the tax-benefit year, with no within-year income volatility or status changes. Gross wage and salary amounts are annual totals, including any overtime pay; hourly wage is a straight-time rate when listed. Treat any unlisted numeric input as 0 and any other unlisted household fact, boolean, or status input as false. Assume tax filing and program take-up when required. Do not infer unlisted income, expenses, assets, benefit receipt, rent, or health coverage.

Household:
- state: TX
- tax year: 2026

Head:
- age: 63
- has employer-sponsored insurance
- usual weekly hours worked: 45
- other medical expenses: $300
- over-the-counter health expenses: $20
- real estate taxes: $9,500
- self-employment income: $45,000

Spouse:
- age: 61
- gross wages and salaries: $40,000
- bank account assets: $400
- employer sponsored insurance premiums: $18,208
- has employer-sponsored insurance
- health insurance premiums excluding Medicare Part B: $3,000
- hourly wage: $19
- usual weekly hours worked: 40
- other health insurance premiums: $3,000
- other medical expenses: $6,000
- over-the-counter health expenses: $1,200

Household inputs:
- household vehicles value: $40,500

Provide the following policy quantities for this household:
- federal_income_tax_before_refundable_credits: federal individual income tax after nonrefundable credits and before refundable credits. This subtracts nonrefundable credits actually used, including CDCC and the nonrefundable portion of CTC or other credits when applicable; it does not subtract EITC or refundable portions of credits such as refundable CTC
- federal_refundable_credits: total refundable federal income tax credits, including EITC and refundable portions of credits such as refundable CTC when applicable; exclude the ACA Premium Tax Credit
- payroll_tax: annual household employee-side payroll tax: employee Social Security tax, employee Medicare tax, Additional Medicare Tax, and mandatory employee state payroll taxes. Exclude employer payroll taxes, FUTA, employer unemployment-insurance taxes, and self-employment tax
- self_employment_tax: annual self-employment tax liability, excluding employee payroll taxes and Additional Medicare Tax
- state_income_tax_before_refundable_credits: state individual income tax after nonrefundable credits and before refundable credits, excluding local income and payroll taxes
- state_refundable_credits: total refundable state individual income tax credits
- local_income_tax: annual local income, wage, and earnings tax liability in the separate local-income-tax output: NYC income tax, Philadelphia wage tax, Kansas City earnings tax, and St. Louis earnings tax where applicable
- snap: annual SNAP (food stamps) benefit amount
- ssi: annual Supplemental Security Income (SSI) amount
- tanf: annual Temporary Assistance for Needy Families (TANF) benefit amount
- head_wic_eligible: whether Head is eligible for WIC (1 if yes, 0 if no)
- spouse_wic_eligible: whether Spouse is eligible for WIC (1 if yes, 0 if no)
- head_medicaid_eligible: whether Head is eligible for Medicaid under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- spouse_medicaid_eligible: whether Spouse is eligible for Medicaid under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- head_chip_eligible: whether Head is eligible for CHIP under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- spouse_chip_eligible: whether Spouse is eligible for CHIP under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- head_medicare_eligible: whether Head is eligible for Medicare (1 if yes, 0 if no)
- spouse_medicare_eligible: whether Spouse is eligible for Medicare (1 if yes, 0 if no)
- free_school_meals_eligible: whether PolicyEngine returns positive annual free school meal support for the household (1 if yes, 0 if no; reduced-price meals do not count as 1)
- reduced_price_school_meals_eligible: whether PolicyEngine returns positive annual reduced-price school meal support for the household (1 if yes, 0 if no; free meals do not count as 1)

Use the `submit_outputs` function exactly once. Return an `outputs` object with every requested quantity keyed by variable name. Each requested key must map to an object with a numeric `value` and a non-empty, specific, concise `explanation`. Each explanation must support the numeric value submitted for the same variable in `outputs`. If an explanation mentions a final amount, that amount must match the corresponding `outputs` value. Do not write that you will use one value while submitting a different value. Do not include scratch work, abandoned calculations, or corrections. End each explanation with `value = X`, where X exactly matches the numeric `value` field. For 1/0 eligibility outputs, submit 1 only when the explanation says eligible or yes, and submit 0 only when it says not eligible or no. Use the exact variable names as keys inside `outputs`. Include every requested key exactly once in `outputs`, even if the value is 0. Put only numeric values in `value`, with no dollar signs, commas, or explanatory text. Do not rely on plain text for the final answers. If an answer is a currency amount, give the annual amount. If an answer is a rate, give a decimal (e.g. 0.25 for 25%).
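The response contract in the last paragraph of the prompt can be read as a set of mechanical checks. The sketch below is one way to express them; `validate_outputs` is a hypothetical helper written for illustration, not the benchmark's own parser, and it covers only the key, numeric-value, non-empty-explanation, and trailing `value = X` rules.

```python
import math
import re

# Matches a trailing "value = X" where X is a plain number (assumed format).
TRAILING_VALUE = re.compile(r"value = (-?\d+(?:\.\d+)?)\s*$")

def validate_outputs(outputs: dict, requested: list[str]) -> list[str]:
    """Return a list of contract violations (empty when the payload passes)."""
    problems = []
    for name in requested:
        if name not in outputs:
            problems.append(f"{name}: missing")
            continue
        entry = outputs[name]
        value = entry.get("value")
        explanation = entry.get("explanation", "")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: value is not numeric")
            continue
        if not isinstance(explanation, str) or not explanation.strip():
            problems.append(f"{name}: empty explanation")
            continue
        match = TRAILING_VALUE.search(explanation)
        if match is None or not math.isclose(float(match.group(1)), value):
            problems.append(f"{name}: explanation does not end with matching 'value = X'")
    extra = set(outputs) - set(requested)
    problems.extend(f"{key}: not requested" for key in sorted(extra))
    return problems

# Illustrative payload: the second entry breaks the trailing-value rule.
payload = {
    "payroll_tax": {"value": 3060, "explanation": "7.65% of $40,000 wages. value = 3060"},
    "snap": {"value": 0, "explanation": "Income is above the limit."},
}
print(validate_outputs(payload, ["payroll_tax", "snap", "ssi"]))
```

Here the `snap` entry fails because its explanation does not end with `value = 0`, and `ssi` fails because the key is absent.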

Failure modes

Where models still break

The hardest part of PolicyBench is not saying when a program is zero. It is getting the positive amount right for the households that actually qualify. The cards below split those cases apart so the benchmark is not flattered by easy zero-answer rows.

How to read these cards

These cards are intentionally stricter than the aggregate leaderboard, but they still use within-10% accuracy for dollar-valued programs so that positive cases stay interpretable. The positive-amount-cases figure is the harder and more informative number for benefits and refundable credits. For binary coverage flags, the cards compare positive-class and negative-class accuracy.

Dollar target

State tax before refundable credits

Overall: 54.7%
Positive-amount cases: 27.2%
Zero-amount cases: 94.2%
With children: 47.0%
Low income: 69.5%
High income: 55.4%
Underpredict share on positives: 51.0%

Dollar target

Federal tax before refundable credits

Overall: 55.0%
Positive-amount cases: 30.3%
Zero-amount cases: 93.7%
With children: 59.2%
Low income: 88.8%
High income: 36.2%
Underpredict share on positives: 41.2%

Dollar target

SNAP

Overall: 76.6%
Positive-amount cases: 15.4%
Zero-amount cases: 97.0%
With children: 68.3%
Low income: 30.5%
High income: 100.0%
Underpredict share on positives: 79.7%

Dollar target

Federal refundable credits

Overall: 79.7%
Positive-amount cases: 24.2%
Zero-amount cases: 94.4%
With children: 47.6%
Low income: 61.8%
High income: 94.9%
Underpredict share on positives: 78.0%

Dollar target

State refundable credits

Overall: 81.1%
Positive-amount cases: 4.9%
Zero-amount cases: 99.0%
With children: 74.0%
Low income: 62.4%
High income: 100.0%
Underpredict share on positives: 91.5%

Dollar target

Payroll tax

Overall: 85.4%
Positive-amount cases: 82.2%
Zero-amount cases: 95.5%
With children: 84.9%
Low income: 92.0%
High income: 75.6%
Underpredict share on positives: 27.7%

Household boolean

Person-level Medicaid eligibility

Overall: 89.6%
Positive households: 74.9%
Negative households: 95.4%
With children: 88.7%
Low income: 80.9%
High income: 96.4%

Household boolean

Person-level Head Start eligibility

Overall: 92.3%
Positive households: 84.6%
Negative households: 92.5%
With children: 92.3%
Low income: 93.4%
High income: 97.7%

Household boolean

Person-level CHIP eligibility

Overall: 94.1%
Positive households: n/a
Negative households: 94.1%
With children: 86.8%
Low income: 89.7%
High income: 99.5%

Household boolean

Person-level Early Head Start eligibility

Overall: 94.9%
Positive households: 88.5%
Negative households: 95.5%
With children: 94.9%
Low income: 96.7%
High income: 93.8%
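The split reported on the dollar-target cards above can be sketched as a small calculation over (prediction, reference) pairs. Two points are assumptions rather than documented behavior: the underpredict share is taken to be the fraction of positive-reference rows where the prediction falls below the reference, and zero-reference rows are assumed to keep a one-currency-unit tolerance. The function names and sample rows are illustrative only.

```python
def within(pred, ref, rel_tol=0.10):
    """Hit test: relative tolerance, or one currency unit when ref is zero (assumed)."""
    if ref == 0:
        return abs(pred) <= 1.0
    return abs(pred - ref) <= rel_tol * abs(ref)

def rate(pairs):
    """Share of (pred, ref) pairs that hit, or None for an empty slice."""
    if not pairs:
        return None
    return sum(within(p, r) for p, r in pairs) / len(pairs)

def card_metrics(rows):
    """Split one program's rows into positive-reference and zero-reference slices."""
    positives = [(p, r) for p, r in rows if r > 0]
    zeros = [(p, r) for p, r in rows if r == 0]
    under = sum(p < r for p, r in positives) / len(positives) if positives else None
    return {
        "overall": rate(rows),
        "positive_amount_cases": rate(positives),
        "zero_amount_cases": rate(zeros),
        "underpredict_share_on_positives": under,
    }

# Made-up rows: two easy zeros, one positive zeroed out, one close, one overshoot.
rows = [(0, 0), (0, 0), (0, 5000), (4800, 5000), (9000, 6000)]
metrics = card_metrics(rows)
print(metrics)
```

With these rows the overall rate is 60% while the positive-amount rate is one in three, the same shape as the SNAP and state-refundable-credit cards: easy zeros lift the overall number. An empty slice returns `None`, which corresponds to an "n/a" entry.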

What the error reads show

These expanders summarize recurring miss patterns from direct reads of model answers and explanations. They sit here with failure modes because they describe why the low-scoring program slices break.

State tax before refundable credits

This output usually fails when models import rough federal or flat-rate logic into state-specific tax bases.

Avg 70%

Common misses

In small-liability cases like scenarios_055 and _060, several models overshoot by a wide margin relative to the reference value.
In large cases like scenario_042, the main failure is still the wrong state tax base rather than the final credit step.

Federal tax before refundable credits

This target isolates federal income tax after nonrefundable credits but before refundable credits.

Avg 75%

Common misses

It subtracts nonrefundable credits actually used, such as CDCC and the nonrefundable part of CTC when applicable.
It leaves EITC and refundable credit portions, such as refundable CTC, for the refundable-credits output.

SNAP

Positive SNAP cases are the main miss; many models zero them out using raw asset or net-worth heuristics.

Avg 80%

Common misses

In scenarios_035 and _047, several models return $0 on households with reference values above $11,000.
In scenario_092, Gemini Pro cites SNAP asset limits and returns $0 even though the visible prompt inputs do not support that denial.

Federal refundable credits

This target captures the refundable federal credit side of the income-tax calculation.

Avg 86%

Common misses

It includes EITC and refundable portions of credits such as refundable CTC when applicable.
It keeps refundable income-tax credits separate from the nonrefundable-credit target.

State refundable credits

Most rows are easy zeros; the informative misses are the few positive state credits that models leave at zero.

Avg 83%

Common misses

In Colorado scenario_090, several models return $0 against a $6,836 reference value.
When models do predict a positive state credit, they often derive it from a rough federal-credit ratio instead of the state program itself.

Payroll tax

This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.

Avg 94%

Person-level Medicaid eligibility

Models often overuse Medicare enrollment or visible assets as disqualifiers and miss non-wage eligibility pathways.

Avg 90%

Common misses

In scenarios_054 and _076, many models return 0 when the reference flag is 1.
The errors are not just arithmetic. They reflect the wrong eligibility pathway being chosen from the household facts.

Person-level Head Start eligibility

This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.

Avg 92%

Person-level CHIP eligibility

Models often overuse Medicare enrollment or visible assets as disqualifiers and miss non-wage eligibility pathways.

Avg 94%

Common misses

In scenarios_054 and _076, many models return 0 when the reference flag is 1.
The errors are not just arithmetic. They reflect the wrong eligibility pathway being chosen from the household facts.

Person-level Early Head Start eligibility

This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.

Avg 95%

Methodology

How the United States benchmark works

PolicyBench measures a no-tools task: how well frontier models can estimate person- and household-level tax and benefit outputs from the prompt alone while following a structured response contract. This app shows the current no-tools US benchmark on a fixed test set, with PolicyEngine reference outputs computed by PolicyEngine-US for tax year 2026.

100 Enhanced CPS households
18 scored variables
2,088 model-output targets
13 frontier models

Task

Each model sees the same household description and must return all scored outputs plus a short explanation for each output in one response, with no tool use. The exact provider-specific prompts are visible in the scenario explorer, so you can inspect the contract instead of inferring it.

Open-set status

The public scenario explorer exposes prompts and PolicyEngine reference outputs, so future model releases or fine-tunes could learn from the released cases. Treat this leaderboard as a public preview; protected held-out claims would require a separate rotating evaluation set.

Households

The US benchmark samples households from the Enhanced CPS with a fixed seed. The current set is restricted to households with a single federal tax unit, a single family, and a single benefit-calculation unit. Adult dependents remain in scope when they satisfy those restrictions. Ages, roles, income sources, and other nonzero promptable inputs are carried through into both the prompt and the PolicyEngine-US input; filing status is inferred from household structure.

Reference outputs

PolicyEngine-US computes the PolicyEngine reference output for every household-variable pair in tax year 2026. The displayed variables define the benchmark scope for this snapshot.

Output selection

The benchmark includes direct tax, credit, benefit, health-support, and coverage outputs that can plausibly be estimated from household facts. It excludes intermediate tax bases, payroll subcomponents, and outputs that mainly require unavailable history, restricted local market data, or program take-up assignment. WIC is scored as person-level eligibility, not as a dollar amount. Local income tax is retained as a displayed requested output, but currently receives zero default population-impact weight because the full Enhanced CPS source has no positive modeled local-income-tax records. The source run also requested the ACA Premium Tax Credit, but explanation audits showed the prompt could be misleading when households lacked plan-specific Marketplace information, so it is preserved in raw responses and excluded from the scored leaderboard.

Scoring and weighting

The public leaderboard ranks models by the within-1% hit rate using population household-impact weights. For each household-output row, the within-1% indicator is 1 when a currency answer is within 1% of the PolicyEngine reference value, with a one-currency-unit tolerance when the reference is zero. Binary eligibility flags are requested as integer 0/1 outputs and require exact 0/1 matching.

The secondary bounded score uses max(0, 1 − |pred − ref| / |ref|) when the reference is nonzero and exact zero matches when the reference is zero for amount outputs, and the same exact 0/1 rule for binary outputs.

Each full source household's per-output share is |ref| / max(|household_net_income|, Σ |ref|), a value in [0, 1] that's strictly less than one when net income dominates the gross tax-benefit flow and equals one only when programs cancel each other out. Those shares are averaged using calibrated household weights in the full weighting population, then renormalized so the output weights sum to one. US weights use the full Enhanced CPS; UK weights use the full enhanced FRS. This weighting source is separate from the UK benchmark scenarios, which use the public calibrated transfer dataset. The weights are then applied to the fixed benchmark households and renormalized within each household over requested outputs. Person-level eligibility flags like Medicaid carry weight through PolicyEngine's paired per-capita value (e.g. medicaid_value), so the LLM is graded only on the boolean call itself.

Missing or unparseable answers count as misses through the coverage multiplier. The leaderboard reports within-1% as the headline, exact match as the deployability bar, and bounded score, amount accuracy, and participation accuracy as diagnostic companions. Equal-weight and budget-weighted variants are reported alongside for transparency. The leaderboard is a point estimate on this fixed test set.
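The within-1% indicator and the bounded score described above translate directly into a few lines of code. The following is a minimal sketch of those two row-level rules as stated in this section; the function names are hypothetical, and treating the one-currency-unit tolerance as inclusive is an assumption.

```python
def within_tolerance(pred: float, ref: float, rel_tol: float = 0.01) -> int:
    """1 if pred is within rel_tol of ref; one-currency-unit tolerance when ref is zero."""
    if ref == 0:
        return int(abs(pred) <= 1.0)  # inclusive bound is an assumption
    return int(abs(pred - ref) <= rel_tol * abs(ref))

def bounded_score(pred: float, ref: float) -> float:
    """max(0, 1 - |pred - ref| / |ref|) for nonzero ref; exact zero match otherwise."""
    if ref == 0:
        return 1.0 if pred == 0 else 0.0
    return max(0.0, 1.0 - abs(pred - ref) / abs(ref))

def binary_match(pred: int, ref: int) -> int:
    """Exact 0/1 matching for eligibility flags."""
    return int(pred == ref)

# Reference of $4,455 (federal tax before refundable credits in the scenario explorer).
print(within_tolerance(4420, 4455))  # 1: off by $35, inside the $44.55 band
print(within_tolerance(4500, 4455))  # 0: off by $45, outside the band
print(bounded_score(10_000, 4455))   # 0.0: error exceeds the reference, score floors at zero
```

A prediction of $4,500 misses the headline metric but still earns a bounded score of about 0.99, which is why the two measures can rank models differently.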

Sensitivity checks

The manuscript reports alternative ranking views for equal-output groups, amount-only outputs, binary coverage, positive-reference cases, zero-reference cases, and country-only results. In the equal-output-group view, person-level outputs are grouped by program before the country average. These checks are used to interpret rank stability; they do not replace the public within-1% leaderboard.

Impact weighting

Binary coverage flags have 0/1 labels, but a 0/1 label is not their economic impact. Their leaderboard weights therefore come from PolicyEngine value proxies where available, such as estimated health coverage or nutrition-program value, rather than from the binary label itself. When every reference impact in a household is zero, the household falls back to equal output weights.
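The per-output share formula and the equal-weight fallback can be illustrated for a single household. This is a simplified sketch: it skips the population-level averaging with calibrated household weights and the value proxies used for binary flags, the function names are hypothetical, and the net income figure is invented for the example.

```python
def per_output_shares(refs: dict[str, float], household_net_income: float) -> dict[str, float]:
    """|ref| / max(|household_net_income|, sum of |ref|) for each output."""
    gross_flow = sum(abs(v) for v in refs.values())
    denominator = max(abs(household_net_income), gross_flow)
    if denominator == 0:
        return {name: 0.0 for name in refs}
    return {name: abs(v) / denominator for name, v in refs.items()}

def normalize_with_equal_fallback(weights: dict[str, float]) -> dict[str, float]:
    """Renormalize to sum to one; fall back to equal weights when all are zero."""
    total = sum(weights.values())
    if total == 0:
        return {name: 1.0 / len(weights) for name in weights}
    return {name: w / total for name, w in weights.items()}

# Dollar reference values from the scenario explorer; net income is hypothetical.
refs = {"federal_tax": 4455, "payroll_tax": 3060, "self_employment_tax": 6358, "snap": 0}
shares = per_output_shares(refs, household_net_income=70_000)
weights = normalize_with_equal_fallback(shares)
print(shares)
print(weights)

# A household where every reference impact is zero gets equal output weights.
print(normalize_with_equal_fallback({"snap": 0.0, "ssi": 0.0}))
```

Because net income exceeds the gross tax-benefit flow here, the shares sum to less than one before renormalization, matching the "strictly less than one" case in the scoring description.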

Current benchmark scope

The latest United States run in this app evaluates GPT-5.5, Grok 4.20, Gemini 3.1 Pro Preview, Claude Sonnet 4.6, Gemini 3 Flash Preview, Claude Opus 4.7, Grok 4.3, Gemini 3.1 Flash Lite Preview, Gemini 3.5 Flash, GPT-5.4 mini, Claude Haiku 4.5, Grok 4.1 Fast, and GPT-5.4 nano on 2,088 scored outputs.

Fixed test set, no tools, US tax year 2026 / UK fiscal year 2026-27

Variable	Category
Federal refundable credits	Federal tax
Federal tax before refundable credits	Federal tax
Free school meals eligibility	Coverage
Local income tax	Local tax
Payroll tax	Payroll tax
Person-level CHIP eligibility	Coverage
Person-level Early Head Start eligibility	Coverage
Person-level Head Start eligibility	Coverage
Person-level Medicaid eligibility	Coverage
Person-level Medicare eligibility	Coverage
Person-level WIC eligibility	Coverage
Reduced-price school meals eligibility	Coverage
Self-employment tax	Payroll tax
SNAP	Benefits
SSI	Benefits
State refundable credits	State tax
State tax before refundable credits	State tax
TANF	Benefits
