PolicyBench leaderboard
Model rankings
13 models, ranked by within-1% hit rate score (United States)
Options: scoring: within 1% · cases: all cases · weighting: household · all 18 programs
Filter restricts model scoring and the program breakdown table to the selected outputs. The scenario explorer remains unfiltered so each household's full prompt stays visible. Weights rescale to 100% over the active set.
Per-variable weights: 18 variables, sorted by Household
| Variable | Household | Aggregate | Equal |
|---|---|---|---|
| Person-level Medicaid eligibility | 30.29% | 7.41% | 5.56% |
| Federal tax before refundable credits | 16.70% | 61.48% | 5.56% |
| Payroll tax | 14.75% | 9.28% | 5.56% |
| Person-level Medicare eligibility | 11.63% | 3.96% | 5.56% |
| SNAP | 7.18% | 1.13% | 5.56% |
| State tax before refundable credits | 5.42% | 11.06% | 5.56% |
| Federal refundable credits | 4.44% | 1.34% | 5.56% |
| Self-employment tax | 2.57% | 1.91% | 5.56% |
| SSI | 2.14% | 0.79% | 5.56% |
| State refundable credits | 1.23% | 0.26% | 5.56% |
| Free school meals eligibility | 1.22% | 0.42% | 5.56% |
| Person-level Early Head Start eligibility | 0.61% | 0.30% | 5.56% |
| Person-level WIC eligibility | 0.48% | 0.15% | 5.56% |
| Person-level Head Start eligibility | 0.44% | 0.21% | 5.56% |
| TANF | 0.43% | 0.11% | 5.56% |
| Person-level CHIP eligibility | 0.34% | 0.14% | 5.56% |
| Reduced-price school meals eligibility | 0.13% | 0.04% | 5.56% |
| Local income tax | 0.00% | 0.00% | 5.56% |
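The rescaling rule ("weights rescale to 100% over the active set") can be illustrated with a minimal sketch. The function name `rescale_weights` is hypothetical and not part of PolicyBench; the numbers are a few Household-column weights copied from the table above.

```python
def rescale_weights(weights, active):
    """Renormalize the weights of the selected outputs so they sum to 1.

    Hypothetical helper for illustration; not the benchmark's own code.
    """
    selected = {name: weights[name] for name in active}
    total = sum(selected.values())
    if total == 0:
        raise ValueError("selected outputs carry no weight")
    return {name: w / total for name, w in selected.items()}


# Household-column weights (in percent) for four of the 18 variables.
household_weights = {
    "Payroll tax": 14.75,
    "SNAP": 7.18,
    "SSI": 2.14,
    "TANF": 0.43,
}

# Filter to three benefit programs and rescale them to 100%.
rescaled = rescale_weights(household_weights, ["SNAP", "SSI", "TANF"])
for name, share in rescaled.items():
    print(f"{name}: {share:.2%}")
# SNAP: 73.64%
# SSI: 21.95%
# TANF: 4.41%
```

Under this reading, a filter never changes the relative importance of the selected outputs; it only removes the unselected ones from the denominator.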
Program breakdown
Bounded score by program and model (AI alone, without tools). Dollar targets use continuous relative-error partial credit; binary coverage flags use exact accuracy.
Program filter: All 18 programs
Shared with model scoring. The scenario explorer remains unfiltered so each household's full prompt stays visible. The table shows only selected outputs; model scores rescale selected weights to 100%.
| Program | Claude Opus 4.7 | Claude Sonnet 4.6 | Claude Haiku 4.5 | Grok 4.3 | Grok 4.20 | Grok 4.1 Fast | GPT-5.5 | GPT-5.4 mini | GPT-5.4 nano | Gemini 3.1 Pro Preview | Gemini 3.5 Flash | Gemini 3 Flash Preview | Gemini 3.1 Flash Lite Preview | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Local income tax | 100% | 100% | 95% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 99% | 100% | 96% | 99% |
| SSI | 99% | 100% | 98% | 98% | 99% | 97% | 100% | 98% | 98% | 99% | 99% | 99% | 99% | 99% |
| Self-employment tax | 99% | 99% | 97% | 98% | 99% | 99% | 100% | 98% | 95% | 99% | 99% | 98% | 97% | 98% |
| TANF | 97% | 99% | 98% | 97% | 98% | 98% | 98% | 97% | 98% | 98% | 98% | 97% | 97% | 98% |
| Reduced-price school meals eligibility | 99% | 99% | 93% | 100% | 99% | 97% | 97% | 98% | 97% | 98% | 97% | 98% | 98% | 98% |
| Free school meals eligibility | 99% | 99% | 92% | 97% | 99% | 97% | 100% | 96% | 92% | 100% | 99% | 100% | 98% | 98% |
| Person-level WIC eligibility | 96% | 99% | 94% | 99% | 99% | 92% | 100% | 92% | 97% | 100% | 100% | 100% | 97% | 97% |
| Person-level Medicare eligibility | 96% | 98% | 95% | 99% | 99% | 97% | 96% | 95% | 92% | 96% | 97% | 97% | 99% | 96% |
| Person-level Early Head Start eligibility | 93% | 95% | 90% | 95% | 98% | 88% | 95% | 93% | 95% | 98% | 100% | 100% | 93% | 95% |
| Person-level CHIP eligibility | 88% | 87% | 89% | 95% | 98% | 92% | 95% | 90% | 99% | 99% | 98% | 98% | 100% | 94% |
| Payroll tax | 98% | 98% | 92% | 98% | 98% | 91% | 98% | 89% | 69% | 97% | 96% | 97% | 98% | 94% |
| Person-level Head Start eligibility | 93% | 95% | 88% | 90% | 95% | 90% | 95% | 81% | 93% | 100% | 95% | 93% | 90% | 92% |
| Person-level Medicaid eligibility | 90% | 94% | 88% | 93% | 96% | 87% | 96% | 85% | 70% | 94% | 93% | 95% | 86% | 90% |
| Federal refundable credits | 93% | 93% | 76% | 81% | 95% | 72% | 97% | 77% | 79% | 93% | 87% | 85% | 84% | 86% |
| State refundable credits | 80% | 83% | 81% | 81% | 86% | 81% | 86% | 81% | 81% | 88% | 85% | 83% | 82% | 83% |
| SNAP | 82% | 82% | 73% | 78% | 86% | 77% | 85% | 80% | 75% | 85% | 83% | 81% | 80% | 80% |
| Federal tax before refundable credits | 88% | 83% | 66% | 72% | 82% | 61% | 90% | 60% | 60% | 83% | 78% | 77% | 71% | 75% |
| State tax before refundable credits | 75% | 80% | 56% | 72% | 78% | 57% | 83% | 54% | 48% | 78% | 77% | 74% | 70% | 70% |
Scenario explorer
Inspect benchmark households, reference outputs, model answers, and the exact prompt sent to every model.
| Program | Reference | Opus 4.7 | Grok 4.3 | GPT-5.5 | Pro Preview |
|---|---|---|---|---|---|
| Federal tax before refundable credits | $4,455 | ||||
| Federal refundable credits | $0 | ||||
| Free school meals eligibility | No | ||||
| Head CHIP eligibility | No | ||||
| Head Medicaid eligibility | No | ||||
| Head Medicare eligibility | No | ||||
| Head WIC eligibility | No | ||||
| Local income tax | $0 | ||||
| Payroll tax | $3,060 | ||||
| Reduced-price school meals eligibility | No | ||||
| Self-employment tax | $6,358 | ||||
| SNAP | $0 | ||||
| Spouse CHIP eligibility | No | ||||
| Spouse Medicaid eligibility | No | ||||
| Spouse Medicare eligibility | No | ||||
| Spouse WIC eligibility | No | ||||
| SSI | $0 | ||||
| State tax before refundable credits | $0 | ||||
| State refundable credits | $0 | ||||
| TANF | $0 | ||||
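Two of the reference values above can be reproduced by hand from the household facts in the prompt. The sketch below assumes the standard federal rates, which are not stated on this page: 6.2% employee Social Security plus 1.45% employee Medicare on wages, and 15.3% self-employment tax on 92.35% of net self-employment earnings. It also assumes that both incomes are below the Social Security wage base and below the Additional Medicare Tax threshold.

```python
# Assumed statutory rates (not stated in the source page).
EMPLOYEE_FICA_RATE = 0.062 + 0.0145   # employee Social Security + Medicare
SE_NET_EARNINGS_FACTOR = 0.9235       # share of SE income subject to SE tax
SE_TAX_RATE = 0.153                   # combined Social Security + Medicare

# Household facts taken from the prompt for this scenario.
spouse_wages = 40_000
head_self_employment_income = 45_000

payroll_tax = spouse_wages * EMPLOYEE_FICA_RATE
self_employment_tax = (
    head_self_employment_income * SE_NET_EARNINGS_FACTOR * SE_TAX_RATE
)

print(round(payroll_tax), round(self_employment_tax))
# 3060 6358
```

Both results match the Reference column ($3,060 and $6,358) after rounding to whole dollars.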
260 of 260 model-output rows for this household include explanation text returned by the model. 24 rows include developer audit notes for incorrect predictions, and 24 incorrect rows include case-level notes comparing wrong models on the same household-output target.
Exact prompt: Full household batch contract for all benchmark outputs. Provider-specific structured-output transport, no external tool.
Estimate the requested tax and benefit outputs using only the household facts below. All listed people live together and are in one household group for tax and benefit calculations. All listed facts describe the full tax-benefit year. Treat demographic, work, student, disability, housing, health coverage, and household-composition facts as constant throughout the tax-benefit year, with no within-year income volatility or status changes. Gross wage and salary amounts are annual totals, including any overtime pay; hourly wage is a straight-time rate when listed. Treat any unlisted numeric input as 0 and any other unlisted household fact, boolean, or status input as false. Assume tax filing and program take-up when required. Do not infer unlisted income, expenses, assets, benefit receipt, rent, or health coverage.

Household:
- state: TX
- tax year: 2026

Head:
- age: 63
- has employer-sponsored insurance
- usual weekly hours worked: 45
- other medical expenses: $300
- over-the-counter health expenses: $20
- real estate taxes: $9,500
- self-employment income: $45,000

Spouse:
- age: 61
- gross wages and salaries: $40,000
- bank account assets: $400
- employer sponsored insurance premiums: $18,208
- has employer-sponsored insurance
- health insurance premiums excluding Medicare Part B: $3,000
- hourly wage: $19
- usual weekly hours worked: 40
- other health insurance premiums: $3,000
- other medical expenses: $6,000
- over-the-counter health expenses: $1,200

Household inputs:
- household vehicles value: $40,500

Provide the following policy quantities for this household:
- federal_income_tax_before_refundable_credits: federal individual income tax after nonrefundable credits and before refundable credits. This subtracts nonrefundable credits actually used, including CDCC and the nonrefundable portion of CTC or other credits when applicable; it does not subtract EITC or refundable portions of credits such as refundable CTC
- federal_refundable_credits: total refundable federal income tax credits, including EITC and refundable portions of credits such as refundable CTC when applicable; exclude the ACA Premium Tax Credit
- payroll_tax: annual household employee-side payroll tax: employee Social Security tax, employee Medicare tax, Additional Medicare Tax, and mandatory employee state payroll taxes. Exclude employer payroll taxes, FUTA, employer unemployment-insurance taxes, and self-employment tax
- self_employment_tax: annual self-employment tax liability, excluding employee payroll taxes and Additional Medicare Tax
- state_income_tax_before_refundable_credits: state individual income tax after nonrefundable credits and before refundable credits, excluding local income and payroll taxes
- state_refundable_credits: total refundable state individual income tax credits
- local_income_tax: annual local income, wage, and earnings tax liability in the separate local-income-tax output: NYC income tax, Philadelphia wage tax, Kansas City earnings tax, and St. Louis earnings tax where applicable
- snap: annual SNAP (food stamps) benefit amount
- ssi: annual Supplemental Security Income (SSI) amount
- tanf: annual Temporary Assistance for Needy Families (TANF) benefit amount
- head_wic_eligible: whether Head is eligible for WIC (1 if yes, 0 if no)
- spouse_wic_eligible: whether Spouse is eligible for WIC (1 if yes, 0 if no)
- head_medicaid_eligible: whether Head is eligible for Medicaid under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- spouse_medicaid_eligible: whether Spouse is eligible for Medicaid under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- head_chip_eligible: whether Head is eligible for CHIP under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- spouse_chip_eligible: whether Spouse is eligible for CHIP under PolicyEngine rules, not whether they are currently enrolled (1 if yes, 0 if no)
- head_medicare_eligible: whether Head is eligible for Medicare (1 if yes, 0 if no)
- spouse_medicare_eligible: whether Spouse is eligible for Medicare (1 if yes, 0 if no)
- free_school_meals_eligible: whether PolicyEngine returns positive annual free school meal support for the household (1 if yes, 0 if no; reduced-price meals do not count as 1)
- reduced_price_school_meals_eligible: whether PolicyEngine returns positive annual reduced-price school meal support for the household (1 if yes, 0 if no; free meals do not count as 1)

Use the `submit_outputs` function exactly once. Return an `outputs` object with every requested quantity keyed by variable name. Each requested key must map to an object with a numeric `value` and a non-empty, specific, concise `explanation`. Each explanation must support the numeric value submitted for the same variable in `outputs`. If an explanation mentions a final amount, that amount must match the corresponding `outputs` value. Do not write that you will use one value while submitting a different value. Do not include scratch work, abandoned calculations, or corrections. End each explanation with `value = X`, where X exactly matches the numeric `value` field. For 1/0 eligibility outputs, submit 1 only when the explanation says eligible or yes, and submit 0 only when it says not eligible or no. Use the exact variable names as keys inside `outputs`. Include every requested key exactly once in `outputs`, even if the value is 0. Put only numeric values in `value`, with no dollar signs, commas, or explanatory text. Do not rely on plain text for the final answers. If an answer is a currency amount, give the annual amount. If an answer is a rate, give a decimal (e.g. 0.25 for 25%).
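The response contract in the prompt can be checked mechanically. The following is a minimal sketch of such a check, not the benchmark's actual grader: `validate_outputs` is a hypothetical helper, the example payloads are invented for illustration, and only a subset of the contract is enforced (required keys, numeric `value`, non-empty `explanation`, a trailing `value = X` that matches, and 0/1 flags).

```python
import math
import re

# Matches a trailing "value = X" with an optional sign and decimal part.
TRAILING_VALUE = re.compile(r"value\s*=\s*(-?\d+(?:\.\d+)?)\s*$")


def validate_outputs(outputs, requested, binary=frozenset()):
    """Return a list of contract violations for a submit_outputs-style payload."""
    problems = []
    for key in requested:
        if key not in outputs:
            problems.append(f"{key}: missing")
            continue
        value = outputs[key].get("value")
        explanation = outputs[key].get("explanation", "")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: value is not numeric")
            continue
        if not explanation.strip():
            problems.append(f"{key}: empty explanation")
            continue
        match = TRAILING_VALUE.search(explanation.strip())
        if match is None or not math.isclose(float(match.group(1)), value):
            problems.append(f"{key}: explanation does not end with a matching 'value = X'")
        if key in binary and value not in (0, 1):
            problems.append(f"{key}: eligibility flag must be 0 or 1")
    for key in outputs:
        if key not in requested:
            problems.append(f"{key}: unexpected key")
    return problems


requested = ["payroll_tax", "head_wic_eligible"]
flags = {"head_wic_eligible"}

# Invented payloads for illustration only.
good = {
    "payroll_tax": {"value": 3060, "explanation": "7.65% of 40000 in wages. value = 3060"},
    "head_wic_eligible": {"value": 0, "explanation": "No qualifying category; not eligible. value = 0"},
}
bad = {
    "payroll_tax": {"value": 3060, "explanation": "Roughly 3100 in total. value = 3100"},
}

print(validate_outputs(good, requested, flags))  # []
print(validate_outputs(bad, requested, flags))
```

The second call reports two problems: the `payroll_tax` explanation ends with a value that differs from the submitted one, and `head_wic_eligible` is missing.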
Where models still break
The hardest part of PolicyBench is not saying when a program is zero. It is getting the positive amount right for the households that actually qualify. The cards below split those cases apart so the benchmark is not flattered by easy zero-answer rows.
These cards are intentionally stricter than the aggregate leaderboard, but they still use within-10% accuracy for dollar-valued programs so that positive cases stay interpretable. Accuracy on positive-amount cases is the harder and more informative number for benefits and refundable credits. For binary coverage flags, the cards compare positive-class and negative-class accuracy.
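The split described above can be sketched as follows. This is an illustrative reading of the card logic, not the app's code: the helper names are hypothetical, the rows are made-up (prediction, reference) pairs, and references are assumed to be non-negative so that every nonzero reference counts as a positive case.

```python
def within_tolerance(pred, ref, tol=0.10):
    """Exact match for zero references, relative tolerance otherwise."""
    if ref == 0:
        return pred == 0
    return abs(pred - ref) <= tol * abs(ref)


def split_accuracy(rows, tol=0.10):
    """Hit rate reported separately for positive-reference and zero-reference rows."""
    buckets = {"positive": [], "zero": []}
    for pred, ref in rows:
        bucket = "zero" if ref == 0 else "positive"
        buckets[bucket].append(within_tolerance(pred, ref, tol))
    return {
        name: (sum(hits) / len(hits) if hits else None)
        for name, hits in buckets.items()
    }


# Made-up rows: three easy zeros and three households that actually qualify.
rows = [(0, 0), (0, 0), (0, 0), (0, 2400), (2300, 2400), (5000, 2400)]

overall = sum(within_tolerance(p, r) for p, r in rows) / len(rows)
print(round(overall, 2))     # 0.67
print(split_accuracy(rows))  # {'positive': 0.3333333333333333, 'zero': 1.0}
```

In this toy example the pooled hit rate looks respectable only because the zero rows are trivially correct; the positive-case rate exposes the misses.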
These expanders summarize recurring miss patterns from direct reads of model answers and explanations. They sit here with failure modes because they describe why the low-scoring program slices break.
State tax before refundable credits
Avg 70%
This output usually fails when models import rough federal or flat-rate logic into state-specific tax bases.
- In small-liability cases like scenario_055 and scenario_060, several models overshoot the reference value by a wide margin.
- In large cases like scenario_042, the main failure is still the wrong state tax base rather than the final credit step.
Federal tax before refundable credits
Avg 75%
This target isolates federal income tax after nonrefundable credits but before refundable credits.
- It subtracts nonrefundable credits actually used, such as CDCC and the nonrefundable part of CTC when applicable.
- It leaves EITC and refundable credit portions, such as refundable CTC, for the refundable-credits output.
SNAP
Avg 80%
Positive SNAP cases are the main miss; many models zero them out using raw asset or net-worth heuristics.
- In scenario_035 and scenario_047, several models return $0 on households with reference values above $11,000.
- In scenario_092, Gemini Pro cites SNAP asset limits and returns $0 even though the visible prompt inputs do not support that denial.
Federal refundable credits
Avg 86%
This target captures the refundable federal credit side of the income-tax calculation.
- It includes EITC and refundable portions of credits such as refundable CTC when applicable.
- It keeps refundable income-tax credits separate from the nonrefundable-credit target.
State refundable credits
Avg 83%
Most rows are easy zeros; the informative misses are the few positive state credits that models leave at zero.
- In Colorado scenario_090, several models return $0 against a $6,836 reference value.
- When models do predict a positive state credit, they often derive it from a rough federal-credit ratio instead of the state program itself.
Payroll tax
Avg 94%
This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.
Person-level Medicaid eligibility
Avg 90%
Models often overuse Medicare enrollment or visible assets as disqualifiers and miss non-wage eligibility pathways.
- In scenario_054 and scenario_076, many models return 0 when the reference flag is 1.
- The errors are not just arithmetic. They reflect the wrong eligibility pathway being chosen from the household facts.
Person-level Head Start eligibility
Avg 92%
This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.
Person-level CHIP eligibility
Avg 94%
Models often overuse Medicare enrollment or visible assets as disqualifiers and miss non-wage eligibility pathways.
- In scenario_054 and scenario_076, many models return 0 when the reference flag is 1.
- The errors are not just arithmetic. They reflect the wrong eligibility pathway being chosen from the household facts.
Person-level Early Head Start eligibility
Avg 95%
This target combines multiple policy rules, and errors usually come from positive cases rather than zero cases.
How the United States benchmark works
PolicyBench measures a no-tools task: how well frontier models can estimate person- and household-level tax and benefit outputs from the prompt alone while following a structured response contract. This app shows the current no-tools US benchmark on a fixed test set, with PolicyEngine reference outputs computed by PolicyEngine-US for tax year 2026.
The bounded score is max(0, 1 − |pred − ref| / |ref|) when the reference is nonzero, and an exact zero match when the reference is zero, for amount outputs; binary outputs use the same exact 0/1 match rule.

Each full source household's per-output share is |ref| / max(|household_net_income|, Σ |ref|), a value in [0, 1] that is strictly less than one when net income dominates the gross tax-benefit flow and equals one only when programs cancel each other out. Those shares are averaged using calibrated household weights in the full weighting population, then renormalized so the output weights sum to one. US weights use the full Enhanced CPS; UK weights use the full enhanced FRS. This weighting source is separate from the UK benchmark scenarios, which use the public calibrated transfer dataset. The weights are then applied to the fixed benchmark households and renormalized within each household over requested outputs. Person-level eligibility flags like Medicaid carry weight through PolicyEngine's paired per-capita value (e.g. medicaid_value), so the LLM is graded only on the boolean call itself. Missing or unparseable answers count as misses through the coverage multiplier.

The leaderboard reports within-1% as the headline, exact match as the deployability bar, and bounded score, amount accuracy, and participation accuracy as diagnostic companions. Equal-weight and budget-weighted variants are reported alongside for transparency. The leaderboard is a point estimate on this fixed test set.
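The two formulas above translate directly into code. The sketch below is one possible reading, not PolicyBench's implementation: the function names are hypothetical, the household net income of 70,000 is an invented figure, and the population-level averaging with calibrated household weights is omitted.

```python
def bounded_score(pred, ref):
    """max(0, 1 - |pred - ref| / |ref|) for nonzero refs; exact match for zero refs."""
    if ref == 0:
        return 1.0 if pred == 0 else 0.0
    return max(0.0, 1.0 - abs(pred - ref) / abs(ref))


def binary_score(pred, ref):
    """Exact 0/1 match for eligibility flags."""
    return 1.0 if pred == ref else 0.0


def output_shares(refs, household_net_income):
    """Per-output share |ref| / max(|household_net_income|, sum of |ref|)."""
    denominator = max(abs(household_net_income), sum(abs(v) for v in refs.values()))
    if denominator == 0:
        return {name: 0.0 for name in refs}
    return {name: abs(v) / denominator for name, v in refs.items()}


# Partial credit shrinks linearly with relative error and bottoms out at zero.
print(round(bounded_score(4000, 4455), 3))  # 0.898
print(bounded_score(9000, 4455))            # 0.0 (more than 100% off)
print(bounded_score(0, 0))                  # 1.0
print(binary_score(1, 0))                   # 0.0

# Reference amounts from the scenario explorer; net income is an invented figure.
refs = {"federal_tax": 4455, "payroll_tax": 3060, "self_employment_tax": 6358, "snap": 0}
shares = output_shares(refs, household_net_income=70_000)
print({name: round(share, 4) for name, share in shares.items()})
```

Because the denominator is at least as large as the sum of the absolute reference values, the shares of one household can never add up to more than one.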