PolicyBench by PolicyEngine

11 frontier models across 2,000 households in 2 countries. 100% = exact answers across the full benchmark.

Top score: 78.0%
Countries: 2
Models: 11
Households: 2,000
Leaderboard

Global rankings

Global scores are equal-weight averages of each model’s US and UK benchmark scores. Only models with both country runs are included.

 Rank  Model                          Global   US      UK
 1     Gemini 3.1 Pro Preview         78.0%    73.4%   82.6%
 2     Grok 4.20                      77.3%    71.7%   83.0%
 3     Gemini 3 Flash Preview         70.6%    69.5%   71.6%
 4     GPT-5.4                        68.0%    67.2%   68.9%
 5     Gemini 3.1 Flash-Lite Preview  67.3%    65.9%   68.7%
 6     Claude Opus 4.6                67.1%    66.2%   68.1%
 7     Claude Sonnet 4.6              66.4%    65.2%   67.5%
 8     GPT-5.4 mini                   64.5%    64.1%   64.9%
 9     Claude Haiku 4.5               63.2%    62.4%   64.0%
 10    GPT-5.4 nano                   61.8%    60.2%   63.4%
 11    Grok 4.1 Fast                  57.5%    49.0%   66.0%
Methodology

How global scores work

The global leaderboard is a shared-model aggregate, not a separate benchmark. Each model’s global score is the equal-weight average of its country-level PolicyBench scores for the United States and United Kingdom.

Country benchmarks: 2
Shared models: 11
Total households: 2,000
Tax year: 2025
Aggregation
Only models with runs in both countries appear in the global table. Their global score is the equal-weight average of the two bounded country scores, rather than a currency-weighted or output-weighted blend.
Interpretation
This view answers a narrow question: which models travel best across policy systems? It does not replace the country-specific leaderboards, and it intentionally omits mean absolute error because dollars and pounds are not directly comparable.
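The aggregation rule above can be sketched in a few lines of Python. This is an illustrative reconstruction, not PolicyEngine's code: the `global_scores` helper and the data literal are ours, with country scores taken from the leaderboard on this page.

```python
# Country-level PolicyBench scores (percent), as reported on this page.
country_scores = {
    "Gemini 3.1 Pro Preview": {"US": 73.4, "UK": 82.6},
    "Grok 4.20": {"US": 71.7, "UK": 83.0},
    "Grok 4.1 Fast": {"US": 49.0, "UK": 66.0},
}

def global_scores(scores, countries=("US", "UK")):
    """Equal-weight average across countries.

    Models missing a run in any country are excluded, mirroring the
    global table's inclusion rule. Returns a dict sorted best-first.
    """
    out = {}
    for model, per_country in scores.items():
        if all(c in per_country for c in countries):
            out[model] = sum(per_country[c] for c in countries) / len(countries)
    return dict(sorted(out.items(), key=lambda kv: kv[1], reverse=True))
```

For example, Gemini 3.1 Pro Preview's global score is (73.4 + 82.6) / 2 = 78.0, matching the table; a model with only one country run would simply be dropped.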
Included country benchmarks
United States
1,000 households, 13 outputs, 11 evaluated models.
United Kingdom
1,000 households, 6 outputs, 11 evaluated models.