Leaderboard
Global rankings
Global scores are equal-weight averages of each model’s US and UK benchmark scores. Only models with both country runs are included.
#    Model                            Global score    Country scores
1    Gemini 3.1 Pro Preview           78.0%           US 73.4%, UK 82.6%
2    Grok 4.20                        77.3%           US 71.7%, UK 83.0%
3    Gemini 3 Flash Preview           70.6%           US 69.5%, UK 71.6%
4    GPT-5.4                          68.0%           US 67.2%, UK 68.9%
5    Gemini 3.1 Flash-Lite Preview    67.3%           US 65.9%, UK 68.7%
6    Claude Opus 4.6                  67.1%           US 66.2%, UK 68.1%
7    Claude Sonnet 4.6                66.4%           US 65.2%, UK 67.5%
8    GPT-5.4 mini                     64.5%           US 64.1%, UK 64.9%
9    Claude Haiku 4.5                 63.2%           US 62.4%, UK 64.0%
10   GPT-5.4 nano                     61.8%           US 60.2%, UK 63.4%
11   Grok 4.1 Fast                    57.5%           US 49.0%, UK 66.0%
Methodology
How global scores work
The global leaderboard is a shared-model aggregate, not a separate benchmark. Each model’s global score is the equal-weight average of its country-level PolicyBench scores for the United States and United Kingdom.
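As a worked illustration of the averaging rule, the global score can be written as below. The symbols are notation assumed here, not taken from the site, and the example uses the Gemini 3.1 Pro Preview row of the leaderboard.

```latex
S_{\text{global}} = \frac{S_{\text{US}} + S_{\text{UK}}}{2},
\qquad
\text{e.g. } \frac{73.4\% + 82.6\%}{2} = 78.0\%
```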
Country benchmarks: 2
Shared models: 11
Total households: 2,000
Tax year: 2025
Aggregation
Only models with both country runs appear in the global table. Their global score is the average of the bounded country scores, rather than a currency-weighted or output-weighted blend.
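The inclusion rule and the equal-weight average can be sketched in a few lines of Python. This is a minimal illustration, not PolicyBench's actual code: the data layout, the `global_scores` function, and the US-only entry are hypothetical, while the three real rows are copied from the leaderboard.

```python
from statistics import mean

# Country scores in percent, copied from the leaderboard. The last entry is a
# hypothetical model with only a US run, added to illustrate the inclusion rule.
country_scores = {
    "Gemini 3.1 Pro Preview": {"US": 73.4, "UK": 82.6},
    "Grok 4.20": {"US": 71.7, "UK": 83.0},
    "Grok 4.1 Fast": {"US": 49.0, "UK": 66.0},
    "Hypothetical US-only model": {"US": 70.0},
}

REQUIRED_COUNTRIES = ("US", "UK")


def global_scores(scores):
    """Return (model, global score) pairs, best first.

    Models missing any required country run are skipped. Each country
    counts equally, with no currency or output weighting.
    """
    rows = []
    for model, by_country in scores.items():
        if all(country in by_country for country in REQUIRED_COUNTRIES):
            average = mean(by_country[country] for country in REQUIRED_COUNTRIES)
            rows.append((model, average))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = global_scores(country_scores)
for rank, (model, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {score:.2f}%")
```

Recomputing from the rounded country scores can differ from the published global score in the last digit. For Grok 4.20 the average of 71.7% and 83.0% is 77.35%, while the table shows 77.3%, presumably because the published figure is derived from unrounded country scores.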
Interpretation
This view answers a narrow question: which models travel best across policy systems? It does not replace the country-specific leaderboards, and it intentionally omits mean absolute error because dollars and pounds are not directly comparable.
Included country benchmarks
United States
1,000 households, 13 outputs, 11 evaluated models.
United Kingdom
1,000 households, 6 outputs, 11 evaluated models.