This is a Brookings Center on Regulation and Markets working paper.
Abstract
Benchmarks are essential for understanding Large Language Model (LLM) capabilities, yet they are costly to develop and update for real-world work tasks. Here we use an agentic AI approach in which LLMs themselves automatically generate and evaluate practical exams for tasks across Finance & Business Operations, Management, and Computer & Mathematics occupations. To develop these exams, we distinguish between the materials needed (e.g., text, data, images) and the tools required to solve them (e.g., function calling, web search). Focusing on text-only tasks requiring no tool use, we find that only 7% of tasks in these occupations (149 tasks) are testable. We deploy 13 models, including GPT, Claude, and Gemini variants, to complete these synthetic exams. Even on basic tasks, current LLMs struggle: leading models achieve median scores of 65-79%, with performance particularly weak on data manipulation and financial calculations. However, models show rapid improvement: those released in 2024 averaged 40.5%, while 2025 models reached 66%, a gain of roughly 26 percentage points in one year. While considerable work remains in validating the approach and extending it to tool use beyond the current text-only testable tasks, our results suggest that LLM-generated benchmarks may offer a cost-effective, scalable, and updatable approach for measuring AI workplace capabilities, extending the “LLM-as-a-judge” paradigm to occupational task assessment.
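To make the pipeline concrete, the following is a minimal Python sketch of the exam-generation and LLM-as-judge grading loop the abstract describes. It is an illustration under assumed interfaces, not the authors' implementation: every name in it (Task, call_llm, generate_exam, grade_response, evaluate) is a hypothetical placeholder, and call_llm is a stub to be replaced with a real model API client.

    # Hypothetical sketch of the agentic exam pipeline: one model writes the exam,
    # a candidate model answers it, and a judge model grades the answer.
    from dataclasses import dataclass
    from statistics import median

    @dataclass
    class Task:
        occupation: str   # e.g. "Financial Analysts"
        description: str  # occupational task statement
        materials: str    # text-only materials the exam provides (no tool use)

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for a chat-completion call; swap in a real client."""
        raise NotImplementedError

    def generate_exam(task: Task, examiner_model: str) -> str:
        prompt = (
            "Write a practical exam question for this occupational task, "
            f"answerable from text alone (no tools):\n{task.description}\n"
            f"Materials:\n{task.materials}"
        )
        return call_llm(examiner_model, prompt)

    def grade_response(exam: str, response: str, judge_model: str) -> float:
        prompt = (
            "Grade the candidate's answer to the exam below on a 0-100 scale.\n"
            f"Exam:\n{exam}\nAnswer:\n{response}\nReturn only the number."
        )
        return float(call_llm(judge_model, prompt))

    def evaluate(tasks: list[Task], candidate: str, examiner: str, judge: str) -> float:
        scores = []
        for task in tasks:
            exam = generate_exam(task, examiner)
            answer = call_llm(candidate, exam)
            scores.append(grade_response(exam, answer, judge))
        return median(scores)  # the paper reports median scores per model

In this sketch a single call_llm stub stands in for the examiner, candidate, and judge roles; a working pipeline would route each role to a specific provider's API and add validation of the generated exams.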
The Brookings Institution is committed to quality, independence, and impact.
We are supported by a diverse array of funders. In line with our values and policies, each Brookings publication represents the sole views of its author(s).