This is a Brookings Center on Regulation and Markets working paper.
Abstract
Benchmarks are essential for understanding Large Language Model (LLM) capabilities, yet they are costly to develop and update for real-world work tasks. Here we use an agentic AI approach in which LLMs themselves automatically generate and evaluate practical exams for tasks across Finance & Business Operations, Management, and Computer & Mathematics occupations. To develop these exams, we distinguish between the materials needed (e.g., text, data, images) and the tools required to solve them (e.g., function calling, web search). Focusing on text-only tasks requiring no tool use, we find that only 7% of tasks in these occupations (149 tasks) are testable. We deploy 13 models, including GPT, Claude, and Gemini variants, to complete these synthetic exams. Even on basic tasks, current LLMs struggle: leading models achieve median scores of 65-79%, with performance particularly weak on data manipulation and financial calculations. However, models show rapid improvement: those released in 2024 averaged 40.5%, while 2025 models reached 66%, a gain of roughly 26 percentage points in one year. While considerable work remains in validating the approach and extending it to tool use beyond the current text-only testable tasks, our results suggest that LLM-generated benchmarks may offer a cost-effective, scalable, and updatable approach for measuring AI workplace capabilities, extending the “LLM-as-a-judge” paradigm to occupational task assessment.
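To make the pipeline concrete, the following is a minimal Python sketch of the exam-generation and LLM-as-judge grading loop the abstract describes. It is an illustration under assumed interfaces, not the authors' implementation: every name in it (Task, call_llm, generate_exam, grade_response, evaluate) is a hypothetical placeholder, and call_llm is a stub to be replaced with a real model API client.

    # Hypothetical sketch of the agentic exam pipeline: one model writes the exam,
    # a candidate model answers it, and a judge model grades the answer.
    from dataclasses import dataclass
    from statistics import median

    @dataclass
    class Task:
        occupation: str   # e.g. "Financial Analysts"
        description: str  # occupational task statement
        materials: str    # text-only materials the exam provides (no tool use)

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for a chat-completion call; swap in a real client."""
        raise NotImplementedError

    def generate_exam(task: Task, examiner_model: str) -> str:
        prompt = (
            "Write a practical exam question for this occupational task, "
            f"answerable from text alone (no tools):\n{task.description}\n"
            f"Materials:\n{task.materials}"
        )
        return call_llm(examiner_model, prompt)

    def grade_response(exam: str, response: str, judge_model: str) -> float:
        prompt = (
            "Grade the candidate's answer to the exam below on a 0-100 scale.\n"
            f"Exam:\n{exam}\nAnswer:\n{response}\nReturn only the number."
        )
        return float(call_llm(judge_model, prompt))

    def evaluate(tasks: list[Task], candidate: str, examiner: str, judge: str) -> float:
        scores = []
        for task in tasks:
            exam = generate_exam(task, examiner)
            answer = call_llm(candidate, exam)
            scores.append(grade_response(exam, answer, judge))
        return median(scores)  # the paper reports median scores per model

In this sketch a single call_llm stub stands in for the examiner, candidate, and judge roles; a working pipeline would route each role to a specific provider's API and add validation of the generated exams.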
The Brookings Institution is committed to quality, independence, and impact.
We are supported by a diverse array of funders. In line with our values and policies, each Brookings publication represents the sole views of its author(s).