Agentic AI Evaluation

Through this series of convenings and research publications, Brookings, CMU, and UC Berkeley aim to advance research and policymaking that addresses foundational challenges in measurement and evaluation of agentic AI systems.

Abstract illustration of agentic artificial intelligence. Image generated by Adobe Firefly, consistent with Brookings' generative AI use policies.

The Challenge

As policymakers and industry leaders around the world work to develop governance solutions to keep pace with AI development and deployment, they face a shared fundamental challenge: effective governance relies on the ability to measure. Any system of accountability, liability, or oversight depends on a reliable and scalable way of understanding what AI systems are capable of and verifying that they are fit for purpose. Without robust measurement tools, many claims about the capabilities, risks, and impacts of AI can be difficult to verify independently.

This challenge is especially pronounced for agentic AI systems, commonly defined as models that perceive context, set and update goals, plan, and take action through tools or environments. These systems are transitioning from research settings into real-world workflows, even as both their precise definition and the methods for measuring their performance and reliability continue to evolve.

  • What is agentic AI?

    No single definition of agentic AI has gained universal acceptance, which can make it harder to ground measurement and evaluation frameworks, and the policy decisions that depend on them. Researchers, developers, and policymakers have proposed their own definitions, often emphasizing different features and trade-offs. “Agency” spans a spectrum across several practical dimensions, which may include autonomy, planning horizons, tool use and environment coupling, and adaptivity. Familiarity with this spectrum may inform both technical evaluation and policy considerations related to evaluation, accountability, and deployment contexts.

  • Technical barriers to evaluation

    While evaluation of conventional AI tools is already subject to several well-documented limitations, evaluating agentic AI systems presents a considerably greater technical challenge. Core capabilities remain difficult to measure systematically; autonomy levels vary widely across implementations; social intelligence can manifest differently in controlled settings than in real-world environments; and long-horizon planning complicates standardization through conventional benchmarks. Current evaluation frameworks tend to focus narrowly on capability and accuracy, may not fully reflect real-world deployment conditions, and operate without agreed-upon standards for scientific validity.

  • Consequences for adoption

    The measurement gap has practical consequences for deployers making adoption decisions. Organizations may find it difficult to deploy systems whose behavior they cannot reliably predict, and without reliable evaluation frameworks, it can be challenging to fully assess risks around security, liability, misuse, and systemic harms. This challenge is even more pronounced where agentic AI systems act on behalf of, rather than in conjunction with, human operators. In many deployment scenarios, users are expected to delegate their workflows, private information, or online identities to these systems, and difficulty anticipating performance or failure modes can create hesitation that slows broader deployment and integration of agentic AI into business-critical workflows. As agentic AI systems are integrated into increasingly sensitive domains and become more widespread across our economy, expanding coverage, reproducibility, and comparability will be important to keep evaluations transparent and to provide insights that are useful to consumers, deployers, and policymakers.

Closing the gap

AI governance and adoption are closely tied to measurement. Yet the field of AI metrology and evaluation is still developing and varied in its approaches. The continued advancement of the science, adoption, and governance of AI points to the value of structured, interdisciplinary collaboration. Technical researchers, policymakers, social scientists, and legal experts would benefit from co-developing evaluation approaches that balance innovation with safety, accountability, and context sensitivity. This is particularly relevant for autonomous systems, where evaluations need to capture capabilities as well as limitations, failure modes, and interaction effects within complex organizational contexts.

Our Solution

Last year, The Brookings Institution and Carnegie Mellon University gathered a group of stakeholders from across academia, industry, government, and civil society to launch a series of convenings and research publications that aim to 1) build consensus on foundational questions in agentic AI measurement and evaluation, 2) set out a research agenda for addressing those questions, and 3) help bring together a network of experts that will carry this work forward. 

Over 2026, Brookings and CMU, in partnership with UC Berkeley, will convene a series of multistakeholder workshops and public events as part of this joint research effort to address the technical, legal, and policy dimensions of the measurement challenge for agentic AI systems, with the following target outcomes:

  • Build a multidisciplinary network of agentic AI experts

    • Develop a collaborative community spanning academia, industry, government, and civil society.
    • Bring together stakeholders who understand how measurement and evaluation drive innovation in agentic AI design, policy, and governance.
  • Develop a research roadmap for agentic AI measurement

    • Address measurement and evaluation challenges along the entire agentic AI stack, from deployment to workflow integration.
    • Create trustworthy telemetry and instrumentation designed to provide data at different levels of abstraction (e.g., scratchpad chain-of-thought, tool use and invocation logs, cognitive logs using models such as belief-desire-intention) to support auditable, interpretable, and governable agentic AI models and systems.
    • Extend beyond basic evaluation to create frameworks for red-teaming, blue-teaming, and field testing that measure both immediate and long-term benefits and risks of agentic AI deployment.
  • Foster cross-sector collaboration and shared infrastructure

    • Identify and pursue opportunities for joint initiatives between academia, industry, civil society, and government.
    • Target key deliverables, including standardized benchmarks, open-source evaluation tools, and technical sandboxes that can support evidence-based policy development.
  • Support an evidence-informed policy and governance framework

    • Develop a roadmap for agentic AI governance grounded in measurement and evaluation practices. Leverage the expert network to both craft this framework and use it to inform policymakers at state, federal, and international levels.
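As an illustrative sketch only, and not a project deliverable, the multi-level telemetry described in the research roadmap above could be captured as structured, append-only event records. The event levels, field names, and `TelemetryEvent` class below are hypothetical, assumed purely for illustration of what auditable agent instrumentation might look like.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any

# Hypothetical abstraction levels, loosely mirroring the roadmap above:
# reasoning traces, tool-invocation logs, and cognitive-state
# (belief-desire-intention) logs.
REASONING, TOOL_CALL, COGNITIVE = "reasoning", "tool_call", "cognitive"

@dataclass
class TelemetryEvent:
    """One auditable record emitted by an agent during a task."""
    agent_id: str
    level: str                  # one of REASONING, TOOL_CALL, COGNITIVE
    payload: dict               # level-specific content
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_record(self) -> dict:
        """Serialize the event for an append-only audit log."""
        return asdict(self)

# Example: logging a tool invocation followed by a belief update.
events = [
    TelemetryEvent("agent-7", TOOL_CALL,
                   {"tool": "search", "args": {"query": "flight prices"}}),
    TelemetryEvent("agent-7", COGNITIVE,
                   {"belief": "budget_exceeded", "confidence": 0.8}),
]
audit_log = [e.to_record() for e in events]
```

Records of this shape could, in principle, be compared and aggregated across agents and deployments, which is one way instrumentation might support the reproducibility and comparability goals described above.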

Register your interest at the link below to let us know if you’d like to receive updates on our future work.

Interest Form

Project Contributors

The Brookings Institution

Carnegie Mellon University

University of California Berkeley