The testing and explainability challenge facing human-machine teaming

A stealthy U.S. combat drone demonstrates deploying a smaller drone from its weapons bay.

Militaries around the world are preparing for the next generation of warfare—one in which human-machine teams are integral to operations. Faster decisionmaking, remote sensing, and coordinating across domains and battlespaces will likely be the keys to victory in future conflicts. To realize these advantages, militaries are investing in human-machine teaming (HMT), a class of technologies that aim to marry human judgment with the data-processing and response capabilities of modern computing.

HMT includes a range of technologies—from autonomous drone swarms conducting reconnaissance to pairing a soldier with an unmanned ground vehicle to clear a building—that make it difficult to define, posing a challenge for policymakers. Early HMT systems are already widely in use—in airline autopiloting systems, for example—but more sophisticated approaches are actively under development. Policymakers overseeing military modernization efforts will increasingly be asked difficult questions about when to deploy HMT and how to effectively monitor its actions.

The artificial intelligence technology at the heart of HMT is rapidly advancing, but two key technologies needed to deploy HMT responsibly—namely, methods to properly test and evaluate these systems, and to generate explanations for how AI “teammates” make decisions—are far less mature. The gap between the potential performance of HMT applications on the one hand, and the need for greater testing and explainability on the other, will be critical for policymakers to address as HMT systems are more widely developed and deployed.

Understanding HMT

Military applications of HMT already exist. A system known as MAGIC CARPET, for example, helps U.S. Navy pilots touch down safely when landing on aircraft carriers. Ordinarily, landing on an aircraft carrier requires hundreds of minute corrections and subjects pilots to huge stress. With MAGIC CARPET, the pilot maintains the aircraft’s flight path, while the computer takes care of those hundreds of other adjustments. Because the machine performs the tasks at which it excels, the number of corrections the pilot must make is reduced to the single digits, and pilots are able to focus on their key objectives: maintaining situational awareness and orientation to the landing site.

Fundamentally, HMT partners humans and machines as teammates, and incorporating machines into a “team” requires elevating machines from the role of “tool.” But delineating where an ordinary human-machine-tool relationship becomes a human-machine-team relationship is far from simple. One definition from the HMT literature describes machine teammates as “those technologies that draw inferences from information, derive new insights from information, find and provide relevant information to test assumptions, debate the validity of propositions offering evidence and arguments, propose solutions to unstructured problems, and participate in cognitive decision making processes with human actors.” Another set of criteria holds that machines operating as teammates must pursue the same goals as their human counterparts, be able to affect the current state, and coordinate action with human team members. By contrast, tools by this definition “handle inputs, not goals; require direct instruction for action; only complete assigned functions.” Some scholars of the technology define HMT in terms of goals and potential or ideal state, rather than trying to create a comprehensive definition. Others focus on characterizing HMT as a capability or an area of study, rather than defining it. While these efforts are important to development, they are likely to leave the policymaker wanting.

National-security researchers and technologists broadly predict that HMT applications will involve some forms of artificial intelligence and machine autonomy. Predicted applications include pairing human teams with autonomous drone swarms overhead to maintain battlespace awareness and information advantage, placing autonomous aircraft in formation with human leads, and conducting search and recovery for combat casualties. Proposed applications employing HMT outside of combat include observing and optimizing training, managing logistics, and using automated convoys to move supplies. AI and autonomy may be crucial elements of HMT applications, but they should not be conflated with HMT itself. AI can make recommendations that increase warfighter survivability and enable mission success without being a trusted teammate or more than a tool in a toolbox. Autonomous capabilities may enable a system to act without human input or intervention, but that does not necessarily mean it is part of a team or working side-by-side with a human.

Testing, evaluation, validation, and verification

One of the current challenges for human-machine teaming lies in assessing the performance of machine teammates. Decisionmaking systems need to be able to react to complex and evolving situations, which are difficult to predict in training and testing. The process of assessing machine teammate performance—testing, evaluation, validation, and verification (TEV&V)—is crucial both to assess the system’s deployability and to determine the appropriate rules to govern it. Nonetheless, HMT currently lacks the necessary frameworks for appropriately testing the technology.

To highlight the importance and the nuances of TEV&V, consider the following scenario. An operator is told that her machine teammate has 90% accuracy. But what does that mean? This may seem obvious—the machine teammate will succeed at its intended task nine times out of ten—but the reality is likely more complicated. For example, computer vision applications may be highly accurate in clear weather with good visibility. However, in poor weather or low visibility conditions, that performance is likely to degrade. Across all tests, the failure rate may be 10%, but in poor visibility it may degrade to a 100% failure rate.  
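The gap between an aggregate figure and condition-specific performance can be made concrete with a short sketch. The trial counts below are hypothetical, chosen only to match the 90% accuracy and 100% failure figures in the scenario above, and the function name `accuracy_by_condition` is likewise illustrative.

```python
from collections import defaultdict

# Hypothetical test log: (visibility condition, whether the machine succeeded).
# 90 clear-weather trials succeed; 10 poor-visibility trials all fail.
trials = [("clear", True)] * 90 + [("poor", False)] * 10


def accuracy_by_condition(results):
    """Return overall accuracy and accuracy broken out per condition."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for condition, succeeded in results:
        totals[condition] += 1
        successes[condition] += int(succeeded)
    overall = sum(successes.values()) / sum(totals.values())
    per_condition = {c: successes[c] / totals[c] for c in totals}
    return overall, per_condition


overall, per_condition = accuracy_by_condition(trials)
print(f"overall: {overall:.0%}")
for condition, acc in per_condition.items():
    print(f"{condition}: {acc:.0%}")
```

A single headline number hides the breakdown, so an operator would need the per-condition results to know when the teammate can be relied on.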

TEV&V allows developers and evaluators to test for other issues that may cause problems. For example, if the weather is poor and the machine teammate cannot identify targets, does it gracefully exit search mode and follow another protocol? Or does the machine teammate get stuck in an endless loop of searching because the criteria for the next step are never met? While this is an oversimplified example, it is through TEV&V that behaviors like this should be identified.
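As a minimal sketch of the graceful-exit behavior described above, the search can be bounded so that it hands off to a fallback when its exit criteria are never met. The `sense` callable, the `Outcome` names, and the attempt limit are all hypothetical and stand in for whatever a real system would use.

```python
from enum import Enum


class Outcome(Enum):
    TARGET_FOUND = "target_found"
    FALLBACK = "fallback"


def search_with_fallback(sense, max_attempts=5):
    """Try to detect a target a bounded number of times, then fall back.

    `sense` is a hypothetical callable returning True when a target is
    identified. Bounding the attempts guarantees the loop terminates even
    if the exit criterion is never met.
    """
    for attempt in range(1, max_attempts + 1):
        if sense():
            return Outcome.TARGET_FOUND, attempt
    # Exit criterion never met: leave search mode and hand off to another protocol.
    return Outcome.FALLBACK, max_attempts


# Poor weather simulated as a sensor that never identifies a target.
outcome, attempts = search_with_fallback(lambda: False, max_attempts=3)
print(outcome.value, attempts)
```

A test like the last two lines, which feeds the system a sensor that never succeeds, is the kind of check that would expose an endless search loop before deployment.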

Although factors such as performance and accuracy are key to making decisions about using HMT applications, there is currently no standard or understanding of how much or what kinds of TEV&V are sufficient to make these determinations. Human-machine teams complicate TEV&V by their sheer complexity and potential for variability. Understanding and evaluating the performance of individual teammates does not extrapolate to the team as a whole. Teammates are likely to behave differently while interacting, and failure can be due to individual team members or team interactions (among other things). Additionally, human teammate performance and reliability are difficult and expensive to evaluate since humans do not produce automated responses and metrics the way machines can. Thus, someone has to manually capture the human performance metrics and ensure they are a representative set across users, potential scenarios, and possible combinations of these factors.

It is also not possible to test all situations and combinations of factors a human-machine team may face, which means it is impossible to document every potential failure state. The result is that human operators may not know how to best enable their machine teammate in all situations because of unknown failure conditions. If a system learns or adapts over time, this becomes even less feasible since developers may not be able to predict how it will change. Also, as with humans, improved accuracy and individual performance may not always mean better teaming. Realizing those improvements relies on the ability of the teammates to translate those improvements into the team dynamic and their usefulness to the team.
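A rough count shows why exhaustive testing is out of reach. The factors and values below are hypothetical placeholders, not a real test plan; the point is only that the number of scenarios multiplies with each added factor.

```python
from itertools import product
from math import prod

# Hypothetical test factors and the values each can take.
factors = {
    "weather": ["clear", "rain", "fog", "snow"],
    "lighting": ["day", "dusk", "night"],
    "terrain": ["urban", "desert", "forest", "maritime"],
    "operator_experience": ["novice", "intermediate", "expert"],
    "comms_quality": ["good", "degraded", "denied"],
}

# The number of distinct scenarios is the product of the per-factor counts:
# 4 * 3 * 4 * 3 * 3 = 432.
scenario_count = prod(len(values) for values in factors.values())
print(scenario_count)

# Enumerating the scenarios confirms the count; each tuple is one case to test.
scenarios = list(product(*factors.values()))
assert len(scenarios) == scenario_count
```

Even this coarse grid of five factors yields hundreds of cases, and real conditions vary continuously rather than in a handful of labeled steps.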

Testing and evaluation is further complicated by the fact that AI models that underpin HMT applications may learn and evolve, changing behavior so fundamentally that previous TEV&V may no longer be relevant. A continuous TEV&V approach may support operators teaming with learning systems, but this is an expensive and time-consuming approach. Policymakers will need to both prioritize and fund such efforts and determine how learning systems should be continuously updated in keeping with their importance to the capability, the system, and the potential risks.

Explainability

The ability to conduct effective TEV&V is further hampered by the lack of explainable AI systems in HMT applications. “Explainability” refers to a system’s ability to provide a rationale for its decision. Such a rationale could take the form of a comprehensible rule set or the creation of a model with deterministic outcomes. Regardless of how it is implemented, explainability is the idea that the human operator can understand why a machine took a specific course of action. In the world of machine teammates, this is difficult since many systems are black box functions that do not provide intelligible reasoning on why a given decision was made. This is particularly problematic for policymakers since understanding why a machine teammate behaves in a certain way is often integral to oversight and accountability.
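One of the forms mentioned above, a comprehensible rule set, can be sketched in a few lines. The rules, thresholds, action names, and field names here are invented for illustration and do not describe any fielded system; the point is that each decision is returned together with the rule that produced it.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str
    rationale: str


def decide(visibility_km: float, target_confidence: float) -> Decision:
    """Apply ordered, human-readable rules and report which one fired.

    All thresholds are hypothetical placeholders.
    """
    if visibility_km < 1.0:
        return Decision(
            "defer_to_operator",
            f"visibility {visibility_km} km is below the 1.0 km minimum",
        )
    if target_confidence < 0.8:
        return Decision(
            "continue_search",
            f"confidence {target_confidence} is below the 0.8 threshold",
        )
    return Decision(
        "flag_for_review",
        f"confidence {target_confidence} meets the 0.8 threshold",
    )


decision = decide(visibility_km=0.5, target_confidence=0.95)
print(decision.action, "-", decision.rationale)
```

A black box system, by contrast, would return only the action, leaving the operator to guess which input drove it.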

As David Gunning and his colleagues noted as part of DARPA’s program on explainable AI (XAI), explainability is also important for facilitating operator trust in AI-enabled systems. In the case of HMT, if a human operator of an HMT system does not understand why her machine teammate made a given decision, she will probably be less likely to trust it. And if HMT systems are to deliver the gains in decisionmaking speed that they promise, operators must trust decisions made by the machine. Otherwise, they are likely to delay decisions and action. Moreover, if an operator does not understand why her machine teammate made an error, she may be less likely to trust it in future. And when a mission is successful, operators, developers, and policymakers will want to understand the machine’s decision processes, if only to replicate the success. Without explainability, it is difficult or impossible to prevent repeated mistakes or ensure repeated successes. Since the operators and developers may be unable to determine why a machine encountered a failure state, they may be unable to prevent that same situation in the future.

Explainability also makes it easier for users to understand whether a machine is making decisions based on inappropriate criteria. Consider a machine teammate using computer vision to identify targets for airstrikes. The computer vision system confuses two buildings and designates the wrong one for targeting but gets lucky—the wrong building also contained enemy combatants. The operators may never know the system encountered a failure state and picked the wrong building. While this may seem inconsequential if the mission objectives are achieved, it could have disastrous effects in future deployments if models are built on faulty data without the knowledge of those who operate and/or design the system.

Unfortunately, explainability is not as simple as asking a machine to explain why it made the choice it did. The machine or AI doing a task has no underlying understanding or conceptual model of what it is doing. As Randy Goebel and his colleagues highlight in their work on explainability, the AI in DeepMind’s AlphaGo is able to beat human players at Go, an incredibly complex game highly dependent on strategy, but it does not understand what it is doing or why a given action optimizes its outcome. It only knows that, based on the current set of inputs, a given set of responses will increase the chances of success. For a human, this set of algorithmic routines is incomprehensible. One well-known instance of this is AlphaGo’s famous match against world-leading Go player Lee Sedol, in which the machine made what is now known as “Move 37.” This move was so strange that bystanders thought the AI had made a mistake. However, as the game unfolded and AlphaGo won, it became clear that Move 37 had both turned the tide for the machine and changed human understanding of the game.

Further complicating HMT evaluation, there currently exists a tradeoff between accuracy and explainability. More complex algorithms tend to have higher accuracy, since they are able to better capture complexities of the phenomena they are modeling. However, the more complex a model is, the more difficult it is to explain, especially to a human operator. This tradeoff requires developers and policymakers to assess whether the priority for teaming machines is explainability or accuracy.

Implications for policymakers

The promise of HMT for national security lies in the capabilities it enables. As machine teammates become increasingly adaptive and sophisticated, these capabilities will expand accordingly. In order to appropriately evaluate new applications, policymakers will need to understand the capabilities being enabled. The current state of explainability and TEV&V for machine teammates needs the support of policymakers to catch up to the expanding deployment of human-machine teams. Without an understanding of what constitutes sufficient and effective TEV&V, policymakers and end users face significant uncertainty regarding machine teammates. Without explainability, machine teammates face an increased risk of making serious mistakes repeatedly and in unexpected, potentially disastrous ways.

To appropriately support HMT, policymakers will need to prioritize funding and development of TEV&V and explainability on pace with HMT development. Some of these efforts already exist, such as DARPA’s XAI efforts or the Defense Innovation Unit’s guidelines for TEV&V. Building on these initiatives has the potential to meaningfully advance the development of HMT systems.

The potential advantages of HMT capabilities in national security are likely to be crucial in the future of warfare and defense. But the cost of getting HMT wrong is also likely to set those capabilities back significantly. Therefore, as a national security priority, policymakers need to foster not only greater development of HMT itself, but also methods for TEV&V and explainability that are able to mitigate the risks associated with HMT capabilities.

Julie Obenauer Motley is a senior analyst at The Johns Hopkins University Applied Physics Laboratory.