Last week, the American Educational Research Association (AERA), the largest professional association representing education researchers, issued a statement on the use of value-added measures (VAMs) of performance for evaluating educators or educator-preparation programs. While the statement falls short of blacklisting VAMs, it does caution against their use and specifies eight technical criteria that must be satisfied before it would endorse adopting such measures in teacher evaluation.
AERA’s statement adds to the cacophony of voices urging either restraint or outright prohibition of VAMs for evaluating educators or institutions. Doubtless, these stakeholders are genuinely concerned about potential unintended consequences of adopting these performance measures. We share these concerns, but bring a slightly different perspective. One must view the value of any particular performance measure in the context of all other measures, not relative to a nirvana that does not exist.
Statements like AERA’s prompt us to ask: what alternative performance measures are VAMs being compared to? Often, we fear, VAMs get criticized for failing to meet idealized standards that few performance measures could possibly meet. While more comparative research across fields would be valuable, our research has found that the reliability of VAMs is consistent with that of performance measures used in other occupations about as complex as teaching.
We also wonder if status quo measures tend to get a pass. States have longstanding practices for evaluating teacher preparation programs that involve reviews of curricular materials and visits to talk with professors charged with teacher education. In the case of evaluating individual teachers, states and localities have long relied on classroom observations. Maybe these work well and maybe they don’t (there’s a fair bit of evidence that they don’t differentiate between teachers at all; we leave it to the research community to draw conclusions about this), but surely they should face the same scrutiny as new alternatives.
Let’s return to AERA’s statement, which mentions that “there are promising alternatives currently in use…[including] teacher observation data and peer assistance and review models” (p. 4). Do we actually know much about these methods of evaluating teachers or teacher preparation institutions? The statement isn’t backed up by any cited evidence. So why are the alternatives considered “promising” while VAMs aren’t? In fact, the available evidence on these measures suggests they have their own shortcomings as performance measures and are less predictive of students’ test achievement to boot. This is not terribly surprising given that they often fail to differentiate between teachers, despite the received wisdom of parents and empirical evidence that teachers in fact differ markedly in their impacts on students. Contrast this with multiple studies causally linking VAMs with not only short-term test scores but also long-term student outcomes.
We share AERA’s conclusion that more research would be valuable for understanding what adverse consequences using VAMs might entail, how they may be used in conjunction with alternative measures, and whether those alternative measures are valid. But let’s not overlook the already ample evidence supporting the use of VAMs as a performance measure, particularly in contexts where they do not carry more than 50% of the weight in high-stakes consequences, which is true in every jurisdiction we know of.
The bottom line is that we join AERA’s call for continued research into VAMs, but believe that this call should be broadened to include any performance measure. We also caution that we should not let the perfect be the enemy of the good. All performance measures are imperfect. If we are looking for a performance measure that has zero errors, we ought to abandon performance evaluation altogether. The ultimate way to judge the efficacy of different measures is whether they help policymakers make decisions that lead to better outcomes for students. In the end, we cannot know whether new evaluation systems – VAMs or other means of judging performance – work well without trying them and assessing their effects.
The Brown Center Chalkboard launched in January 2013 as a weekly series of new analyses of policy, research, and practice relevant to U.S. education.
In July 2015, the Chalkboard was re-launched as a Brookings blog in order to offer more frequent, timely, and diverse content. Contributors to both the original paper series and current blog are committed to bringing evidence to bear on the debates around education policy in America.