Can a teacher really impact student height? A cautionary tale on value-added models


Few research methodologies have influenced contemporary educational research, policy, and practice as deeply as teacher value-added models (VAMs). VAMs are used for many different purposes, from academic research on the long-run effects of teachers and factors related to teacher effectiveness to real-world decisions around teacher pay and retention. How these models are estimated depends on the purpose at hand and the data available, but they share the goal of using annual standardized tests to estimate each teacher’s contribution to student achievement growth, net of students’ prior year achievement and demographics.
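To make the setup concrete, a stripped-down VAM can be thought of as a regression of students' current test scores on their prior-year scores plus an indicator for each teacher, with the coefficient on each indicator serving as that teacher's value-added estimate. The sketch below, in Python with simulated data and hypothetical parameter values (it is an illustration of the general idea, not the models or data from our paper), shows how such estimates are recovered:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 teachers, 30 students each (all numbers illustrative).
n_teachers, n_per_class = 20, 30
n = n_teachers * n_per_class
teacher = np.repeat(np.arange(n_teachers), n_per_class)

# Simulated data: current score = persistence of prior score
# + a true teacher effect + student-level noise.
prior = rng.normal(0, 1, n)
true_effect = rng.normal(0, 0.2, n_teachers)
score = 0.7 * prior + true_effect[teacher] + rng.normal(0, 0.5, n)

# Design matrix: prior score plus one dummy column per teacher.
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([prior, dummies])

# Least-squares fit; the teacher-dummy coefficients are the value-added estimates.
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
estimated_effects = beta[1:]

print(np.corrcoef(true_effect, estimated_effects)[0, 1])
```

In this clean simulated world the estimated effects track the true ones closely; the debates discussed below are about whether real classrooms satisfy anything like these assumptions.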

Given their increasing relevance to educational policy and practice, we believe it’s important to reflect carefully on what VAMs do—and don’t—measure. There has been ongoing debate about whether VAMs successfully separate teachers’ own contributions from other factors influencing student achievement that are outside their control: things like parental support, student ability, school context, or the interactions among students in a classroom.

For a paper released in early December, we took a new approach to examining the so-called “face validity” of traditional VAMs—that is, whether they yield plausible results or results that are clearly incorrect. We did this by estimating several VAMs of the kind used to measure teachers’ impacts on test scores, but with a twist: instead of using test scores as the outcome measure, we used students’ heights.

Of course, we don’t believe it’s plausible that teachers influence how quickly their students grow in stature. So, we found it troubling that some commonly used models return significant teacher effects on height. These implausible teacher effects on height are smaller than—but nonetheless roughly comparable to—estimated teacher effects on test scores, even after adjusting for the factors that VAMs generally account for.

These findings have proven provocative, appearing in The Washington Post, as well as other venues. We are pleased that our paper has provoked a lively conversation about the uses and limits of VAMs. However, we have seen a number of claims that overgeneralize our findings, suggesting that all uses of all types of VAMs are invalid. This conclusion is not supported by our paper. We believe that our study has some good news about certain VAMs, particularly those used in research. At the same time, however, we see important cautions in our findings for the use of value-added models in policy and practice.

Good news about the use of VAMs in research

Some critics worry that teacher value-added models capture more about the students in teachers’ classrooms than about the teachers’ effectiveness. Had we found correlations between teacher effects on height and other commonly measured student characteristics, our findings would have lent credence to these concerns. We found no such correlations. It’s impossible to prove a negative, so their absence does not prove the absence of bias from sources that school districts can’t measure—but it is somewhat reassuring.

That we find teacher effects on height does not, therefore, invalidate the consistent finding emerging from VAM research that teachers matter for students’ achievement and other life outcomes. The most sophisticated VAMs currently used in research aim to measure teachers’ persistent contributions to student outcomes (that is, a systematic effect that is observed year over year). Although the question is not fully settled, these models find that teachers vary substantially in their contribution to achievement growth and that exposure to high value-added teachers has measurable positive effects on students’ educational attainment, employment, and other long-term outcomes. Importantly, we find no teacher effects on height using the more sophisticated models these researchers use.

Cautions for practice

Even so, VAMs may still be problematic when applied in practice. Inspired by the desire to improve teacher evaluation and the initial promise that VAMs offered, many school districts and states have integrated teacher value-added models into their own teacher evaluation systems. This includes encouraging school leaders to consider value-added scores when they make teacher assignments, incorporating value-added scores into teacher pay and retention decisions, and in the most controversial cases, creating systems designed to publicly disclose teachers’ value-added scores. But the models behind many of these real-world applications are often less sophisticated than the ones that researchers use, and our results suggest that they are more likely to yield misleading conclusions.

In many cases, practical applications use single-year models like the models that yield implausible teacher effects on height in our analyses. Our findings reinforce previous work identifying problems with these models, demonstrating that random error can lead observers to draw mistaken conclusions about teacher quality with striking regularity.

The more sophisticated models used in research draw on several years of classroom data to estimate a teacher’s persistent effect. When we use these multi-year models, we do not find effects on height, suggesting that the effects we see in simpler VAMs reflect year-to-year variation that should be treated as random error rather than as a systematic factor. But because multi-year models identify the persistent effect, they do not pick up short-term changes in teacher effectiveness unless those changes persist. As such, these models are of limited usefulness for motivating annual performance-pay goals, since teachers may find it hard to move their scores quickly. Multi-year models may work well, however, for identifying persistently very good or very bad teachers, but only after several years of teaching.
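To illustrate why averaging over years matters, consider a small simulation (ours, with purely hypothetical numbers, not an analysis from the paper) in which teachers have no true effect on the outcome—as with height—yet single-year classroom averages still spread out because of sampling noise. Averaging each teacher’s estimates over several years shrinks that spurious spread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: no teacher has any true effect on the outcome,
# but each year's classroom-mean estimate carries sampling noise.
n_teachers, n_years, class_size = 100, 5, 25
noise_sd = 1.0

# Each cell: one teacher's noisy "effect" estimate for one year.
yearly_estimates = rng.normal(0, noise_sd / np.sqrt(class_size),
                              (n_teachers, n_years))

single_year = yearly_estimates[:, 0]          # one-year estimates
multi_year = yearly_estimates.mean(axis=1)    # averaged over years

# The spread of purely random "effects" shrinks once years are averaged.
print(single_year.std(), multi_year.std())
```

In this toy world every apparent “effect” is noise, yet the single-year estimates alone could easily be mistaken for real differences between teachers; the multi-year average pulls them back toward zero.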

VAMs have underscored the importance of teachers, and we believe that they have a role to play in future educational research and policy. We just need to be more modest in our expectations for these models and make sure that the empirical tool fits the job at hand.