Anyone participating in the education policy debate for five years or more probably staked out their position on the use of value-added (or student achievement growth) in teacher evaluations long ago. That’s unfortunate, because, as has happened with research on climate change, there has been a slew of new research, especially in the last three years, on the strengths and weaknesses of such measures. Given what we have learned, one wonders whether there would have been more consensus by now on the appropriate use of test-based measures in teacher evaluation if the debate had not started out so polarized.
On statistical volatility (or reliability) of value-added
Remarkably, there is no disagreement about the facts regarding volatility: the correlation in teacher-level value-added scores from one year to the next is in the range of .35 to .60. For those teaching English Language Arts, the results tend toward the bottom end of the range. For those teaching math, the results tend toward the top end of the range. Also, in middle school and high school, where the number of students taught in a given subject is larger, the stability of the measures tends to be higher.
Critics of value-added measures frequently cite year-to-year volatility as a primary reason for not using such measures for evaluating individual teachers. Indeed, if the measurements were so volatile that student achievement gains in one year were completely unrelated to a teacher’s likely success with future students, they would be right.
That is simply not the case. For many purposes, such as tenure or retention decisions, it is not the “year-to-year” correlation that matters, but the “year-to-career” correlation—that is, the degree to which a single year’s value-added measure provides information about a teacher’s likely impact on students over the rest of his or her career. It turns out that a year-to-year correlation of .35 to .60 implies that a single year of achievement gains is correlated .60 to .77 with a teacher’s average student achievement gain over his or her career. (The year-to-year correlation is diminished by the fact that each single year is subject to measurement error; such errors are averaged out over a career.)
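To see where that translation comes from, here is a minimal sketch (an illustration under simplifying assumptions, not a result from any particular dataset). It treats each year’s value-added estimate as a teacher’s stable long-run effect plus independent yearly noise; under that assumption, a year-to-year correlation of r implies that a single year’s estimate correlates at roughly the square root of r with the long-run average, because the noise washes out of the average over a career.

```python
# Illustrative sketch only: assumes each year's value-added estimate equals a
# teacher's stable long-run effect plus independent yearly noise. With a
# year-to-year correlation of r, a single year's estimate then correlates at
# about sqrt(r) with the stable effect, which is what a long career average
# converges to as the noise averages out.
import numpy as np

rng = np.random.default_rng(seed=1)
n_teachers = 200_000  # hypothetical population of teachers

for r in (0.35, 0.60):
    stable = rng.normal(0.0, np.sqrt(r), n_teachers)               # persistent teacher effect
    year_1 = stable + rng.normal(0.0, np.sqrt(1 - r), n_teachers)  # one year's noisy estimate
    year_2 = stable + rng.normal(0.0, np.sqrt(1 - r), n_teachers)  # a second year's estimate
    year_to_year = np.corrcoef(year_1, year_2)[0, 1]
    year_to_career = np.corrcoef(year_1, stable)[0, 1]
    print(f"year-to-year = {year_to_year:.2f}, "
          f"single year vs. long-run effect = {year_to_career:.2f} "
          f"(sqrt(r) = {np.sqrt(r):.2f})")
```

Running it yields correlations of roughly .59 and .77, in line with the range cited above.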
For example, in a forthcoming analysis in three school districts where it was possible to track teachers’ student achievement growth over many years, Doug Staiger and I found that of those teachers who were in the bottom quartile of value-added in a single year, 55 to 65 percent were in the bottom quartile over their careers and 82 to 87 percent were in the bottom half.
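Correlations in the .60 to .77 range imply roughly that kind of persistence. As a purely hypothetical illustration (not the districts’ data), if single-year and career-average value-added were bivariate normal, the conditional proportions can be computed directly:

```python
# Hypothetical illustration, not the districts' data: if single-year and
# career-average value-added were bivariate normal with correlation rho,
# how often would a bottom-quartile year place a teacher in the bottom
# quartile (or bottom half) of career averages?
import numpy as np

rng = np.random.default_rng(seed=2)
n = 1_000_000

for rho in (0.60, 0.77):
    single = rng.normal(size=n)                                       # one year's value-added
    career = rho * single + np.sqrt(1 - rho**2) * rng.normal(size=n)  # career average
    bottom_year = single < np.quantile(single, 0.25)
    share_bottom_quartile = np.mean(career[bottom_year] < np.quantile(career, 0.25))
    share_bottom_half = np.mean(career[bottom_year] < np.median(career))
    print(f"rho = {rho:.2f}: bottom quartile again {share_bottom_quartile:.0%}, "
          f"bottom half {share_bottom_half:.0%}")
```

The simulated shares come out broadly in line with the proportions above, although real value-added data are of course not exactly bivariate normal.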
Therefore, although they are subject to volatility, value-added measures do have predictive power. They do provide information about a teacher’s likely future success with students. Consequently, such evidence does deserve some weight in a supervisor’s decision about whether or not to retain a teacher, even if it is not the sole factor.
On the role of unmeasured student traits
As we all know, students are not randomly assigned to teachers or to schools. The Measures of Effective Teaching study confirmed considerable differences in the baseline achievement of students assigned to different teachers, even within the same schools. Indeed, such tracking persists over multiple school years, as some teachers are assigned higher- or lower-achieving students than their colleagues year after year.
Fortunately, most state data systems make it possible to track individual students’ scores over multiple years and to control for prior student achievement. Indeed, the whole point of “value-added” measures is to control for observed traits such as students’ prior achievement and characteristics.
However, skeptics have raised appropriate questions about whether such controls capture all of the relevant traits used to sort students into different teachers’ classrooms. A frequently cited paper by Jesse Rothstein in the Quarterly Journal of Economics in 2010 correctly points out that selection on such “unobserved” student traits could lead to bias. Do some teachers receive students, year after year, who differ in other ways that are much more difficult to control for?
There have been three new studies in recent years that test that concern directly. And, again, remarkably, there has been little dispute about the findings. One study, by Raj Chetty, Jonah Rockoff, and John Friedman, examined what happened when high value-added or low value-added teachers moved across schools or across grades. If a teacher’s apparent success were due to his or her students (and not to the teacher’s talent and skill), then we should not see scores move when a particularly high value-added (or low value-added) teacher moves between schools or grades. However, they found that scores do move when teachers move. In fact, the magnitude of the changes in achievement is indistinguishable from what we would predict if the value-added measures reflected causal teacher effects.
Two other studies—one involving 79 pairs of teachers in Los Angeles (which I wrote with Douglas Staiger) and the Measures of Effective Teaching study involving 1,591 teachers in six different school districts (which I wrote with Dan McCaffrey, Trey Miller and Douglas Staiger)—randomly assigned teachers to different groups of students within a grade and subject in a school. In both studies, we started from teachers’ effectiveness as measured by value-added in prior years. We then tested whether the teachers who had been identified as more effective by those value-added measures had students who achieved more following random assignment. They did. And, in fact, the differences were statistically indistinguishable from what one would have predicted based on the value-added measures.
We should know even more in the coming months: the Chetty et al. findings are currently being replicated in at least one other school district, and a Mathematica study due out soon examines the impact on student achievement when high value-added teachers were offered bonuses to move and were randomly assigned among a set of schools that had volunteered to hire them.
On the long-term life consequences of high-achievement-growth teachers
Even skeptics who are convinced of the causal effects and predictive power of test-based measures have raised questions about the quality of the tests being used. Multiple-choice tests are inherently limited. Skeptics—and many parents—worry about whether the teachers who generate success on those tests have any long-term positive impacts on children.
There are two new studies that shed light on these concerns. One, by the same team of researchers who studied teachers switching between schools (Chetty, Friedman and Rockoff), tracked teachers’ long-term effects on student earnings. They found that teachers do have such long-term effects. Even more remarkably, a teacher’s impact on students’ future earnings was associated with his or her effectiveness as measured by value-added.
A second study, recently published in the Proceedings of the National Academy of Sciences (PNAS) by Gary Chamberlain, using the same data as Chetty and his colleagues, provides fodder for both skeptics and supporters of the use of value-added: while confirming Chetty’s finding that the teachers who have impacts on contemporaneous measures of student learning also have impacts on earnings and college-going, Chamberlain also found that test scores are a very imperfect proxy for those impacts. Only a fraction of the impact teachers have on earnings and college-going is mediated through their apparent effect on test-based measures.
On the comparative advantages of other measures
We have also learned a lot in the past three years about the alternatives to value-added measures—especially classroom observations and student surveys. Many of the same concerns about reliability and bias due to unmeasured student traits apply to these measures as well. For instance, in the Measures of Effective Teaching project, we learned that even with trained raters, a single observation of a single lesson is an unreliable measure of a teacher’s practice. Indeed, the reliability we saw with single classroom observations (around .4) would have been at the low end of the reliability range for value-added measures. Higher reliability requires multiple observers and multiple observations.
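The arithmetic behind that last point is the standard Spearman-Brown relationship: if scoring errors were independent across lessons and raters (only an approximation in practice), averaging k observations that each have reliability r yields reliability kr / (1 + (k - 1)r). A quick sketch:

```python
# Spearman-Brown prophecy formula: reliability of the average of k observations,
# assuming each single observation has reliability r and errors are independent
# across lessons and raters (an approximation in practice).
def reliability_of_average(r: float, k: int) -> float:
    return k * r / (1 + (k - 1) * r)

for k in (1, 2, 4, 6):
    print(f"{k} observation(s): reliability = {reliability_of_average(0.4, k):.2f}")
```

On that arithmetic, averaging four independent observations would move reliability from roughly .4 to roughly .7, which is one way to see why multiple lessons and multiple raters matter.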
In sum
Reasonable people can disagree on how to include achievement growth measures in teacher evaluations, such as whether the weight attached should be 20 percent or 50 percent, but it is no longer reasonable to question whether to include them. For a number of reasons—limited reliability, the potential for abuse, the recent evidence that teachers have effects on student earnings and college-going that are largely not captured by test-based measures—it would not make sense to attach 100 percent of the weight to test-based measures (or to any of the available measures, including classroom observations, for that matter). But, at the same time, given what we have learned about the causal impacts on students and the long-term impacts on earnings, it is increasingly hard to sustain the argument that test-based measures have no role to play, that the weight ought to be zero. Although that may have been a reasonable position five years ago, when so many questions about value-added remained unanswered, the evidence released since then simply does not support that view.