A recent article in the New York Times describes how the statistical revolution that has swept professional baseball in the last decade has become so pervasive that long-time radio broadcasters are being replaced by announcers who can communicate to fans about advanced statistics. Most education reformers would select Waiting for Superman as their favorite education film. But for those whose passion is boosting teacher quality, the hands-down winner is the 2011 film about the use of statistics in baseball, Moneyball. Indeed, these reformers see a day in which district leaders who don’t understand and use advanced statistics to shape their teacher workforce will be as anachronistic as baseball broadcasters who can’t fluently discuss B.A.B.I.P. (batting average on balls in play).
We are clearly moving in that direction, at least in intent. Spurred on by the Obama administration’s $4.35 billion Race-to-the-Top state grant program and the subsequent No Child Left Behind (NCLB) state waivers, over two-thirds of the states in the nation have made commitments to the federal government to institute teacher evaluation systems that sort teachers into different levels of performance with associated consequences. For example, teachers persistently in the top tier in terms of evaluation scores would be paid more and teachers persistently in the bottom tier would be replaced.
This sounds like a statistical revolution in that decisions are to be based on data rather than politics, intuition, union contracts, or the way it has always been done. And it certainly seems promising compared to existing practice in which teachers are seldom subject to an evaluation and all get passing scores, pay is determined by years on the job rather than performance, and almost no one is ever dismissed for being a bad teacher. But the devil is in the details. A new generation of teacher evaluation systems won’t work without the right data being used smartly.
The Brown Center at Brookings is in the middle of a project in which we are examining the actual design and performance of new teacher evaluation systems in several large urban school districts scattered across the country. We’re asking whether there are significant differences in the design of these systems across districts, whether any such differences have meaningful consequences in terms of the ability to identify exceptional teachers, and whether there are practical ways that districts might improve the performance of their systems. We’ll have a lot more to say about this project later this year. Here I want to share some initial findings that are interesting to me and that may be of use to the many districts and states around the country that are just starting to create, design, and implement new teacher evaluation systems. This is work that our cooperating districts have been at for a couple of years. Lessons learned from them should help those just getting started.
Most of the action isn’t in value-added
You would think from the majority of the media coverage of teacher evaluation and the wrangling between teacher unions and policy officials that the new teacher evaluation systems being implemented around the country are principally about judging teachers based on their students’ scores on standardized tests. “Value-added” is the shorthand for this and has been a bone of contention in almost every effort to replace existing teacher evaluation systems, which declare everyone a winner, with new systems that are designed to sort teachers into categories of effectiveness.
For example, the nine-day teachers’ strike in Chicago in the fall of 2012 was reported as having been driven by teachers’ objection to a proposed “evaluation system that judged them by the test scores of their students.”[i] The Chicago Teachers Union felt it had won a major concession in the final contract because the proportion of a teacher’s evaluation based on test scores was reduced to 30% from the 45% proposed by the City. Similarly, the union representing teachers in New York State stalled the state’s agreement under its NCLB waiver to institute a teacher evaluation system statewide. The main sticking point was whether 40% of a teacher’s evaluation would be based on test score gains of the teacher’s students, as proposed by the state. The final agreement reduced this number to 20%.
In the districts we’re working with, fewer than 20% of teachers can be evaluated based on their students’ test scores. Why? Under NCLB, states have to administer annual tests in language arts and mathematics at the end of grades 3-8. These are the “tested grades and subjects.” Third graders haven’t been tested before the end of third grade. With only a score at the end of the year and no pretest, their gains can’t be calculated. Gain scores can be computed for 4th through 8th graders by subtracting their score at the end of the previous grade from their score at the end of their present grade. But by 6th grade students are in middle school, which means that they have different teachers for different subjects, so their gain scores on mathematics and reading can’t be allocated to a single teacher. Thus only 4th and 5th grade teachers in self-contained classrooms who remain the teacher of record for a whole year can be evaluated based on the test scores of the students in their classrooms. Every other teacher has to be evaluated some other way.
It gets worse if a district makes the reasonable decision to increase the reliability of its evaluation system by requiring at least two years of value-added data on a teacher as the minimum for making a high-stakes decision such as denial of tenure. Because large numbers of teachers move between grades and schools, and in and out of the profession, particularly in big urban districts, the proportion of the teacher workforce that can be evaluated with two years of value-added data may fall to only about 10%.
Returning to Chicago and assuming that no more than 20% of Chicago teachers could be evaluated based on the test score gains of their students, the Chicago strike was about whether test scores would carry at most a weight of .09 (as originally proposed by the City) or .06 (as eventually agreed to by the City and the Union) in the overall evaluation system for all teachers.
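The arithmetic behind those two weights can be sketched in a few lines. This is only an illustration under the assumption stated above, that at most 20% of teachers have usable test score gains; the function and variable names are hypothetical and not part of any district's system.

```python
def overall_weight(share_of_teachers_covered, weight_within_evaluation):
    """Weight of test scores averaged over the whole teacher workforce.

    Teachers without usable test score gains contribute a weight of zero,
    so the overall figure is simply the product of the two proportions.
    """
    return share_of_teachers_covered * weight_within_evaluation


# Assumption from the text: no more than 20% of teachers can be
# evaluated on their students' test score gains.
COVERED = 0.20

proposed = overall_weight(COVERED, 0.45)  # City's original proposal
agreed = overall_weight(COVERED, 0.30)    # final contract

print(f"proposed: {proposed:.2f}")             # proposed: 0.09
print(f"agreed:   {agreed:.2f}")               # agreed:   0.06
print(f"gap:      {proposed - agreed:.2f}")    # gap:      0.03
```

On these assumptions the dispute concerned a difference of roughly .03 in the weight of test scores across the full workforce, and the figure would be smaller still if coverage were below 20%.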
If you like your coffee black, I can see you making a fuss if someone tries to add some milk. But having a bare-knuckle fight over whether it is going to be 6 or 9 drops of milk doesn’t make a lot of sense. Either the parties in these disagreements don’t understand the minor role that value-added can play in teacher evaluation systems given the small proportion of teachers on which it can be calculated, or the war is about something else with value-added simply being a convenient symbol.
The something else is likely whether there will be meaningful evaluation at all. Student test score stats, flawed though they are, happen to provide the best predictions of future teacher performance and later student outcomes that are currently available. Even though student test score gains attributable to individual teachers can only be calculated for a small proportion of the teacher workforce, these stats are the anchor for the rest of the teacher evaluation system. For example, in the Gates Foundation’s Measures of Effective Teaching project, the validity of teacher evaluation scores based on classroom observations is assessed by their correlation with value-added scores from the same teachers. This is also how the Brown Center has previously proposed that the performance of all teacher evaluation systems be evaluated. Test scores are the one component of the evaluation system that has a known property from teacher to teacher, school to school, and district to district. Without it, at least for now, the meaning of any other component of the evaluation system is easily challenged. So it is that those who want to reform teacher evaluation want value-added and those who prefer the status quo don’t. It isn’t about how many drops of milk to add to the coffee, even though it seems to be – it’s about whether there will be any milk at all.
I want milk in the coffee – value-added adds value – but we need to pay more attention to the quality of the coffee itself. Those who advocate for meaningful teacher evaluation should be investing in and fighting for classroom observation systems and other sources of information on teacher performance, including student ratings of teachers, that are good enough to be used in high-stakes decisions about teachers.
The districts we’re working with all use home-grown classroom observation systems that almost surely could be improved, and they’re using processes for collecting classroom observations that differ substantially across districts. For example, some districts have only building principals carrying out classroom observations, others have only master teachers doing this work, and others have a mix. Some conduct six classroom observations a year for each teacher, while others carry out only two. Do these different design decisions have consequences for the performance of the evaluation system? We need to know.
All of the classroom observation systems across the districts with which we are working are one-size-fits-all, which means that the high school algebra teacher is being evaluated on the same generic skill set as the kindergarten teacher. I’m sure that in addition to assessing generic teaching skills we need content-specific and grade-specific observation systems – does the math teacher know how to teach math and does the kindergarten teacher know how to create a classroom environment that is appropriate to 5-year-olds?
There is a lot of work to be done to provide school districts with the building blocks of evaluation systems that are good enough both to withstand the political and legal challenges they will face and to identify exceptional teachers reliably. This is an effort that must be carried out in the trenches. It lacks the glamour of the headline reform, which is replacing everybody-is-a-winner systems with systems that are predicated on there being meaningful differences among teachers in effectiveness. But if we don’t attend to building evaluation systems that work well for all teachers, not just those for whom value-added can be calculated, the headline reform is at risk of failing.