Evaluating Teachers with Classroom Observations: Lessons Learned in Four Districts

The federal government has spurred the creation of a new generation of teacher evaluation systems at the state level through more than $4 billion in Race to the Top funding to 19 states and No Child Left Behind (NCLB) accountability waivers to 43 states. A majority of states have passed laws requiring the adoption of teacher evaluation systems that incorporate student achievement data, but only a handful of states had fully implemented new teacher evaluation systems as of the 2012-13 school year.

As the majority of states continue to design and implement new evaluation systems, the time is right to ask how existing teacher evaluation systems are performing and in what practical ways they might be improved. This report helps to answer those questions by examining the actual design and performance of new teacher evaluation systems in four urban school districts that are at the forefront of the effort to meaningfully evaluate teachers. 

Although the design of teacher evaluation systems varies dramatically across districts, the two largest components of these systems are invariably classroom observations and student test score gains. An early insight from this examination of district teacher evaluation data is that nearly all the opportunities for improvement to teacher evaluation systems are in the area of classroom observations rather than in test score gains. 

Despite the furor, often reported in the media, over assessing teachers based on test scores, in practice only a minority of teachers are evaluated on the test score gains of their students. In this analysis, only 22 percent of teachers were evaluated on test score gains. All teachers, on the other hand, are evaluated based on classroom observation.

Improvements are needed in how classroom observations are measured if they are to carry the weight they are assigned in teacher evaluation. The report’s authors make specific, evidence-based recommendations aimed at improving the fairness and accuracy of teacher evaluation systems. Key findings and resulting recommendations include:

  • Under current teacher evaluation systems, it is hard for a teacher who doesn’t have top students to get a top rating. Teachers whose students have higher incoming achievement levels receive, on average, higher classroom observation scores than teachers whose incoming students are at lower achievement levels, and districts do not have processes in place to address this bias. Adjusting teacher observation scores based on student demographics is a straightforward fix to this problem. Such an adjustment for the makeup of the class is already factored into teachers’ value-added scores; it should be factored into classroom observation scores as well.
  • The reliability of both value-added measures and demographic-adjusted teacher evaluation scores depends on sample size, so these measures will be less reliable and less valid when calculated in small districts than in large districts. Thus, states should provide prediction weights based on statewide data for individual districts to use when calculating teacher evaluation scores.
  • Observations conducted by outside observers are more valid than observations conducted by school administrators. At least one observation of a teacher each year should be conducted by a trained observer from outside the teacher’s school who does not have substantial prior knowledge of the teacher being observed.
  • Including a school value-added component in teachers’ evaluation scores lowers the scores of good teachers in bad schools and raises the scores of bad teachers in good schools. This measure should be eliminated or reduced to a low weight in teacher evaluation systems.
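
To make the first two recommendations concrete, the adjustment they describe might be sketched as below. This is an illustration only, not the report's actual method: it assumes a single class-level covariate (the mean incoming achievement of a teacher's students) and a simple linear fit, where a real system would likely use several demographic variables. The function names `fit_slope` and `adjust_scores` and the sample numbers are hypothetical.

```python
from statistics import mean


def fit_slope(scores, prior_achievement):
    """Least-squares slope of observation score on class prior achievement.

    Hypothetical helper: in the spirit of the second recommendation, this
    could be fitted once on statewide data and shared with districts.
    """
    if len(scores) != len(prior_achievement) or len(scores) < 2:
        raise ValueError("need two or more paired values")
    mean_score = mean(scores)
    mean_prior = mean(prior_achievement)
    sxx = sum((x - mean_prior) ** 2 for x in prior_achievement)
    if sxx == 0:
        return 0.0  # no variation in the covariate, so nothing to adjust for
    sxy = sum(
        (x - mean_prior) * (y - mean_score)
        for x, y in zip(prior_achievement, scores)
    )
    return sxy / sxx


def adjust_scores(scores, prior_achievement, slope):
    """Remove the part of each score predicted by class prior achievement.

    Scores are shifted relative to the average class, so the mean
    observation score across teachers is unchanged.
    """
    mean_prior = mean(prior_achievement)
    return [
        y - slope * (x - mean_prior)
        for x, y in zip(prior_achievement, scores)
    ]


if __name__ == "__main__":
    # Made-up numbers for illustration only (not data from the report).
    prior = [-1.0, 0.0, 1.0]  # class mean incoming achievement (standardized)
    raw = [2.0, 3.0, 4.0]     # raw classroom observation scores
    slope = fit_slope(raw, prior)
    print(adjust_scores(raw, prior, slope))  # prints [3.0, 3.0, 3.0]
```

In this toy case the raw scores track incoming achievement exactly, so after adjustment all three teachers receive the same score. A small district could skip `fit_slope` and pass a slope estimated on statewide data to `adjust_scores`, which is one way to read the recommendation that states supply prediction weights.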