Passing Muster: Evaluating Teacher Evaluation Systems

Steven Glazerman, Dan Goldhaber, Susanna Loeb, Stephen Raudenbush, Douglas Staiger, Grover J. “Russ” Whitehurst, and Michelle Croft

Passing Muster Calculator

The calculator allows users to determine the percentage of the total teacher workforce who can be identified as exceptional based on the characteristics of the teacher evaluation system.


Executive Summary

U.S. public schools are in the early stages of a revolution in how they go about evaluating teachers. In years past there was little more than intuition and anecdote to support the view that teachers vary in their quality. The little data that was available came from ratings of teachers carried out by school principals, a process that typically resulted in nearly all teachers receiving uniformly high ratings. It is nearly impossible to discover and act on performance differences among teachers when documented records show them all to be the same.

A new generation of teacher evaluation systems seeks to make performance measurement and feedback more rigorous and useful. These systems incorporate multiple sources of information, including such metrics as systematic classroom observations, student and parent surveys, measures of professionalism and commitment to the school community, more differentiated principal ratings, and test score gains for students in each teacher’s classrooms. The last of these indicators, test score gains, typically incorporates a variety of statistical controls for differences among teachers in the circumstances in which they teach. Such a measure is called teacher value-added because it estimates the value that individual teachers add to the academic growth of their students.


Value-added has a prominent role in new evaluation systems for several reasons, including a burgeoning research literature that demonstrates that value-added measures predict future teacher ability to raise student test scores better than principal ratings and teacher attributes such as years of experience or advanced coursework. Further, federal law and policy in the George W. Bush and the Obama administrations have incentivized states to develop the assessment systems and databases that allow value-added to be calculated, and to incorporate value-added information as a significant factor in evaluating teacher performance. For example, a commitment to developing new teacher evaluation systems incorporating value-added information was required of states competing for the billions of dollars of Race to the Top funds that were available under the American Recovery and Reinvestment Act during the first year of the Obama administration.

Although much of the impetus for new approaches to teacher evaluation comes from policymakers at the state and national levels, the design of any particular teacher evaluation system falls to the roughly 16,000 school districts and 5,000 independent public charter schools in the U.S. that have the responsibility for developing human resource policies and procedures for their instructional staff.  Because of the immaturity of the knowledge base on the design of teacher evaluation systems and the local politics of school management, we are likely to see considerable variability among school districts in how they go about evaluating teachers—even as most move to new systems that are intended to be more informative than those used in the past.

If an individual state or the federal government wishes to require or incentivize local education agencies to evaluate teachers more rigorously and meaningfully, how can they do so while honoring each district’s authority to do it its own way?  And how can individual school districts benchmark the performance of their teacher evaluation system against the performance of evaluation systems in other districts or against the previous version of their own evaluation system?  In other words, how can teacher evaluation systems be compared, one to another?

This report addresses the comparison of teacher evaluation systems in the context of a particular administrative and legislative challenge: how a state or the federal government could achieve a uniform standard for dispensing funds to school districts for the recognition of exceptional teachers without imposing a uniform evaluation system on those districts. We address and provide practical procedures for determining the reliability of local teacher evaluation systems. We then demonstrate that the reliability of the evaluation system determines the proportion of teachers that a system can identify as exceptional. Thus a school district wanting to accurately recognize the top quartile of teachers as highly effective would only be able to identify some portion of the top 25 percent with confidence, given the lack of perfect reliability in the measures of teacher effectiveness that are deployed by the district. Further, the portion of the top quartile that could be identified would be greater in an identically sized district that has more reliable measures.

Our approach to answering the question of how the federal government or a state could dispense funds to local districts to reward exceptional teachers is that the amount of funding should be scaled to the reliability of each district’s evaluation system.  Thus school districts with more reliable systems would be able to accurately identify a greater proportion of their teachers as exceptional and would receive funding in line with those numbers.  The procedures we propose by which the reliability of a district-level teacher evaluation system could be reported to and evaluated by state or federal officials are straightforward and simple.  They do not necessitate an intrusive federal or state role.

Although we provide a worked solution to a specific administrative challenge, i.e., state or federal funding for district-level recognition programs for exceptionally effective teachers, the underlying approach we offer has more general uses in a variety of circumstances in which decisions have to be made about teachers based on imperfect data. For example, our approach is easily adapted to the district-level task of identifying low-performing teachers for intensive professional development or to the state- or federal-level task of setting minimal standards for the reliability of teacher evaluation systems. Further, by demonstrating how the reliability of non-value-added measures of teacher performance such as classroom observations is an important component of the overall reliability of a teacher evaluation system, our approach provides a spur to the development of multi-faceted methods for evaluating how well teachers are doing their jobs.



The vast majority of school districts in the U.S. presently use teacher evaluation systems that result in nearly all teachers receiving uniformly high ratings.  For instance, a recent study by The New Teacher Project of twelve districts in four states revealed that more than 99 percent of teachers in districts using binary ratings were rated satisfactory whereas 94 percent received one of the top two ratings in districts using a broader range of ratings.[i]  As Secretary of Education Arne Duncan put it, “Today in our country, 99 percent of our teachers are above average.”[ii]  

The reality is far different from what these evaluations would suggest.  We know from a large body of empirical research that teachers differ dramatically from one another in effectiveness.  The failure of today’s evaluation systems to recognize these differences means that human resource decisions are not as productive or fair as they could be if they incorporated data that meaningfully differentiated among teachers.  To put it plainly, it is nearly impossible to act on differences between teachers when documented records show them all to be the same.

A new generation of teacher evaluation systems seeks to make performance measurement and feedback more rigorous and useful. As such, the measures should demonstrate meaningful variation that reflects the full range of teacher performance in the classroom. New evaluation systems typically incorporate several sources of information on teacher performance. For example, the Hillsborough County Public School District in Florida utilizes classroom observations of teacher performance, student ratings of their teachers, direct assessments of teacher knowledge, and student test score gains in each teacher’s classrooms as components of its teacher evaluation system.[iii] The District of Columbia Public Schools evaluates teachers based on student test score growth in each teacher’s classroom, a classroom observation measure, a rating of commitment to the school community, student test score gains for the whole school, and a measure of professionalism that takes into account factors such as unexcused teacher absences and late arrivals.[iv]

Many of these new systems incorporate student test score growth in ways that aim to capture the contribution teachers make toward student achievement.  This contribution is often referred to as teacher value-added.  There are various methods for estimating teacher value-added, but all typically entail some variant of subtracting the achievement test scores of a teacher’s students at the beginning of the year from their scores at the end of the year, and making statistical adjustments to account for differences in student learning that might result from student background or school-wide factors outside the teacher’s control.  These adjusted gains in student achievement, also known as value-added, are compared across teachers.
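The subtract-then-adjust idea described above can be illustrated with a minimal sketch. It is not the model used by any particular district or by this report: it assumes a single adjustment (a least-squares line predicting each student's gain from the beginning-of-year score), and the function name `simple_value_added` and the record layout are hypothetical. Operational value-added models typically include many more controls.

```python
from collections import defaultdict
from statistics import mean


def simple_value_added(records):
    """Toy value-added estimate (illustrative only).

    records: iterable of (teacher, pre_score, post_score), one tuple per student.
    Returns {teacher: mean adjusted gain}, where each student's gain is compared
    with the gain predicted from the student's beginning-of-year score.
    """
    records = list(records)
    pre = [r[1] for r in records]
    gain = [r[2] - r[1] for r in records]  # end-of-year minus beginning-of-year
    pre_bar, gain_bar = mean(pre), mean(gain)

    # Least-squares slope of gain on pre-score (the only "statistical control" here).
    sxx = sum((p - pre_bar) ** 2 for p in pre)
    sxy = sum((p - pre_bar) * (g - gain_bar) for p, g in zip(pre, gain))
    slope = sxy / sxx if sxx else 0.0

    adjusted = defaultdict(list)
    for (teacher, p, _), g in zip(records, gain):
        expected_gain = gain_bar + slope * (p - pre_bar)
        adjusted[teacher].append(g - expected_gain)

    # Average adjusted gain per teacher; positive means above-expected growth.
    return {teacher: mean(values) for teacher, values in adjusted.items()}


if __name__ == "__main__":
    sample = [("A", 40, 50), ("A", 60, 70), ("B", 40, 40), ("B", 60, 60)]
    print(simple_value_added(sample))
```

In this sketch the estimates are relative: a teacher's figure is the average amount by which that teacher's students out-gained or under-gained students with similar starting scores.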

The prominence of value-added in new evaluation systems is a result of several influences.  Among them is the commonsensical view that because the principal role of teachers is to enhance student learning, a central measure of their job performance should be how much their students learn.  Another influence is a burgeoning research literature on teacher classroom effectiveness that has focused on value-added measures and demonstrated that those measures predict future teacher performance, as measured by value-added in subsequent years, better than teacher attributes such as years of experience or advanced coursework. 

The broader reform community in education has taken up the cause of meaningful teacher evaluation grounded in value-added measures of effectiveness.  Incentives for school districts to evaluate teachers based on value-added are central to the Teacher Incentive Fund that was authorized and funded during the George W. Bush administration and also to the Obama administration’s proposed replacement, the Teacher and Leader Innovation Fund.  Further, the Obama administration made a state’s commitment to measuring teacher performance using value-added a requirement for success in the competition for $4.3 billion in the Race to the Top fund.  The appeal of teacher value-added measures is further strengthened by their wide availability as a result of the No Child Left Behind Act’s requirement for testing of all students in reading and mathematics in grades 3-8 coupled with federal funding to states to develop longitudinal data systems to serve as central state repositories of the resulting student assessment data. 

The availability of student test scores allows, within each state, a common and face-valid yardstick for measuring teacher effectiveness across all schools in the state—an attribute that is not present for the other presently available sources of information on teacher effectiveness that individual school districts might employ, e.g., supervisor ratings or classroom observations.  Thus for a variety of reasons value-added measures of teachers’ contribution to student growth have come to be central to popular and powerfully driven efforts to improve U.S. public schools.

Researchers have pointed out that value-added estimates for individual teachers fluctuate from year to year and can be influenced by factors over which the teacher has no control.  We have previously issued a report that describes some of the imperfections in value-added measures while documenting that: a) they provide one of the best presently available signals of the future ability of teachers to raise student test scores; b) the technical issues surrounding the use of value-added measures arise in one form or another with respect to any evaluation of complex human behavior; and c) value-added measures are on par with performance measures in other fields in terms of their predictive validity.[v]  Our report recommended the use of value-added measures as a part of teacher evaluation but in the context of continuous improvement of those measures and awareness of their imperfections and limitations.

The present report offers advice on how to determine the degree to which an evaluation system is successful in the face of those imperfections and limitations.  We address the connection between the reliability of an evaluation system and its ability to accurately identify exceptional teachers for special action, e.g., for a salary bonus if they are exceptionally good or for remedial action or dismissal if they are exceptionally bad.   Reliability is not the only issue arising from the use of value-added measures. In particular, designers of evaluation systems and policymakers have to address biases that are introduced by differences in the contexts in which different teachers work.  However, in this report, we focus on the issue of reliability.

We build our presentation around a proposal we put forth in a previous report, America’s Teacher Corps, calling for the creation of national recognition for teachers deemed effective based on approved state and local evaluation systems.[vi]  The three design features of that proposal were:

  • Promoting the use of teacher evaluation systems to identify and reward excellence

Whereas most of the focus of teacher evaluation systems using value-added has been on the identification and removal of ineffective teachers, we believe that such systems can also have a major impact by identifying and promoting excellence through recognition of exceptionally strong classroom performance. 

  • Flexibility on the components that would need to be a part of a teacher evaluation system and how those components would be weighted

There is no consensus on the degree to which teacher performance should be judged based on student gains on standardized achievement tests. Supporters of test-based measures would seek to expand standardized testing to virtually all grades and subjects and weight the results heavily in personnel decisions about teachers. Opponents question the validity of state assessments as measures of student learning and the accuracy and reliability of value-added indicators at the classroom level. They typically prefer observational measures, e.g., ratings of teachers’ classroom performance by master teachers. Our proposal for a system to identify highly-effective teachers is agnostic about the relative weight of test-based measures vs. other components in a teacher evaluation system. It requires only that the system include a spread of verifiable and comparable teacher evaluations, be sufficiently reliable and valid to identify persistently superior teachers, and incorporate student achievement on standardized assessments as at least some portion of the evaluation system for teachers in those grades and subjects in which all students are tested.

  • Involving a light hand from levels of government above the school district

A central premise of our previous report is that buy-in from teachers and utilization of their expertise are most likely if the design of an evaluation system occurs at a level at which they feel they have real influence.  In most cases this will be the local school district where they work.  We expect wariness from teachers, even with respect to a system intended only to identify and reward excellence, if the design of that system is subject to considerable control from Washington or the state level.  Further, we doubt that there is much of an appetite within Congress for the creation of a federal bureaucracy devoted to the fine-grained oversight of state and local teacher evaluation systems.  And we doubt there is sufficient capacity within state-level education bureaucracies to carry out such oversight even if there is a political will to do so.

Suppose a state or the federal government wanted to fund a program whereby individual school districts could provide a bonus or other rewards to their exceptionally effective teachers. This requires a system of evaluation that meaningfully differentiates among teachers based on their performance.  Similarly, suppose that a state wanted to encourage districts to differentiate the teaching profession so that new teachers started with one set of responsibilities but could be promoted into more complex and challenging roles as they demonstrated capability in the job.  This reform, again, requires evaluations to determine different levels of teaching performance.  Given the great variation in design and quality of district evaluation systems and the practical and political constraints on states or the federal government producing uniformity in those systems, how could state or federal funds for such recognition programs be fairly distributed? 

In this report we address the question of how a state or the federal government could achieve a sufficiently uniform standard for dispensing funds for the recognition of exceptional teachers without imposing a uniform evaluation system on participating school districts. In particular, we address the role of the state or federal government in assessing the reliability of local evaluation systems. We demonstrate that the quality of the measures and the quantity of data affect reliability and determine the number of teachers a system can identify as exceptional. Instead of a school district wanting to recognize the top quartile of teachers being able to identify 25 of every 100 teachers as being in the top 25 percent, we show that when imperfections in the measurement system are taken into account, only some portion of the true top 25 percent can be identified with confidence. Further, that portion would be greater in an identically sized district that has better measures and more data.

Although we provide a solution to what may seem to be a narrowly-focused administrative challenge, i.e., funding a teacher recognition program from the state or federal level, the underlying approach we offer has more general uses to which we will turn in the final section of this report.

How state or federal teacher recognition programs can accommodate district evaluation systems of differing quality

A major source of debate about the methods of estimating teacher performance is the statistical reliability of such measures and whether they are sufficiently precise to support attaching consequences to them such as pay for performance and tenure decisions.[vii]

Our concern in this report is with the reliability of the evaluation system as a whole. We focus on the information that is necessary to determine the extent to which teacher evaluation systems are likely to result in classification errors (e.g., classifying teachers as highly effective when they are not, or failing to classify them as highly effective when they are). For this discussion we will not address the potential problem of systematic bias in which the evaluations for some teachers are systematically too high or too low in comparison to the teachers’ true effectiveness. Clearly, a desirable evaluation system will adjust for differences in the classrooms and schools in which teachers work to reduce or eliminate such biases. We focus here primarily on the reliability (or precision) of the estimates.

No evaluation approach will be exact for all teachers and thus designers and those using the evaluations should consider the implications of the imprecision for the decisions they make.  If designers were to dismiss any evaluation systems that had error in identification, they would have to dismiss all possible systems and end up with no evaluation at all.  Given that error is a fact in evaluation, understanding the implications of this error and how error varies across different approaches to evaluation can be helpful in choosing an effective approach.

In what follows, we describe how policymakers can determine the number of teachers that would accurately and inaccurately qualify to be singled out for special treatment given the power of the system to predict future teacher performance and the level of teacher exceptionality that is the criterion for special treatment.  We will describe how to estimate the extent of misclassification, as well as the average difference in later effectiveness between groups identified by the evaluation.  We also address how the tolerance that policymakers are willing to permit for misclassification plays a role in the number of teachers that can be accurately identified as exceptional.  These subjects go to the heart of the issue of the performance of district-level teacher evaluation systems relative to each other, and provide the basis for a solution to the challenge of building a fair system for distributing state or federal funds to support district-level programs for teacher recognition.
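The interplay among predictive power, the exceptionality criterion, and the tolerance for misclassification can be sketched numerically. The code below is an illustration under stated assumptions, not the procedure behind the report's calculator: it assumes that measured and true (future) effectiveness are standard bivariate normal with correlation `rho`, and the function name `identifiable_share` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def identifiable_share(rho, top_share=0.25, confidence=0.5):
    """Share of ALL teachers whose measured score is high enough that the
    probability of truly being in the top `top_share` is at least `confidence`.

    Assumption (not from the report): measured and true effectiveness are
    standard bivariate normal with correlation `rho`, where 0 < rho <= 1.
    """
    if not 0 < rho <= 1:
        raise ValueError("rho must be in (0, 1]")
    if not 0 < confidence < 1 or not 0 < top_share < 1:
        raise ValueError("top_share and confidence must be in (0, 1)")

    true_cutoff = STD_NORMAL.inv_cdf(1 - top_share)  # threshold for "truly exceptional"
    z_conf = STD_NORMAL.inv_cdf(confidence)

    # Given a measured score x, true effectiveness ~ Normal(rho * x, 1 - rho**2).
    # Solve P(true > true_cutoff | x) >= confidence for the smallest qualifying x.
    measured_cutoff = (true_cutoff + z_conf * sqrt(1 - rho ** 2)) / rho
    return 1 - STD_NORMAL.cdf(measured_cutoff)


if __name__ == "__main__":
    for rho in (0.3, 0.5, 0.8, 1.0):
        print(rho, round(identifiable_share(rho, confidence=0.75), 4))
```

Under these assumptions, a perfectly predictive measure lets a district flag the full 25 percent, while a weaker measure or a stricter confidence requirement shrinks the share that qualifies.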

The factors that influence the accuracy of teacher identification systems

Using teacher performance measures to identify teachers for special treatment is, fundamentally, an exercise in prediction.  For example, the use of measures of past performance of novice teachers to decide who will be tenured assumes that the better-performing novice teachers will be better teachers after receiving tenure than would the lower-performing teachers had they been given tenure.  Likewise, the common district-level practice of selecting a small number of teachers as “master teachers” to serve as role models and supports for beginning or struggling teachers involves the implicit assumption that those teachers are persistently high performers who will continue to be stars in the classroom.  Thus the use of teacher evaluation measures to identify different levels of teacher performance in one period as a basis for personnel action nearly always assumes that identification in one period signifies something about how teachers will perform in the future.

Our approach to judging the relative performance of teacher evaluation systems rests on determining their ability to predict future performance.  We propose to judge teacher performance measures based on the degree to which they accurately estimate future teacher performance from past years of teaching, i.e., how reliable they are as a predictor of future effectiveness.   

A more reliable measure is one that will yield similar answers when it is used to take more than one reading of the same phenomenon. We use the correlation of value-added measures from one period to the next as one component of a gauge of reliability, but the degree to which a performance measure in one period predicts performance in the future will depend on both the degree to which the measure is related to true performance and the extent to which true performance is stable from one period to the next.  Differences between the measurement of performance and true performance are considered measurement error.  If there is a significant amount of measurement error such that the performance measure is only loosely related to true performance, we would refer to it as “noisy.”  If on the other hand true performance changes from one period to the next, even a perfect measure of performance in one period will not accurately predict performance in the next period.  There is no reason to think that true teacher performance is completely stable from one period to the next, e.g., teachers who are quite effective one year may encounter problems at home or changed work conditions that lead them to be less effective in a subsequent year, and teachers may become more effective over time as a result of experience (learning on the job) and professional development. 
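The two influences described above, measurement error and instability of true performance, can be separated in a simple worked equation. This is a sketch under classical measurement assumptions that the report does not state: measured performance X_t in period t equals true performance T_t plus an independent error e_t, and the variances of T and e are the same in both periods. All symbols are introduced here for illustration.

```latex
X_t = T_t + e_t, \qquad t = 1, 2

\operatorname{Corr}(X_1, X_2)
  = \frac{\operatorname{Cov}(T_1, T_2)}{\sigma_T^2 + \sigma_e^2}
  = \underbrace{\frac{\sigma_T^2}{\sigma_T^2 + \sigma_e^2}}_{\text{share of variance that is signal}}
    \times
    \underbrace{\operatorname{Corr}(T_1, T_2)}_{\text{stability of true performance}}
```

Under these assumptions the period-to-period correlation of a measure falls short of 1 either because the measure is noisy (the first factor) or because true performance itself changes (the second factor), and the observed correlation alone cannot tell the two apart.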

For an evaluation system to be useful, the true performance of teachers must be sufficiently stable over a period of a few years for predictions of future performance from past performance to be worthwhile. This assumption is buttressed, in the case of value-added measures, by the fact that value-added measures from one period predict student achievement in future periods.[viii]  It is also buttressed by anecdotal evidence that some teachers are simply more effective than other teachers and, as a result, parents work to get their children into these teachers’ classrooms.  For this discussion, we will assume that true performance, while variable from year to year, is stable enough for there to be meaningful differences in the average effectiveness of teachers over time.

In addition to picking up variation in true performance from year to year, any measure of performance will have error, i.e., will be an imperfect reflection of true teacher effectiveness.  However, while all measures have error, some measures are likely to capture enough of the true differences in teacher effectiveness to be useful.  Indeed, the same studies that permit an inference that there is stability in the true performance of teachers also demonstrate that the measures used in those studies are sufficiently reliable to capture at least some of those true performance differences.   

Because we can know neither the precise degree of error in a given measure of performance nor the actual stability of true teacher ability from one period to the next, the correlation of a school district’s measures of teacher performance from one period to the next cannot be judged against an absolute standard. Thus, our approach to judging the quality of evaluation systems must be relative: if we use common yardsticks we can demonstrate that some evaluation systems are more reliable than others and by what degree.

Value-added as the common yardstick 

If we are to judge the quality of teacher evaluation systems relative to each other, there must be some common measure across those systems that is sensitive to true differences in teacher performance. Without such a common measure, the quality of teacher evaluation systems cannot be meaningfully compared across districts.

The focus of this paper is on using such a common measure to assess the reliability of evaluation systems.  However, it is important to keep in mind that reliability is not useful unless a measure also has validity.  To produce valid scores a measure must pick up differences in teacher performance that are important to student learning.  Thus, while a teacher’s height is strongly correlated from one year to the next, can be measured precisely, and is available for every teacher in the country, it would not be a good common measure for our purpose because it does not capture teacher performance.  Similarly, suppose district A’s evaluation system produced scores for individual teachers based on a weighted average of years of teaching experience, route into teaching, certification status, receipt of advanced degrees, and principal ratings of performance on a pass-fail system.  The correlation of these scores for the district’s teachers from one year to the next would be high, i.e., they would be very reliable.  Suppose that district B deployed a very different evaluation system based on classroom observations, value-added test scores, and student surveys.  The year-to-year correlation of evaluation scores would likely be much lower for district B than for district A.  However, that would not mean that district B had the weaker evaluation system.  In fact, the measures used by district A have been shown empirically to be only weakly related to classroom performance whereas those used by district B have a stronger evidence base.  A system for determining whether a district’s evaluation system passes muster in terms of recognizing exceptional teacher performance should not be designed to favor year-to-year reliability disconnected from what is being measured.

If we are to judge the quality of teacher evaluation systems relative to each other, we have to have a common measure or a set of common measures across those systems that are sensitive to true differences in teacher performance. Without such common measures, it is difficult to meaningfully compare the quality of teacher evaluation systems. A number of different measures have at least basic face validity for measuring teacher effectiveness, including: direct measures of teaching such as teachers’ scores on observational protocols of teaching quality; measures of student learning while in a teacher’s classroom such as value-added measures; principals’ ratings; and survey-based assessments of teachers by students and parents.

Currently, value-added measures are, in most states, the only one of these measures that is available across districts and standardized. As discussed above, value-added scores based on state-administered end-of-year or end-of-course assessments are not perfect measures of teaching effectiveness, but they do have some face validity and are widely available. In our analysis below we use value-added as the metric for comparing the quality of evaluation systems; however, we are limited in our goal of comparing systems by the limitations of these measures. As other measures become widely available and as the tests on which the value-added measures are based become better aligned with societal goals, our ability to judge and compare systems of evaluation will improve.

Note that we do not recommend that states or the federal government be prescriptive about the components that districts should include in their teacher evaluation systems or how they should be weighted or how the infor