Catalysts for Learning: Recognition and reward programs in the public schools

Helen F. Ladd
Former Brookings Expert; Susan B. King Professor Emeritus of Public Policy, Sanford School of Public Policy, Duke University

June 1, 1996

Educational policymakers are increasingly trying to hold schools accountable for the academic performance of their students. The most widely publicized approach to school accountability is “school choice.” Choice advocates argue that empowering parents to choose among public (and sometimes private) schools will make schools more competitive and thus more accountable. A second approach, adopted with less fanfare by a number of states and localities, is to set up programs that recognize and reward excellent schools within the public school system. Recognition and reward programs provide signals and incentives much like those envisioned for a choice-based school system, with the purpose of improving all schools in the system.

In 1984 South Carolina embarked on a major education reform, which included a statewide program to recognize and reward excellent schools within the public school system. Seven years later, the Dallas Independent School District set up a similar program as the centerpiece of education reform.

Though these two recognition and reward programs differ in some important respects, they both measure student achievement and then rank schools by how effectively they increase student performance. Both then give the most successful schools financial rewards–direct bonuses for teachers and other school staff in Dallas, discretionary school funds in South Carolina. In both, the least effective schools are subject to increased scrutiny and attention aimed at improving them as well.

Are the incentives working? It is too soon to hazard a final answer, but both programs have now been in effect long enough to provide instructive insights into the strengths and weaknesses of a recognition and reward approach designed to encourage everyone within a school to work together toward the main goal of education–academic learning.

Measuring Performance

The first step in holding schools accountable for student performance is measuring that performance. Few educators today regard standardized tests as a fully adequate assessment tool. They prefer newer forms of assessment that put more emphasis on higher-order thinking and problem-solving skills. Several states, including Kentucky, have moved to more ambitious forms of assessment. But Kentucky’s system, and others like it, consume so much teacher time and are so expensive that they can be implemented in only a few grades. Furthermore, difficulties in comparing the results in one school with those in another make such assessments ill-suited to evolving accountability systems. For the near future, at least, the student performance measures will have to be based largely on the results of standardized tests. The challenge is to improve those tests so that they assess more accurately the advanced skills that students need.

Translating student performance into a fair measure of school performance requires taking into account wide variations in the socioeconomic backgrounds of students in different schools. It is no secret that student achievement is highly correlated with family background, most notably with the education of parents. Equality of Educational Opportunity, the landmark 1966 report on America’s schools by James Coleman, showed that the socioeconomic characteristics of students explain far and away the largest share of measured differences in educational achievement. Measuring school performance without adjusting for variations in students’ backgrounds would wrongly attribute all the blame for poor student performance, or all the credit for good performance, to the schools themselves.

One solution is to shift the focus from the actual test scores of students to the gains in test scores that can be attributed to the operation of the school. But this approach too poses problems. One is technical: how to isolate the component of performance attributable to the school. Accurate measures of key background variables, such as family income or parental education, are typically not available. The measures that are at hand–race or whether the child is eligible for the federally subsidized lunch program–are rough proxies only. The other problem is political. Adjusting test scores by background variables can be misinterpreted as a signal that some groups of students are less able to learn than others and are being held to lower standards.

Measurement Problems

When South Carolina launched its recognition and reward program, state officials decided to rank schools by their students’ performance “gains” rather than by actual scores. To measure gains, they used year-by-year data on the test scores of individual students by subject area. A student’s gain in, say, fifth-grade math was calculated as the difference between her actual fifth-grade math score and the score statistically predicted for her based on her fourth-grade test scores in math and reading. A school’s “gain” index was then calculated as the median value of student gain indexes for all children enrolled in that school for the whole year.
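
To make the mechanics concrete, the following is a minimal sketch of the gain-index calculation in Python. It assumes a simple ordinary-least-squares regression as the prediction model (the article does not specify the state's exact statistical technique), and all column names and scores are hypothetical.

```python
# A minimal sketch of South Carolina's gain-index calculation. The OLS
# prediction model and all data below are assumptions for illustration.
import numpy as np
import pandas as pd

# Hypothetical student-level records: grade-4 math and reading scores,
# the grade-5 math score, and each student's school.
students = pd.DataFrame({
    "school":      ["A", "A", "A", "B", "B", "B"],
    "math_gr4":    [48.0, 55.0, 62.0, 40.0, 45.0, 58.0],
    "reading_gr4": [50.0, 53.0, 60.0, 42.0, 47.0, 55.0],
    "math_gr5":    [52.0, 60.0, 63.0, 46.0, 47.0, 61.0],
})

# Step 1: predict each student's grade-5 math score from her grade-4
# math and reading scores, fitting the regression on all students.
X = np.column_stack([
    np.ones(len(students)),  # intercept
    students["math_gr4"],
    students["reading_gr4"],
])
y = students["math_gr5"].to_numpy()
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients

# Step 2: a student's gain index is her actual score minus the
# statistically predicted score.
students["gain"] = y - X @ beta

# Step 3: a school's gain index is the median student gain among pupils
# enrolled for the whole year (full-year enrollment is assumed here).
school_gain = students.groupby("school")["gain"].median()
print(school_gain.sort_values(ascending=False))
```

Using the median rather than the mean keeps a handful of extreme student gains from dominating a school's index; Dallas, as described below, averaged with the mean.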

State policymakers were disappointed to find that the gain index, which they expected to be bias-free, favored the schools serving the most affluent students. They attempted to counter that bias by dividing the schools into five clusters defined primarily by the socioeconomic backgrounds of their students, as measured by eligibility for free and reduced-price lunches. The most effective schools were judged to be those in the top quarter of each cluster.
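
Under the assumption that the five clusters were formed as quintiles of free-lunch eligibility and that awards went to the top quarter of each cluster by gain index, the fix can be sketched as follows; the school data are invented.

```python
# A sketch of South Carolina's clustering fix: band schools into five
# socioeconomic clusters, then award the top quarter of each cluster.
# Quintile banding and all data are assumptions for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
schools = pd.DataFrame({
    "school": [f"S{i:02d}" for i in range(20)],
    "gain_index": rng.normal(0.0, 1.0, 20).round(2),
    "pct_free_lunch": np.linspace(5, 95, 20).round(0),
})

# Divide schools into five clusters by the share of students eligible
# for free or reduced-price lunch.
schools["cluster"] = pd.qcut(schools["pct_free_lunch"], q=5, labels=False)

# Within each cluster, the most effective schools are those in the top
# quarter of the gain-index distribution.
schools["winner"] = (
    schools.groupby("cluster")["gain_index"].rank(pct=True) > 0.75
)
print(schools[schools["winner"]].sort_values("cluster"))
```

Because winners are chosen within clusters, a high-poverty school competes only against schools serving similar students, not against the most affluent ones.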

Like South Carolina, Dallas based its incentive program on year-by-year student test data. But unlike South Carolina, it used elaborate statistical techniques to purge test scores of the effects of students’ socioeconomic and racial status, including limited English proficiency and whether students are approved for a subsidized lunch. It also took into account the movement of students into and out of a school, as well as overcrowding in certain schools. School gain scores were calculated as the mean of the student gain scores for all students in the school during the year. As a result of all the adjustments, Dallas was able to rank schools with virtually no bias as to the students they served. All schools had an equal chance to win an award and be recognized as effective. But the statistical analysis underlying the adjustments has proved so complicated that few people in the district fully understand it. Indeed, the program’s complexity undermines its incentive effects by making it hard for school officials to make the link between the things they can directly observe or affect and the types of performance needed to win an award.
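
One plausible rendering of such an adjustment, sketched below, regresses current scores on prior scores plus background indicators and treats the residual as the school-attributable gain, averaged by school. Dallas's actual two-stage methodology was considerably more elaborate, and every variable name and value here is hypothetical.

```python
# A simplified stand-in for the Dallas-style adjustment: background
# variables enter the regression, so the fitted values absorb what
# background predicts and the residual is the "purged" gain.
import numpy as np
import pandas as pd

students = pd.DataFrame({
    "school":      ["A", "A", "B", "B", "C", "C"],
    "score_prior": [50.0, 60.0, 45.0, 55.0, 40.0, 52.0],
    "score_now":   [55.0, 64.0, 47.0, 60.0, 46.0, 53.0],
    "free_lunch":  [0, 0, 1, 0, 1, 1],  # approved for subsidized lunch
    "lep":         [0, 0, 0, 1, 1, 0],  # limited English proficiency
})

# Regress current scores on prior scores and background characteristics.
X = np.column_stack([
    np.ones(len(students)),
    students["score_prior"],
    students["free_lunch"],
    students["lep"],
])
y = students["score_now"].to_numpy()
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is the part of performance that neither prior scores nor
# background explain; Dallas averaged these within each school (a mean,
# where South Carolina used a median).
students["adj_gain"] = y - X @ beta
print(students.groupby("school")["adj_gain"].mean())
```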

Table 1 compares the effects of various ways of measuring school performance. The table is based on test score data for all fifth graders in South Carolina for whom fourth-grade test scores were also available. Each row represents a different ranking of schools based on the fifth-grade scores. The numbers in each row are correlations of the school rankings, based separately on reading and math scores, with the percentages of the students at each school who are receiving free lunch or who are black. The closer the correlations are to zero, the freer the ranking is of bias against schools serving those students. In the first row, the school performance measure is simply the school’s average fifth-grade test score. The large negative correlation coefficients indicate a strong bias against schools serving low-income and minority students.

Rows 2, 3, and 4 show what happens when school performance is measured by gains in scores, rather than actual scores. The performance measure in row 2 is the simple change in test scores from fourth to fifth grade. Row 3 approximates the South Carolina methodology, row 4 that for Dallas. In all three rows, the correlation coefficients are much closer to zero than those in row 1, thus providing a compelling case for measuring school performance by test score gains. The Dallas approach is clearly the most bias-free, though, as noted, its complexity and its potentially adverse signals, for example with respect to race, raise problems of another sort.
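
The bias check that Table 1 summarizes can be reproduced in miniature: correlate each candidate performance measure with the percentage of students receiving free lunch and look for coefficients near zero. The figures below are invented to mimic the qualitative pattern the table reports, not the actual South Carolina data.

```python
# Illustrative bias check: a correlation near zero means the performance
# measure is roughly free of bias against schools serving low-income
# students. All numbers are invented.
import numpy as np

avg_score  = np.array([72.0, 68.0, 61.0, 55.0, 50.0])  # mean grade-5 score
gain_score = np.array([ 0.2, -0.5,  0.9, -0.3,  0.1])  # gain index
pct_lunch  = np.array([10.0, 25.0, 45.0, 70.0, 90.0])  # % free lunch

r_avg  = np.corrcoef(avg_score,  pct_lunch)[0, 1]
r_gain = np.corrcoef(gain_score, pct_lunch)[0, 1]
print(f"average score vs. % free lunch: {r_avg:+.2f}")  # strongly negative
print(f"gain index    vs. % free lunch: {r_gain:+.2f}")  # near zero
```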

Neither South Carolina nor Dallas relies solely on the aggregated student test scores in judging school effectiveness. South Carolina considers student and teacher attendance and dropout rates in determining the size of the award that winners receive, and Dallas also takes into account schoolwide measures such as student attendance and promotion and dropout rates.

Incentives and Unintended Effects

The goal of a recognition and reward system is to shape the behavior of school personnel in ways that will increase student learning. But, like any incentive program, it can be gamed or corrupted. It may also lead to narrow “teaching to the test” and, by creating losers as well as winners, may lower teacher morale. Anecdotal evidence from South Carolina and Dallas suggests that these problems are real. But experience in both states also shows how program design can help mitigate them.

One way to reduce the harmful effects of teaching to the test is to link the test closely to an appropriate curriculum. Others are to use more than a single test and to develop better assessment systems. Because Texas has a state curriculum and a statewide assessment tool (the Texas Assessment of Academic Skills, or TAAS), Dallas can tie its accountability system to the state curriculum. Dallas policymakers also use multiple outcome measures, including a nationally normed test, various end-of-year tests at the high school level, and SAT participation rates and scores. South Carolina uses only two tests, one a state criterion-referenced test and the other a nationally normed test, and gives the tests in different years, so its program may be more subject to the problems of teaching to the test than is the Dallas program.

South Carolina’s classification of schools into five groups to determine effectiveness is based in part on variables under the control of school officials. Not surprisingly, anecdotes abound about how schools have tried to affect their placement–and their chances of winning the $15,000-$20,000 award that goes to the most effective schools. In Dallas, concern that schools might try to manipulate the set of students taking the tests has evoked a stern warning that any school trying to do so will be disqualified from the Awards Program. The use of gain scores rather than average scores reduces the likelihood of such manipulation by making it harder to determine which students could be expected to dampen the school’s performance. Manipulation is even more difficult when the test scores are purged of the effects of family background.

In Dallas, teachers and principals in winning schools receive $1,000, secretaries and janitors $500. Given the relatively high stakes, it is not surprising to find outright cheating. At least twice Dallas officials have found evidence that school personnel tampered with tests to try to raise student performance. The existence of the longitudinally matched sample of student test scores helps Dallas officials uncover such flagrant cheating.

School-based incentive programs are not likely to erode teacher morale as badly as merit pay for individual teachers. But morale problems clearly remain. Awards in both Dallas and South Carolina are based on a school’s performance relative to other schools. Thus, teachers can work hard and increase student performance more than in previous years only to find that teachers in other schools did even better. The result can be extreme frustration, anger, and disappointment, especially when few schools are winners. To combat that problem, Dallas introduced a two-tier system in 1994-95, awarding smaller grants to teachers and staff in schools whose performance exceeded predicted performance, thus boosting the share of winning schools from 20 percent to about 50 percent.

Are Recognition and Reward Programs Working?

Have the recognition and reward programs increased student achievement? Because South Carolina’s program was part of comprehensive reform, it is nearly impossible to isolate its effect. But Dallas offers some evidence for evaluating its program.

A carefully controlled comparison of trends in student performance in Dallas on the TAAS with trends in other large Texas cities is encouraging. During 1991-94, Dallas pass rates on the TAAS for seventh graders increased some 10-12 percent more than those in Austin, El Paso, Fort Worth, San Antonio, and Houston. That finding is complicated by the fact that pass rates went up more in Dallas than in the other cities even during 1991-92, when the Dallas program was just being launched. Comparing pass rates between 1992 and 1994 reveals positive effects on the order of 10 percent for Hispanic and white students, but not for blacks. Moreover, Dallas pass rates rose not only relative to the average of the five other large Texas cities, but also relative to specific cities within that group that were engaged in their own significant local reform efforts.
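
The logic of the comparison is a simple difference-in-differences: Dallas's change in pass rates minus the comparison cities' change over the same period. A back-of-the-envelope sketch, with invented figures rather than the actual TAAS results:

```python
# Difference-in-differences arithmetic for the pass-rate comparison.
# The pass rates below are hypothetical.
dallas_1992, dallas_1994 = 45.0, 60.0  # Dallas pass rates (%)
others_1992, others_1994 = 44.0, 49.0  # average of comparison cities (%)

dallas_change = dallas_1994 - dallas_1992  # +15 points
others_change = others_1994 - others_1992  #  +5 points

# The estimated program effect is the difference in these changes.
effect = dallas_change - others_change
print(f"Relative gain in Dallas pass rate: {effect:+.1f} percentage points")
```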

The higher pass rates do not by themselves prove that the program increased student learning. They could simply reflect better test-taking skills. But other measures also suggest that something good is happening in the Dallas school system. For example, attendance rates rose faster and dropout rates fell faster in Dallas than in other large Texas cities.

Of even greater interest is whether incentive programs will lead to more fundamental long-term changes in the schools that would provide a solid base for increased student learning in the future. Do the programs lead teachers to push for and receive better professional development? Do they encourage schools to find better ways to get parents involved or to identify and correct problem areas more quickly? Do they encourage better leadership in the schools? Although it is not yet possible to answer the first two questions, an answer to the last seems forthcoming: the turnover rate of principals in Dallas schools has increased greatly since 1991-92. Given the well-documented importance of school leadership and assuming the new principals are more capable than the old ones, the more rapid turnover could bode well for future performance.

Some Questions Remain

Experience with recognition and reward programs in Dallas and South Carolina offers several clear lessons. First, if such programs are introduced, school rankings must take into account differences in students’ socioeconomic backgrounds. Second, the programs can and must be designed to mitigate certain potentially adverse effects. Third, evidence from Dallas implies that, at least for some groups of students, student performance can be improved.

But uncertainties remain. As the Dallas experience shows, trying to make the ranking system scrupulously fair to all schools generates measurement indexes that are difficult, if not impossible, for most people to understand, thus weakening the incentive effects of the program and undermining confidence in it. The challenge is to develop simpler measures of school performance that nevertheless take appropriate account of the background of the students.

Most experience with recognition and reward programs has been in the South, where teacher unions are not strong. As a result, the programs have been introduced from the top down. Given the key role of teachers in the delivery of education services, their support is crucial for the long-term success of such programs. Yet teachers are understandably nervous about being held accountable for the learning of students who may have little incentive to learn.

Recognition and reward programs force the state or district to compile a great deal of information that can be used to improve the management of schools. The challenge here for managers is both to make productive use of that information and to assure that it is not misused. Indeed, a primary argument for recognition and reward programs is that they improve school management. The danger is that statistical methods suitable for ranking schools may be inappropriately extended to smaller units, such as specific classrooms, where small sample sizes make their use problematic.

School accountability and incentive programs like those in South Carolina and Dallas can serve as useful catalysts for greater student learning. But they cannot stand alone. For schools to respond best to new incentives, teachers will need to have new skills and access to good opportunities for professional development, school officials will need more control over their resources and policies, everyone will need better information about educational strategies that work, and state policymakers will need the capacity and resources to provide support for low-performing schools. Only if they are reinforced by such complementary reforms will recognition and reward programs meet their full potential in improving the nation’s public schools.