Beyond a number: How qualitative accountability can make a difference to schools

The Every Student Succeeds Act (ESSA) challenges states to enhance their accountability systems while overcoming the limits of test-based accountability by requiring that systems include at least one school quality or non-test-based student success indicator. Ten states have adopted a framework for accountability that includes measures of school climate. Some states, including Vermont and the District of Columbia, are experimenting with qualitative evaluation of schools, already used by some districts (including Cleveland, Denver, and Oakland) and several charter management organizations.

Objectives of qualitative review

How can qualitative evaluation and feedback enhance a state’s accountability system? School inspection systems, common in the U.K. and other countries, provide one model for enriching test-based accountability with better information about schools. A well-designed approach to qualitative accountability has the potential to support policymakers in assessing how well a school’s work aligns with its goals, including social and emotional learning and youth development in addition to academic outcomes.

To fully meet this potential, state and district leaders must decide upon the goals of the system and design the metrics and tools accordingly. Systems can focus more heavily on ensuring adherence to known “best practices,” or they can focus more on providing feedback to schools on their own continuous improvement efforts. The system can also provide feedback to district and state leaders in two ways. First, due to its alignment with student learning goals, it can serve as a leading indicator for changes in school performance to channel additional support or resources. Second, it can serve as a tool to evaluate how well district and state supports to schools are working (e.g., targeted improvement efforts such as leadership coaching or additional resources for individualized instruction).

The New York City Department of Education (NYCDOE) is a pioneer of Quality Review (QR) in the United States, and its experience could offer important lessons for other states and districts beginning to experiment with how to design and implement a QR system aligned with their goals. The NYC model shares some features with school inspection systems. QR entails a two- to three-day visit by an outside, experienced educator, who observes classrooms, meets with school stakeholders, and provides feedback to the school on areas of strength and growth. The particular goal of QR in NYC, as part of an autonomy-for-accountability exchange, was to focus on how schools organized themselves as autonomous problem-solving units engaged in continuous improvement. Therefore, the QR standards were broad and based on strategic alignment between a school’s activities and its goals, as opposed to a narrow set of prescriptive rules. The department also sought to create “leading” indicators of changes in student performance.

I conducted an evaluation of QR in NYC to determine how well its system achieves those objectives, especially given the high cost and burden on schools of the reviews.

Does it capture elements of school quality, including measures beyond test scores, which are meaningful to parents, students, and teachers? Do positive scores on QR predict improvements in test scores? Perhaps most importantly, do the review itself and the feedback received drive increases in student learning?

A natural experiment

The questions of how well QR performs on these goals lend themselves to different research methods. Whether more frequent QR drives changes in student performance or teacher practice is a causal question, best answered through a randomized controlled trial or a natural experiment, which the setting provides. Whether QR scores predict changes in student performance is a descriptive question, which I address in the next section.

Due to the high direct and indirect costs of annually reviewing every school, the NYCDOE implemented QR between 2009 and 2014 such that high-performing schools were reviewed every few years and lower-performing schools were reviewed annually. It used schools’ Progress Report scores to determine how often schools would be reviewed. As long as schools could not precisely manipulate their Progress Report score (and all evidence suggests they couldn’t), the determination of whether or not a school receives a QR according to this rule should be nearly as good as random near the score cutoff. Therefore, any “jump” in outcome values at the cutoff can be interpreted as a causal effect of more frequent QR. This enabled me to estimate the causal effects of the reviews using a regression discontinuity design.
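The logic of reading a “jump” at the cutoff as an effect can be illustrated with a minimal sketch. It is not the analysis behind this evaluation: the data are simulated, the function name `local_linear_jump`, the bandwidth, and every number are hypothetical, and the sketch assumes the simplest case, in which every school below the cutoff is reviewed. It fits one straight line on each side of the cutoff and compares the two fitted values at the cutoff itself.

```python
import numpy as np


def local_linear_jump(running, outcome, cutoff=0.0, bandwidth=1.0):
    """Estimate the discontinuity in `outcome` at `cutoff`.

    Fits a separate straight line on each side of the cutoff, using only
    observations within `bandwidth` of it, and returns the fitted value just
    below the cutoff minus the fitted value just above it. (In this sketch,
    schools below the cutoff are the ones reviewed.)
    """
    x = np.asarray(running, dtype=float) - cutoff  # distance from the cutoff
    y = np.asarray(outcome, dtype=float)
    below = (x < 0) & (x >= -bandwidth)
    above = (x >= 0) & (x <= bandwidth)
    # np.polyfit returns [slope, intercept]; the intercept is the fit at x = 0.
    _, below_at_cutoff = np.polyfit(x[below], y[below], 1)
    _, above_at_cutoff = np.polyfit(x[above], y[above], 1)
    return below_at_cutoff - above_at_cutoff


# Simulated schools (hypothetical numbers, for illustration only).
rng = np.random.default_rng(0)
n_schools = 2000
score_gap = rng.uniform(-1.0, 1.0, n_schools)  # score minus the cutoff
reviewed = score_gap < 0                       # below the cutoff: reviewed
true_jump = 0.2                                # effect built into the simulation
outcome = 0.5 * score_gap + true_jump * reviewed + rng.normal(0.0, 0.1, n_schools)

estimate = local_linear_jump(score_gap, outcome)
print(f"estimated jump at the cutoff: {estimate:.3f}")
```

Because the outcome in the simulation rises smoothly with the score everywhere except at the cutoff, the estimate should land close to the built-in value of 0.2. A real analysis would also report standard errors and check sensitivity to the bandwidth.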

The graphs below visualize that jump for two outcomes: math scores, and a measure of teacher practice (survey reports of how well the school measures student progress toward learning goals). The x-axis represents how far above or below the cutoff point for receiving an annual QR each school is, and the y-axis represents each outcome in standard deviation units. Schools just to the left of the cutoff were about 55 percentage points more likely to receive a QR in that year. There seems to be a small increase in test scores and a larger increase in the measure of teacher practice.

Causal effect of more frequent Quality Review on math scores
Credit: Author’s calculations.
Causal effect of more frequent Quality Review on student progress
Credit: Author’s calculations.

What is quality review good for?

Regression-based estimates of these relationships allow for more precise statistical control for other factors, including weighting by the probability of actually receiving a QR and adjusting the window of observations used to estimate the effect. Using these methods, it becomes clear that the most robust relationship is that QR seems to affect practice, as measured by teacher reports of measuring student progress, using data to inform instruction, and working on teams. QR does not have a significant effect on academic outcomes, including test scores and graduation rates, or even on non-academic student outcomes like attendance.
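One common way to account for the fact that falling below the cutoff only raises the probability of a review is to divide the jump in the outcome by the jump in the review rate, often called a fuzzy regression discontinuity or Wald estimate. The sketch below illustrates that idea on simulated data and repeats it for several windows around the cutoff. It is an assumption about the general approach, not a reproduction of the models estimated here: the function names, the 80 and 25 percent review rates (chosen only so the gap is 55 percentage points), the effect size, and the bandwidths are all hypothetical.

```python
import numpy as np


def jump_below_minus_above(x, y, bandwidth):
    """Local linear estimate of y just below x = 0 minus y just above x = 0."""
    below = (x < 0) & (x >= -bandwidth)
    above = (x >= 0) & (x <= bandwidth)
    # np.polyfit returns [slope, intercept]; the intercept is the fit at x = 0.
    _, below_at_cutoff = np.polyfit(x[below], y[below], 1)
    _, above_at_cutoff = np.polyfit(x[above], y[above], 1)
    return below_at_cutoff - above_at_cutoff


def fuzzy_rd_effect(x, reviewed, outcome, bandwidth):
    """Return (effect per review actually received, jump in the review rate)."""
    first_stage = jump_below_minus_above(x, reviewed, bandwidth)   # review rate
    reduced_form = jump_below_minus_above(x, outcome, bandwidth)   # outcome
    return reduced_form / first_stage, first_stage


# Simulated schools (hypothetical numbers, for illustration only).
rng = np.random.default_rng(1)
n_schools = 20_000
x = rng.uniform(-1.0, 1.0, n_schools)                # score minus the cutoff
review_prob = np.where(x < 0, 0.80, 0.25)            # 55-point gap at the cutoff
reviewed = (rng.uniform(size=n_schools) < review_prob).astype(float)
true_effect = 0.30                                   # built into the simulation
outcome = 0.4 * x + true_effect * reviewed + rng.normal(0.0, 0.1, n_schools)

# Re-estimate with narrower and wider windows around the cutoff.
estimates = {}
for bandwidth in (0.25, 0.5, 1.0):
    effect, first_stage = fuzzy_rd_effect(x, reviewed, outcome, bandwidth)
    estimates[bandwidth] = (effect, first_stage)
    print(f"bandwidth {bandwidth:.2f}: review-rate jump {first_stage:.2f}, "
          f"effect per review {effect:.2f}")
```

If the estimated effect stays roughly stable as the window narrows, the result is less likely to be an artifact of the chosen window. Narrower windows use fewer schools, so their estimates are noisier.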

Even though the effects of QR on test scores are small and sensitive to modeling choices, it does appear that the scores schools receive on QR, conditional on receiving a QR due to being below the cutoff, are predictive of changes in test scores. In particular, high scores in the areas of curriculum, pedagogy, leveraging resources, positive learning environment, assessment, high expectations, and youth development are predictive of larger gains in both math and English/language arts scores. Additionally, family communication, teachers collaborating on teams, and systems for improvement were predictive of English/language arts gains.

Ultimately, then, whether these results are good or bad depends upon the relative weight policymakers assign to the different objectives of QR. It seems that QR is successfully serving as a leading indicator for changes in test scores, especially in the areas highlighted above. There is also evidence that the feedback from QR is driving changes in practice. If these changes in practice have inherent value, beyond their potential linkage to increased student learning (e.g., if they promote more socio-emotional learning or other non-tested outcomes), or if the changes in practice take some time to manifest in the tested outcomes, QR could still be worthwhile. These modest, but still positive, outcomes suggest that further experimentation with QR can be a promising way to drive school improvement and provide more information in school accountability. Of course, these benefits need to be weighed against the system’s costs in order for policymakers to make that decision.