Should noncognitive measures be used for teacher accountability?

States recently submitted proposals for new accountability systems aligned with the Every Student Succeeds Act (ESSA). As policymakers review these plans and states implement them, it is essential that they, and we as a full education community, ground these discussions in a broader conversation about the skills we want students to have when they leave school.

College- and career-ready standards under the Common Core and similar frameworks ask students to develop higher-order thinking skills that are widely perceived to be important for short-term success in school and long-term success in the labor market. However, these successes also depend on several other skills that can be taught in school: the ability to interact positively with peers, regulate one's own behavior, adopt a growth mindset, and persevere in the face of difficulty, among others.

ESSA's requirement that states consider at least one “nonacademic” or “noncognitive” factor when developing accountability policy is consistent with this line of thinking. However, we must make sure that the new ESSA accountability plans are also consistent with the measures currently at schools’ disposal, as well as with research findings on those measures’ properties and usability. New evidence that I describe below suggests that these measures are unlikely to be suitable for high-stakes decisionmaking, though they could prove quite useful for enhancing existing school practices around instructional improvement.

Lessons from previous research

Given that the bulk of the work of improving students’ academic performance and noncognitive skills lies in the hands of teachers, it is no wonder that a wave of research has sought to document and test the effect that teachers have on these outcomes. Two key findings stand out. First, teachers vary considerably in their ability to improve students’ academic performance, attitudes, and behaviors, which in turn influences a variety of long-term outcomes, including teenage pregnancy rates, college attendance, and earnings in adulthood. Second, experimental and quasi-experimental research indicates that “value-added” approaches are valid ways to identify effective teachers, at least with regard to teachers’ impact on student test scores. These findings were critically important for Obama-era policies that advocated using student performance data to evaluate teachers and make consequential job decisions.

However, until recently, researchers had not been able to test this latter claim for teachers’ effects on student outcomes beyond test scores. Without rigorous validation studies, it remains unclear whether these teacher effectiveness measures are “biased”: that is, whether they are confounded with the nonrandom sorting of teachers to students, the specific set of students in the classroom, or other factors beyond teachers’ control. Answering this question is essential for the continued rollout of policy under ESSA.

New evidence examining the validity of teacher effects

In new experimental work forthcoming in Education Finance and Policy, I examine bias in teacher effects on students’ noncognitive skills, or what I prefer to call “attitudes and behaviors.” I use a dataset in which students self-reported a range of attitudes and behaviors in class and in which participating teachers were randomly assigned to class rosters within schools. These data allow me to examine two questions: the extent to which teachers vary in their contributions to students’ attitudes and behaviors, and the relationship between non-experimental and experimental estimates of teacher effects on these outcomes, which produces a measure of bias.

I find that teachers have causal effects on students’ self-reported behavior in class, self-efficacy in math, and happiness or engagement in class. Teachers identified as 1 standard deviation (SD) above the mean in the distribution of effectiveness improve these student outcomes by as much as 0.35 SD. In other words, teachers at the 84th percentile in the distribution of effectiveness move the median student up as high as the 63rd percentile in performance on these measures. Weak correlations between teacher effects on different student outcomes indicate that these measures capture unique skills that teachers bring to the classroom.
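The translation from SD units to percentiles above rests on a normal-distribution assumption: a score z standard deviations from the mean sits at the percentile given by the standard normal CDF. A minimal sketch of that arithmetic, assuming normally distributed teacher effectiveness and student outcomes (the helper name `percentile_of` is hypothetical), is below. It gives about 84.1 for a teacher 1 SD above the mean and about 63.7 for a median student moved up 0.35 SD, which the text reports as the 63rd percentile.

```python
from statistics import NormalDist  # standard library, Python 3.8+


def percentile_of(z: float) -> float:
    """Hypothetical helper: percentile rank (0-100) of a score z SDs from the mean,
    assuming the underlying distribution is standard normal."""
    return 100 * NormalDist().cdf(z)


# A teacher 1 SD above the mean in the effectiveness distribution (about 84.1).
teacher_pct = percentile_of(1.0)

# The median student starts at z = 0; a 0.35 SD improvement moves them to z = 0.35
# (about 63.7).
student_pct = percentile_of(0.0 + 0.35)

print(f"teacher percentile: {teacher_pct:.1f}")
print(f"student percentile: {student_pct:.1f}")
```

If outcome distributions are skewed or bounded, as self-reported survey scales sometimes are, these percentile translations are only approximate.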

However, value-added approaches to estimating these teacher effects appear to be insufficient to account for bias in some cases. For example, non-experimental and experimental estimates of teacher effects on students’ self-efficacy in math are correlated around 0.5, falling short of the ideal 1:1 relationship and indicating that the non-experimental measure contains roughly 50 percent bias. Correlations are similar between non-experimental and experimental estimates of teacher effects on students’ happiness or engagement in class. One exception is teacher effects on students’ behavior in class, with correlations quite close to 1.

These findings are not particularly sensitive to different control variables used when predicting teacher effectiveness ratings, including prior achievement, prior attitude or behavior, and demographic characteristics. Given that these are the variables typically available in education datasets, it is not clear that we could easily reduce bias through other approaches.

Implications for policy

Where does this leave policy? Should these measures be used for accountability, despite concerns about bias? On one hand, incorporating skills beyond test scores, and teachers’ ability to improve them, into accountability frameworks acknowledges the importance of these skills.

Building rich administrative datasets that include a range of student outcomes would also help spur a wave of research examining what works in education. Administrative datasets hosted by state and district agencies are the backbone of education policy research, yet to date they have focused almost exclusively on student test scores. Expanding the scope of these datasets is, in my opinion, a worthy goal.

At the same time, using biased measures in high-stakes settings will surely raise concerns among several stakeholders. Even for teacher effects on students’ academic performance, where there is convincing evidence against bias, teachers and some policymakers remain skeptical about the use and fairness of these measures. In turn, teachers may be less likely to use measures of their effectiveness to improve their instruction and, ultimately, student outcomes.

Instead, measures of students’ attitudes and behaviors may be most useful for low-stakes decisionmaking within schools, particularly for instructional improvement. For example, bringing costly but effective development programs such as teacher coaching to scale requires that school leaders know individual teachers’ strengths and weaknesses as well as which teachers require immediate support. Teacher effects on students’ academic attitudes and behaviors provide this information and could allow schools to allocate professional development dollars strategically, as opposed to investing in lower-cost but less-effective programs that reach all teachers.

ESSA’s guidelines for incorporating nonacademic indicators into regular tracking of student and school progress provide an important infrastructure for education research, policy, and practice. To work as intended, policy incentives must align with and not distort the core mission of these efforts: providing districts, schools, and teachers with information that helps them meet the multifaceted needs of all students.