Natural language processing may provide a new perspective on effective teaching

Classroom observations are used in districts across the country to evaluate whether and to what extent teachers are demonstrating teaching practices known to support student engagement and learning. Data generated from classroom observations also give teachers valuable feedback and support their skill development. In some contexts, such as the Washington, D.C., public schools, such information is also used in high-stakes personnel decisions, including whether to retain a teacher.

The challenges of measuring teaching practices

Measuring “good teaching” in consistent and fair ways, however, is not easy. One reason is that teaching is not static. How a teacher teaches varies depending on the instructional content and the goal of a lesson. For example, we might want to see extended discussions among students in some lessons and more teacher-focused instruction when introducing new content. Student-teacher interactions also evolve over the course of a school year. Yet most teachers are observed only once or twice a year. The dynamic nature of instruction makes it challenging to characterize one’s teaching from only a few lessons. Even if these observed lessons can capture one’s “typical” or “average” teaching, feedback based on such information might not be that useful, as what a teacher needs to support students in November may well be distinct from that same teacher’s needs in May.

An additional complication with typical observation policies is that good teaching practices are not all visible. Human observers may find it challenging to keep track of and assess all important instructional interactions in busy classrooms. Even expert, highly trained raters for research projects struggle to keep track of multiple teaching behaviors at the same time, and ultimately prioritize only readily observable aspects of teaching, which might not be the most important for supporting student learning. In practice, districts mainly rely on less-trained, more time-strapped principals to conduct classroom observations or walkthroughs. Time and resource constraints are compounded by the fact that principals tend to use a small range of scores and are reluctant to rate teachers as performing poorly, making it even harder to gauge teaching quality objectively from typical observation policies and practices.

Novel analysis methods may provide a new perspective on effective teaching

Recent technological advances provide a potentially invaluable complement to inherently limited human-based classroom observations. In our newly published paper in Educational Evaluation and Policy Analysis, we set out to test this possibility. Our idea is simple: Since language is at the heart of many teaching interactions, we might be able to leverage the power of computers to analyze the linguistic features of classroom discourse and directly derive measures of teaching practices. If such automated measures prove to have measurement qualities equal or even superior to those of conventional classroom observations, it could be possible to provide far more consistent and ongoing information to teachers about their practice than a human-rater-based system ever could, at lower cost and with greater potential for scale.

Indeed, in many other research fields, text-as-data methods, or natural language processing, have been widely applied to study conversation features, such as those that can improve the success of job interviews, change someone’s opinion by forming persuasive arguments, or address issues related to mental illness. In education, scholars have also successfully applied these methods to study a wide range of topics, including features of productive online learning environments, teachers’ perceptions of student achievement gaps, and strategies that schools adopt in reform efforts. Yet such methods are much rarer in natural classroom settings and in teacher evaluation and improvement efforts.

From classroom transcripts to markers of quality instruction

In our study, we hired professionals to transcribe nearly 1,000 English language arts classroom videos collected during the Measures of Effective Teaching project. These lessons featured 258 teachers teaching 4th and 5th grades in six school districts that mainly serve minority and low-income students. From these transcriptions, we built two types of measures of teacher practices. The first set of measures focuses on patterns of discourse, such as how often a teacher and students take turns in their conversations and how often a teacher uses analytical language (e.g., words that reflect cognitive mechanisms, such as “cause,” “know,” and “hence”). To construct these measures, we primarily used information about language sources (e.g., teachers or students), time stamps, and words and punctuation marks associated with a specific linguistic category. The second set of measures captures more substantive aspects of teaching, such as how well a teacher mirrors her students’ language and how much of classroom discourse is focused on instruction-related topics.
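To make the first type of measure concrete, here is a minimal sketch in Python of how turn-taking and analytical-language counts might be computed from a transcript. It is an illustration under stated assumptions, not the study's actual pipeline: the transcript format (a list of speaker and text pairs), the function name `discourse_measures`, and the three-word list are all hypothetical, with the word list taken only from the examples named above.

```python
import re

# Hypothetical word list: only the three examples named in the article.
# A real linguistic category would contain many more words.
ANALYTIC_WORDS = {"cause", "know", "hence"}


def discourse_measures(turns):
    """Compute simple discourse measures for one lesson.

    turns: list of (speaker, text) tuples in spoken order, where speaker
    is "teacher" or "student" (an assumed transcript format).
    """
    # A turn switch is any change of speaker between consecutive utterances.
    switches = sum(1 for prev, cur in zip(turns, turns[1:]) if prev[0] != cur[0])

    # Collect every lowercase word the teacher says.
    teacher_words = [
        word
        for speaker, text in turns
        if speaker == "teacher"
        for word in re.findall(r"[a-z']+", text.lower())
    ]

    # Share of teacher words that fall in the analytical category.
    analytic = sum(1 for word in teacher_words if word in ANALYTIC_WORDS)
    share = analytic / len(teacher_words) if teacher_words else 0.0

    return {
        "turn_switches": switches,
        "teacher_words": len(teacher_words),
        "analytic_share": share,
    }


# Made-up four-utterance transcript, for illustration only.
transcript = [
    ("teacher", "What do we know about the main character?"),
    ("student", "She is brave."),
    ("teacher", "Good. What might cause her to leave home?"),
    ("student", "The storm."),
]
measures = discourse_measures(transcript)
print(measures)
```

In this toy transcript the speaker changes three times, and 2 of the teacher's 16 words are in the analytical list.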

We then summarized these variables into a few instructional factors. Although teachers might change the way they teach from one lesson to another, as a starting point we averaged each teacher’s measures across lessons to get a portrait of their typical teaching style. Three factors, or types of classroom discourse, emerge: a classroom management format, in which a teacher spends much classroom time establishing routines and managing student behaviors; an interactive instruction format that features many open-ended teacher questions and abundant back-and-forth interaction between a teacher and students; and a teacher-centered instruction format featuring much teacher talk and minimal student participation.
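The averaging step can be sketched as follows. This is a simplified illustration, assuming lesson-level measures are stored as dictionaries with a `teacher` key; the field names, the function `teacher_averages`, and the numbers are hypothetical. It covers only the per-teacher averaging, not the step that summarizes variables into factors.

```python
from collections import defaultdict
from statistics import mean


def teacher_averages(lessons):
    """Average each numeric measure across a teacher's lessons.

    lessons: list of dicts, each with a "teacher" key plus numeric
    measure fields (an assumed layout, not the study's data format).
    """
    # Group lesson records by teacher.
    grouped = defaultdict(list)
    for lesson in lessons:
        grouped[lesson["teacher"]].append(lesson)

    # Average every measure field within each teacher's group.
    averages = {}
    for teacher, rows in grouped.items():
        measure_names = [key for key in rows[0] if key != "teacher"]
        averages[teacher] = {
            name: mean(row[name] for row in rows) for name in measure_names
        }
    return averages


# Made-up lesson-level values, for illustration only.
lessons = [
    {"teacher": "T1", "turn_switches": 40, "analytic_share": 0.10},
    {"teacher": "T1", "turn_switches": 60, "analytic_share": 0.20},
    {"teacher": "T2", "turn_switches": 10, "analytic_share": 0.05},
]
profiles = teacher_averages(lessons)
print(profiles)
```

Averaging this way trades lesson-to-lesson detail for a more stable teacher-level profile, which is why it is described above as a starting point.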

These instructional factors seem intuitive, but do they really capture features of teaching aligned with human raters’ observations? We find that they do. These three classroom formats consistently show alignment with many of the domains and dimensions identified by several popular observation protocols, including the Classroom Assessment Scoring System (CLASS), the Framework for Teaching (FFT), and the Protocol for Language Arts Teaching Observations (PLATO). For example, the classroom management factor has the strongest correlations with the behavior management dimensions in both CLASS and PLATO. The interactive instruction factor is primarily related to the CLASS domain of instructional support, which emphasizes teachers’ use of consistent feedback and their focus on higher-order thinking skills to enhance student learning. The teacher-centered instruction factor, which represents less desirable teaching practices, has negative and statistically significant correlations with instructional dialogue (CLASS), establishing a culture for learning, engaging students in learning, using questions and discussion (FFT), and intellectual challenge (PLATO). To be clear, we are not advocating for a single, optimal allocation of instructional time or a single discourse style. Different approaches likely have different utility across a lesson or school year. These findings reflect the average discourse style across a teacher’s lessons.
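For readers unfamiliar with how alignment of this kind is checked, the sketch below computes a Pearson correlation between a machine-generated factor score and a human rater's score. The numbers are invented for illustration and are not results from the study; the variable names are hypothetical, and the study's own analysis may have used a different correlation procedure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance numerator and the two standard-deviation numerators.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values for four teachers: a teacher-centered factor score and a
# human rater's instructional-dialogue score. Not data from the study.
teacher_centered = [0.2, 0.5, 0.9, 1.4]
dialogue_rating = [4.0, 3.5, 3.0, 2.0]
r = pearson(teacher_centered, dialogue_rating)
print(round(r, 3))
```

A negative value here would mean that teachers scoring higher on the automated factor tend to receive lower human ratings on that dimension, which is the pattern described above for teacher-centered instruction.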

Beyond correlating our machine-generated measures with classroom observations, we also tested whether these instructional factors predict a teacher’s contribution to student achievement. Notably, the teacher-centered instruction factor negatively predicts teachers’ value-added scores computed using SAT-9, a test designed to measure higher-order skills.

New technologies may improve evaluation—and teaching—for future generations

Although our findings show that text-as-data methods are a promising approach to measuring teaching practices, our study is just an initial step toward an automated system to measure and ultimately support quality teaching. The measures we developed are limited, and their precision needs improvement. Even with more refined computer algorithms, voice recognition technologies, and a fuller range of measures, districts and schools wanting to implement a text-based system will first need to invest in the infrastructure that allows them to record, transcribe, and analyze classroom data while preserving the privacy of language data. Only then can they benefit from the proposed methods.

More research is also needed to understand how principals and teachers perceive automated measures and respond to the information they provide. However, our work shows that it is feasible to complement conventional classroom observations using a text-as-data approach.

Once such a system is in place, we can well imagine a world where automated metrics of teaching are produced in real time, and principals and coaches focus their time on helping teachers make sense of the information provided and identifying strategies for improvement. We are not advocating that such measures be used for consequential evaluations of teachers. However, more and better feedback for teachers about their instruction may well be instrumental in ensuring all students have access to consistent, high-quality instruction.