Moving beyond p-values to help school districts make smarter decisions

Imagine that your school district is about to adopt a new ed tech product. The senior decisionmakers in the district see themselves as data-driven leaders, so they consult research on the impact of the new product. Maybe they commission a pilot test of their own. But the research does not provide the clarity you hoped for. The outcomes for students or teachers who used the product look fine, but the analysts in the research office tell you that the impact estimates are not statistically significant. They say something about p-values being too large, the sample being too small, or the test being inconclusive. So you go with your gut, or just ask some teachers if they enjoyed using the product to guide your decision. And you get discouraged and turned off from conducting pilots or using others’ quantitative research to decide what works in the future.

We might take away from this story that the research failed, but what failed was the interpretation of the research. The classical approach that every evaluator learned in graduate school, commonly known as a frequentist approach, involves formally testing whether a null hypothesis can be rejected with overwhelming evidence or, equivalently, constructing an oddly mysterious thing called a confidence interval, which can be a large and unhelpful range of numbers that almost certainly has the right answer in there somewhere.

This default approach often does not answer the questions district leaders are asking. There are other ways to analyze data and present results, and my colleagues and I have shown in an online experiment that simply altering the presentation of results changes the decisions that people make.

Bayesian interpretations of data align with natural inferences from data

What if we presented the same information differently? As one alternative to the commonly used significance test, a Bayesian approach to inference assesses the probability that a particular hypothesis is true, given the data we observe. Bayesian approaches are often more aligned with how people make inferences from evidence, and thus with school- and district-level leaders’ decisionmaking processes. (See this short video for an introduction to Bayesian methods.)

Using a Bayesian approach, we’d first estimate a probability distribution over the possible values of the technology’s effect (the posterior distribution). We then calculate the probability that the effect is larger than a particular threshold. For a pilot of a new ed tech product, we’d report the probability that the new technology has a positive impact relative to the status quo. This conveys that there is uncertainty in our estimate of the effect, but presents that uncertainty in a more intuitive way than notoriously misinterpreted p-values, the most widely used tool for determining statistical significance.
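As a rough illustration of that two-step calculation, here is a minimal Python sketch. It assumes the distribution of plausible effects can be approximated by a normal curve centered on the estimated impact, which is roughly what a flat prior with a normally distributed estimate would give. The function name and the pilot numbers are hypothetical, chosen only for illustration, and real analyses may use richer models.

```python
from statistics import NormalDist


def prob_effect_exceeds(estimate, std_error, threshold=0.0):
    """Probability that the true effect is above `threshold`.

    Assumption: the distribution of plausible effects is
    Normal(estimate, std_error), i.e. a flat prior combined with
    a normally distributed impact estimate.
    """
    posterior = NormalDist(mu=estimate, sigma=std_error)
    return 1.0 - posterior.cdf(threshold)


# Hypothetical pilot: estimated gain of 4 points, standard error of 5.
p_positive = prob_effect_exceeds(4.0, 5.0)                   # any improvement
p_meaningful = prob_effect_exceeds(4.0, 5.0, threshold=2.0)  # at least 2 points

print(f"P(effect > 0) = {p_positive:.2f}")
print(f"P(effect > 2) = {p_meaningful:.2f}")
```

The threshold argument matters in practice: a district facing high transition costs might care less about whether the effect is positive at all and more about whether it clears some minimum worthwhile gain.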

My colleagues and I at Mathematica have been working closely with school districts as we develop the Ed Tech Rapid Cycle Evaluation Coach, a new toolkit designed to help school districts evaluate the effectiveness of the ed tech products they use in their schools. Our team hypothesized that people would make different decisions when the same results were presented under these two alternative framings. To test this, we ran an online experiment based on the example introduced above.

Each participant went through two scenarios, one for math and one for reading, each presenting a new product for which there was some supportive evidence of improvement over business as usual. All participants saw a bar graph with the group averages (see figure below) and a statement describing the average difference: “Your data specialist tells you that on average, the students in the classrooms that used the new MathCoach software scored 10.31 points higher on the year-end tests than the students in the classrooms that used MathTech.”

For one subject, they saw a frequentist interpretation: “The 95% confidence interval of the difference in test scores between the two groups of classrooms ranges from -19.01 to 39.65. The data specialist tells you that because the range of the confidence interval includes 0, they cannot reject the hypothesis that MathCoach has the same effect as MathTech.”

And for the other subject they saw a Bayesian interpretation: “There is a 77% chance that the new technology improves achievement, and a 23% chance that the new software decreases achievement.”
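To see how one set of numbers can support both statements, the sketch below starts from the reported difference and confidence interval in the MathCoach scenario and derives a frequentist and a Bayesian summary side by side. It assumes a normal approximation and a flat prior, which is not necessarily the model behind the figures shown to participants. Under these assumptions the probability of improvement comes out at about 75 percent, close to but not identical to the 77 percent in the scenario.

```python
from statistics import NormalDist

# Figures quoted in the MathCoach scenario.
estimate = 10.31                  # average difference in test scores
ci_low, ci_high = -19.01, 39.65   # reported 95% confidence interval

# Back out the standard error implied by the interval width
# (assumption: the interval is estimate +/- 1.96 standard errors).
z95 = NormalDist().inv_cdf(0.975)
std_error = (ci_high - ci_low) / (2 * z95)

# Frequentist summary: does the interval include 0, and what is the
# two-sided p-value for "no difference"?
ci_includes_zero = ci_low <= 0.0 <= ci_high
z_stat = estimate / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Bayesian summary (assumption: flat prior, normal approximation):
# probability that the new software improves achievement.
p_improves = 1 - NormalDist(mu=estimate, sigma=std_error).cdf(0.0)

print(f"Implied standard error: {std_error:.2f}")
print(f"Interval includes 0: {ci_includes_zero}, p-value: {p_value:.2f}")
print(f"P(improves): {p_improves:.2f}, P(decreases): {1 - p_improves:.2f}")
```

Nothing about the data changes between the two summaries. Only the question being answered changes: "can we rule out no difference?" versus "how likely is it that the new product is better?"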

After each subject-specific scenario, we asked whether the participant would keep the existing technology, switch to the new one, or needed more information to decide. (For more information about the experiment, see our slides from a recent presentation at the Society for Research on Educational Effectiveness conference.)

Based on our experiment, people do make different decisions: participants are substantially and statistically significantly more likely to switch to the new technology when they see the Bayesian interpretation of the data, choosing the new technology 64 percent of the time versus 31 percent of the time when they see the more conventional frequentist framing. The underlying evidence is the same, so it’s interesting and perhaps alarming that the decisions are so different depending on the framing.

Different interpretation leads to different actions

Why does this matter? As schools and districts aim to use evidence and adopt evidence-based practices, one likely scenario is that current programs remain in place unstudied while new programs are held to higher evidence standards. This gives current programs an advantage, and that advantage is intensified if the default presentations of evidence nudge decisionmakers toward inaction.

In this case, we chose to present findings that are weakly positive. With my set of preferences and my interpretation of the scenario, switching is the right choice, but it would not be the right choice for everyone. Whether it’s the right decision depends on many factors, including transition costs and local preferences. To support good decisionmaking, it’s important to consider exactly how evidence will be interpreted under different framings and to strive to provide rigorous, relevant, and actionable information, tailored to the questions at hand.

This initial experiment is just a start in this direction. Our goal was to see whether there was any support for our hunch that simply changing the presentation can influence how evidence is translated into decisions. This study has many limitations: we used a convenience sample obtained through Amazon’s Mechanical Turk, used a particular set of findings that implied that switching is the “right” choice, and tested only a few different presentations. Further work will examine different ways of presenting both traditional and Bayesian results and will vary the magnitude of the differences between groups and the level of confidence in the inferences.

Our experiment focused on the ability of Bayesian inference to provide results in intuitive probability statements. Bayesian approaches also provide a framework for rigorously combining evidence from multiple sources, can improve inference and precision in small samples, and can be used adaptively to learn more quickly what works and for whom in prospective studies (see this video for an overview of adaptive randomization). These methods should be part of our toolkit as we work to provide educators and policymakers with rigorous evidence, and I’m excited that Mathematica is investing in understanding how and when to best use them.
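For readers curious about what combining evidence from multiple sources can look like, here is a minimal sketch of one textbook case: a normal prior summarizing earlier studies, updated with a normally distributed estimate from a small pilot. The function name and all numbers are hypothetical, and this is only one simple way to combine evidence, not a description of any specific Mathematica method.

```python
from statistics import NormalDist


def combine_normal(prior_mean, prior_sd, estimate, std_error):
    """Precision-weighted update of a normal prior with a normal estimate."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * estimate) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd


# Hypothetical inputs: earlier studies suggest a gain of about 5 points
# (sd 10); a small local pilot estimates 10 points (standard error 15).
post_mean, post_sd = combine_normal(5.0, 10.0, 10.0, 15.0)
p_positive = 1 - NormalDist(mu=post_mean, sigma=post_sd).cdf(0.0)

print(f"Combined estimate: {post_mean:.1f} points (sd {post_sd:.1f})")
print(f"P(effect > 0) = {p_positive:.2f}")
```

The combined standard deviation is smaller than that of either source alone, which is the sense in which prior evidence can improve precision when the local sample is small.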