It seems that everywhere I turn these days, there’s another book or article reproaching our reliance on averages and, by extension, the “gold standard” of scientific research in education: the randomized controlled trial (RCT). The two concepts are intertwined, since RCTs rely on both random assignment and averages (or means), the statistic of importance, in order to minimize bias and maximize our ability to decide whether to reject the “null hypothesis” (the default assumption that our intervention did not make a difference to those who received it).
The idea that individual differences–like “jaggedness”, “context”, and “multiple pathways”–mean more than averages is especially compelling to those (like me) who understand the benefits of personalized learning. If students learn best from differentiated, individualized instruction, then we would likewise know more from individualized or personalized measurement, right?
Unfortunately, it’s more complicated than that. While I recognize that probably no one is, in fact, average, and that RCTs can’t tell us everything, I also think that we are not all as unique as we would like to believe, and that we can learn from each other’s successes and experiences. Thus, I am hesitant to throw the baby out with the bathwater for a few simple reasons, outlined below.
RCTS HAVE EVOLVED
The very first RCTs were described like this:
| Treatment group | O (No treatment) | X (Treatment) | O |
In other words, sample a large group of people, randomly divide them into two groups, give one of these groups your treatment or intervention (“X”), and measure the outcomes of both groups at the end.
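The two-group design described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the outcome model (a baseline score of 50 with a standard deviation of 10), the true effect of 5 points, and the function name `run_simple_rct` are all hypothetical and are not taken from any actual study.

```python
import random
import statistics


def run_simple_rct(n_participants=1000, true_effect=5.0, seed=0):
    """Simulate a posttest-only, two-group RCT and return the estimated effect."""
    rng = random.Random(seed)

    # Random assignment: shuffle the participant IDs and give the
    # treatment ("X") to the first half.
    ids = list(range(n_participants))
    rng.shuffle(ids)
    treated = set(ids[: n_participants // 2])

    treatment_scores, comparison_scores = [], []
    for pid in range(n_participants):
        # Hypothetical outcome model (an assumption): baseline 50, SD 10.
        score = rng.gauss(50.0, 10.0)
        if pid in treated:
            treatment_scores.append(score + true_effect)
        else:
            comparison_scores.append(score)

    # The statistic of interest: the difference between the two group means.
    return statistics.mean(treatment_scores) - statistics.mean(comparison_scores)


estimated_effect = run_simple_rct()
print(f"Estimated effect: {estimated_effect:.2f}")
```

Because assignment is random, the two groups should differ only by chance before treatment, so the difference in means lands close to the true effect built into the simulation.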
At first, random assignment was trusted to remove all biases, and there was less effort made to understand whether any given sample represented an entire population or some subset of the population. Today, our RCT designs are far more sophisticated, with much more attention being paid to fidelity (what is X, and are we implementing it the way we intended?) as well as the counterfactual (what is O, and how is it different from X, if at all?). In fact, we now commonly refer to this counterfactual as a comparison group, rather than a control group, to acknowledge that this condition often is not actually “the absence of treatment.” We are now more careful to understand the extent to which random assignment failed to remove all possible group differences, and we also pay attention to the possibly biasing effects of attrition. We also sample more carefully, and spend more time understanding just which population our sample represents. All in all, the term “RCT” today refers to a much more nuanced sampling procedure and study design than it did at first, and because of this, we document and pay more attention to internal and external validity, improving our ability to understand causal relationships and generalize them.
AVERAGES HELP SEPARATE SIGNAL FROM NOISE
In the social sciences, the outcomes we are most interested in are difficult to measure. We cannot precisely or directly measure cognitive processes in the brain, like learning. Therefore, without averages, we cannot estimate which differences in outcomes are due to error in our measures, or chance, and which differences are real. As much as we would love to focus on outliers and determine what is causing their anomalous results, the fact is we cannot separate the signal from the noise if we focus our research entirely on the individual level. Even single-case studies, which are designed to follow an individual, rely on replication in multiple individuals before claiming true relationships between treatments and outcomes.
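A small simulation can make the signal-versus-noise point concrete. The sketch below assumes, purely for illustration, that every learner has the same true gain of 2 points and that each observed gain carries measurement error with a standard deviation of 10; the function names are hypothetical.

```python
import math
import random
import statistics


def standard_error(scores):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


def simulate_observed_gains(n_learners, true_gain=2.0, noise_sd=10.0, seed=1):
    """Each observed gain is the true gain plus random measurement error."""
    rng = random.Random(seed)
    return [true_gain + rng.gauss(0.0, noise_sd) for _ in range(n_learners)]


for n in (4, 100, 2500):
    gains = simulate_observed_gains(n)
    print(f"n={n:5d}  mean gain={statistics.mean(gains):6.2f}  "
          f"standard error={standard_error(gains):5.2f}")
```

Any single observed gain is dominated by the error term, so it says little about the true gain. The mean of many gains is far more stable, because its standard error shrinks roughly in proportion to the square root of the number of learners.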
This is not to say that RCTs can tell us everything we need to know about causal relationships and the efficacy of treatments. It is essential that research goes further to investigate outcomes that are different for certain groups of individuals than for the main or overall group. It is also critical to understand the treatment and measure the conditions that may advantage or disadvantage certain treatments in certain contexts. This is why we are concerned about sampling, and why we measure demographic and other individual characteristics, pretreatment group differences, attrition, fidelity within the treatment group, and the comparison interventions as well. Doing so keeps our understanding of causal relationships nuanced and detailed, so that we do not accidentally paint our findings and interpretations with too broad a brush.
ANALYTIC METHODS HAVE ALSO EVOLVED
The average of today is not your grandparents’ average. Due to advances in analytic methods, we are now able to model many more of the complexities of a single average. Whereas previously we were limited by assumptions of independence (i.e., our math required us to assume that my score on a test was unrelated to my classmate’s score on that same test), now we can recognize and account for multiple levels of interrelatedness (e.g., learners nested within teachers, nested within schools), and we can incorporate multiple contextual factors (mediators and moderators) into our analyses. Practically, what that means for today’s studies is that we have much less of an analytic need to remove or separate outliers when calculating an overall effect. In fact, these analyses can tell us about multiple effects, including the differing effects for different types of learners, or in different learning environments. All in all, analyses have evolved so we are less reliant on a single, overall average.
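One way to see why the old independence assumption matters is the standard design-effect formula for clustered samples, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The sketch below is not a multilevel model; it only illustrates how much information is lost when nested learners are treated as if they were independent. The study size, classroom size, and ICC of 0.20 are assumed values chosen for illustration.

```python
def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_learners, cluster_size, icc):
    """Number of independent learners the clustered sample is 'worth'."""
    return n_learners / design_effect(cluster_size, icc)


# Hypothetical study: 1,000 learners in classrooms of 25, assumed ICC of 0.20.
n_eff = effective_sample_size(n_learners=1000, cluster_size=25, icc=0.20)
print(f"Effective sample size: {n_eff:.0f}")
```

Under these assumptions, 1,000 learners carry about as much information as 172 independent learners would. Analyses that ignore the nesting overstate their precision, which is the problem multilevel methods are designed to address.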
THE GOAL OF RESEARCH IS DIFFERENT THAN THE GOAL OF IMPLEMENTATION
I acknowledge that much of the above has probably been old information for researchers, and frustrating information for practitioners. The truth is, research doesn’t give us the answer, or even an answer. It can only tell us what is most likely to work for most students of a given type most of the time. In other words, research allows us to know what to begin with, so that most learners are supported, and then what and how to individualize for each learner. Because of averages, we won’t have to just throw spaghetti at the wall and see what sticks; we’ll have an idea, for each learner, of what to try first, then next, and next after that.
Ultimately, it is not the case that averages and personalization work in opposition to each other. In fact, averages support our efforts to personalize learning, because it is only through averages and RCTs that we can know what’s most likely to work for individual students. This doesn’t mean that in implementation we stop there. In implementation, we do dig deeper. We measure and individualize our instruction beyond averages, to each student’s needs.
But let’s not trivialize the usefulness and value added by averages and RCTs. We wouldn’t be where we are today without them, and we will not be able to make scalable, sustainable progress in personalizing instruction if we do not continue to conduct, understand, and apply what we learn from them.
The Brown Center Chalkboard launched in January 2013 as a weekly series of new analyses of policy, research, and practice relevant to U.S. education.
In July 2015, the Chalkboard was re-launched as a Brookings blog in order to offer more frequent, timely, and diverse content. Contributors to both the original paper series and current blog are committed to bringing evidence to bear on the debates around education policy in America.