The primary obstacle to faster progress in U.S. education reform is hard to put your finger on, because it’s an absence, not a presence. It is not an interest group or a manifest social problem. It is the infrastructure we never built for identifying what works, and the organizational framework we have not yet constructed for building consensus among education leaders across the country about what is working. Before you roll your eyes at another call for more research by a self-interested researcher, consider the following argument:
1. In education as in medicine, most new ideas will fail.
For the largest pharmaceutical companies, more than 80 percent of Phase II clinical trials failed between 2008 and 2010.[i] Do we have any reason to believe that educational interventions will have a higher success rate? Student learning and the process of adult behavior change in schools are just as complex as the typical disease process—and probably less well understood. It is impossible to anticipate every obstacle and complication. We should anticipate that most new ideas will fail and develop the infrastructure for testing a large number of them on a small scale first.
The concept of a clinical trial (the small-scale deployment of a promising idea, with a comparison group for measuring efficacy) is foreign to education. As with Race to the Top, we tend to roll out reforms broadly, with no comparison group in mind, and hope for the best. Just imagine if we did that in health care. Suppose drug companies had not been required to systematically test drugs, such as statins, before they were marketed. Suppose drugs were freely marketed and the medical community simply stood back and monitored rates of heart disease in the population to judge their efficacy. Some doctors would begin prescribing them. Most would not. Even if the drugs were working, heart disease could have gone up or down, depending on other trends such as smoking and obesity. Two decades later, cardiologists would still be debating their efficacy. And age-adjusted death rates for heart disease would not have fallen by 60 percent since 1980.
Sound far-fetched? That’s exactly how our ancestors ended up practicing bloodletting for 2,500 years of our history. It’s been six years since the Race to the Top initiative, and there’s still no consensus on whether the key ideas behind those reforms are producing progress. How are we ever going to generate momentum for an education reform agenda without systematically testing the various components in limited ways before rolling them out broadly?
2. Even visionaries need evidence to galvanize others.
I have frequently heard practitioners claim, “We don’t need more research. We know what to do, but we don’t have the resources, whether it be funding or time or political courage or regulatory flexibility, to do it.” I’m sure that some of those who make such claims really do have worthwhile ideas they’re championing. However, the collective implication of their dismissal of the need for evidence is a Tower of Babel, in which each leader pursues a personal vision of reform, only to be followed by a successor with a different one. We don’t lack innovation in education. We lack the ability to learn from innovation.
3. Grassroots, small-scale trial and error will never discern the effect sizes we should be expecting.
The latest fad in education reform is to empower small teams of practitioners to seek out solutions within their own settings. It’s an understandable reaction to the high-level policy-driven reforms in education of the past two decades. However, unless those solutions are systematically tested on a larger scale, with plausible comparison groups, those efforts will just add to the confusion.
It is a matter of arithmetic. According to an analysis of a variety of national tests by Black, Bloom, Hill, and Lipsey, the annual measured achievement growth of the average student in grade three onward is a half standard deviation or less. It is the same in math, English, science, and social science. By grade six, the average annual improvement is less than three-tenths of a standard deviation. When a whole year of education and life experience—not to mention physiological changes in the brain—results in changes of less than .3 standard deviations, we should expect much smaller improvements from any classroom changes in a single year.
For example, teachers are on a very steep learning curve during their first few years of teaching while they become familiar with the basics of classroom management, lesson design, and delivery. However, the growth in student achievement associated with such professional learning is less than .08 student-level standard deviations.
It’s probably safe to say that most educational interventions will generate smaller improvements in instruction than the typical teacher undergoes in the first few years of teaching. However, in order to have better than a fifty-fifty chance of detecting a .08 standard deviation improvement, one would need a total sample of 2,000 students (if randomly assigning individual students to treatment and control). The student sample size requirements will be even higher if the experiment involves clusters of classrooms or schools.
If there were 75 students in a treatment group (roughly three elementary school classrooms or the average teaching load of one middle school teacher), the chance of detecting a true impact of .08, that is, of rejecting the hypothesis of no effect, would be roughly seven percent. That is only slightly higher than the probability of a false positive (using a .05 level for the hypothesis test).
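The arithmetic behind these figures can be approximated with a few lines of code. What follows is a minimal sketch, not the calculation used in this article: it assumes a simple normal approximation, equal-sized treatment and control groups, outcomes measured in student-level standard deviations, and no adjustment for prior test scores. The function names and the example cluster size and intraclass correlation are hypothetical, chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_arm(effect_sd, n_per_arm, alpha=0.05, two_sided=True):
    """Approximate power of a z-test comparing two equal-sized groups.

    effect_sd is the true impact in student-level standard deviations.
    Assumption: unit-variance outcomes and no covariate adjustment.
    """
    se = sqrt(2.0 / n_per_arm)   # standard error of the difference in means
    shift = effect_sd / se       # true effect expressed in standard errors
    if two_sided:
        z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        upper = 1 - STD_NORMAL.cdf(z_crit - shift)  # reject in the right direction
        lower = STD_NORMAL.cdf(-z_crit - shift)     # reject in the wrong direction
        return upper + lower
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - shift)


def n_per_arm_for_power(effect_sd, target_power, alpha=0.05):
    """Students per arm needed for target_power in a two-sided test.

    Uses the usual approximation that ignores the wrong-direction tail.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(target_power)
    return ceil(2 * ((z_alpha + z_power) / effect_sd) ** 2)


def design_effect(cluster_size, icc):
    """Kish design effect: how much clustering inflates the required sample."""
    return 1 + (cluster_size - 1) * icc


if __name__ == "__main__":
    print("Power, 75 per arm:", round(power_two_arm(0.08, 75), 3))
    print("Power, 1,000 per arm, two-sided:", round(power_two_arm(0.08, 1000), 3))
    print("Power, 1,000 per arm, one-sided:",
          round(power_two_arm(0.08, 1000, two_sided=False), 3))
    print("Per-arm n for 80% power:", n_per_arm_for_power(0.08, 0.80))
    # Hypothetical values: classrooms of 25 students, intraclass correlation 0.2
    print("Design effect:", design_effect(25, 0.2))
```

Under these assumptions, 75 students per arm yields power of well under ten percent, and a total sample of 2,000 yields power somewhat below one half with a two-sided test and somewhat above one half with a one-sided test. The exact values depend on choices the text does not spell out, such as the direction of the test and whether prior achievement is used as a covariate, so the sketch is best read as showing the order of magnitude.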
The small-scale, atomized, practitioner-driven search for solutions will not yield reliable evidence on its own. Rather, the most promising ideas need to be tested across larger samples of classrooms. Otherwise, we’ll broadcast a cacophony of false positives and false negatives.
4. Federally funded studies are great for informing federal policy decisions, but we need more investments in evidence by state and local decision-makers.
The centralized model of investing in research may work in medicine, where the federal Food and Drug Administration must approve drugs and where there is a vast network of medical journals and professional societies for disseminating the latest findings. However, in U.S. education, state and local governments make most consequential decisions. And those decision-makers are much more likely to pay attention to their own data than to national studies.
Fortunately, the incremental cost of evaluating any new education initiative has dropped dramatically in recent years, as a result of annual testing and investments in state and local longitudinal data systems. However, state and local governments don’t currently have the staff or structures to take advantage of that opportunity. We need new organizational structures for identifying schools or teachers implementing any given strategy (as well as for identifying statistical comparison groups pursuing other strategies). Those networks will be most valuable if they extend beyond a single school district or state. We also need to make it easier for state and local decision-makers to pilot interventions on a small scale and learn quickly whether those interventions are working. Finally, we need to create new venues for state and local leaders to make sense of the latest findings. Until we develop the capacity to systematically test our ideas for reform, we are doomed to continue reinventing the wheel. It is a system failure and it requires a systemic solution.
Currently, the Institute of Education Sciences provides $54 million per year to regional education labs. Those dollars are not having much impact. Instead of funding a national network of 10 regional education labs, suppose those funds were made available on a competitive basis to individual state agencies willing to help districts track the impacts of their own efforts, to identify matched comparison groups for their initiatives (using other students and schools in a state) and to generate reports on impacts. Districts that are deploying similar initiatives—such as professional development for the Common Core or educational software interventions—could band together and evaluate their efforts jointly, thereby increasing their collective statistical power. Every region may have its own regional education lab now—but the model is not working. Once we hit upon effective models for allowing leaders in a few states to test and learn from their own efforts, the model could be deployed nationally.
[i] John Arrowsmith, “Trial watch: Phase II failures: 2008-2010,” Nature Reviews Drug Discovery, May 2011; 10(5): 328-9.