Smart ways to cut back student testing: Data as a flashlight, not a strobe light

Backlash against testing in schools has been building for some time. President Obama’s open letter to America’s parents and teachers and Secretary Duncan’s Testing Action Plan demonstrate that even the most powerful proponents of test-based accountability need to clarify their positions and advocate for caution. These documents make the uncontroversial but important observations that standardized tests must be focused, limited, and of high quality. They also acknowledge that standardized tests and test-based indicators do not capture everything we care about in education.

What’s not said, but should be, is that the pendulum can swing too far in the other direction. There is a right way and a wrong way to dial down testing. The right way is to reduce the redundancy of local tests with high-quality standardized tests that are administered statewide in every grade in key subjects. The extent of duplicative testing was documented extensively in a well-researched recent report from the Council of the Great City Schools. The wrong way is to adopt grade-span testing: reverting to testing only in “terminal” grades for elementary, middle, and high school, such as 5, 8, and 11, or testing in just one subject per year, so that math is tested in one grade and reading in another.

At first blush, grade-span testing sounds like a reasonable idea. It is raised periodically by back-bench legislators and even high-profile politicians like former President Bill Clinton. It dramatically reduces the number of tests students would take over the course of their education. Unfortunately, it also sacrifices the ability to measure students’ growth from year to year in a content area and dramatically reduces what we can know about what works in education.

Earlier this year, Chad Aldeman and Anne Hyslop gave a nice rundown of the problems with grade-span testing. Among other problems they highlighted, grade-span testing fails to reduce the stakes of testing, and actually raises them, because an entire school is judged on a single grade’s performance. It creates stronger perverse incentives to assign teachers strategically to, or away from, the few tested grades. And while it still allows schools to measure achievement gaps, it obscures any progress made toward closing those gaps.

We can take these arguments a step further and note that with grade-span testing, any statistical analysis of the impact of teachers, schools, or policies like professional development and educational technologies becomes impossible, or at best unreliable. Prior-year test scores often explain 55 to 75 percent of the variation in end-of-year test scores, which means that grade-span tests, such as the National Assessment of Educational Progress (NAEP), mostly tell us about learning that accumulated over many years. This makes them useless for localizing performance to the most recent year, which is generally how we interpret the scores.
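To make that statistical point concrete, here is a minimal simulation. The numbers (baseline spread, yearly gains, measurement error) are synthetic, chosen only for illustration and not calibrated to any real assessment. It shows why this year’s score level is dominated by prior years’ learning, and why year-over-year growth is what actually tracks the most recent year:

```python
import numpy as np

# Hypothetical simulation: knowledge is cumulative, so each year's
# score reflects the baseline plus every prior year's gain, plus
# measurement error on the test itself.
rng = np.random.default_rng(0)
n = 10_000                               # simulated students

baseline = rng.normal(0.0, 1.0, n)       # achievement entering the first tested grade
gains = rng.normal(1.0, 0.6, (6, n))     # true learning gain in each of 6 years
noise = rng.normal(0.0, 0.7, (6, n))     # measurement error on each annual test

true_level = baseline + np.cumsum(gains, axis=0)
scores = true_level + noise              # observed annual test scores

# Prior-year scores explain most of the variance in current scores,
# consistent with the 55-75 percent range cited in the text.
prev, curr = scores[-2], scores[-1]
r2 = np.corrcoef(prev, curr)[0, 1] ** 2
print(f"R^2 of this year's score on last year's: {r2:.2f}")

# The score *level* is only weakly related to this year's learning,
# while year-over-year *growth* tracks it much more closely.
latest_gain = gains[-1]
growth = curr - prev
print(f"corr(level,  this year's gain): {np.corrcoef(curr, latest_gain)[0, 1]:.2f}")
print(f"corr(growth, this year's gain): {np.corrcoef(growth, latest_gain)[0, 1]:.2f}")
```

Under these assumed parameters the level-on-level R² lands in the 55–75 percent range, and growth correlates substantially better with the final year’s true gain than the raw level does. A grade-span regime, which never observes `prev`, leaves only the confounded level.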

The Data Quality Campaign likes to invoke the flashlight analogy for data in education: use data as a flashlight, not a hammer. If data can shine light on education problems, going from annual testing in every grade to grade-span testing only is like replacing the flashlight with a strobe light.

Strobe lights are fun at parties, but you wouldn’t want to use one as your only light source in an operating room during delicate surgery.

To understand this problem, it helps to understand the two main ways test data can be used.

The first is diagnostic: placing a student in the next course, assigning remediation, or making a promotion decision. For diagnosis, a test score level at key moments in a child’s life is fine.

The other way test data are used is to measure the impact of teachers, principals, and programs on learning. For this important purpose, there is no way around it: we need to measure growth from year to year. Educators need feedback on the year in question for a specific set of students, not a mishmash of historical trends and demographic differences between different cohorts of students.[1]

What about the problem that moving to grade-span testing is meant to solve, the burden of tests? We can still do something about that within the guidelines issued by the President and the Secretary:

  • Address the root concerns about testing. If children are so nervous that tests are making them ill, we need adaptive tests that select the difficulty of the items to match the student’s level, so no student gets a test that is too hard or too long.
  • Make sure tests are well aligned with content and standards. That way the testing process itself is part of the learning, not an interruption of that learning. Being tested on material is one way to reinforce it, especially if the student and teacher are given timely feedback.
  • Carefully consider the number, duration, and quality of tests. In particular, limit the number of non-standardized tests.[2] One of the shocking findings from the Council of the Great City Schools study was the extent to which districts add their own tests to an existing comprehensive battery of state tests. If locally developed tests cannot be benchmarked and have not been vetted for alignment with curriculum and standards, then they may pose an unnecessary burden on students.

Let’s try to keep using data as a flashlight, not a hammer, for educational improvement, and keep a bright steady light on educational inequities, not a blinding flash.

[1] To measure the effectiveness of what happened in the previous period, we need a baseline. Knowledge is cumulative, so a test score level (as opposed to growth) reflects learning that has taken place over the past year but confounds this with learning in the prior year with a different teacher and the year before that with yet another teacher, and so on, possibly in different schools and a dizzying array of different interventions. Hence the focus on growth, not proficiency or test score levels. If we want to know if the school meals program and phys ed program are reducing obesity, we wouldn’t just weigh kids at the end of the year, we’d weigh them before and after.

[2] “Standardized tests” are those administered to all students with the same content under the same conditions. Some use the term interchangeably with “multiple choice” tests, but multiple choice is just one item format. It is possible to have constructed-response or even performance assessments that are standardized, just as multiple-choice tests are not necessarily standardized.