2017 Brown Center Report on American Education: International assessments

Editor's note:

The following is Part I of the 2017 Brown Center Report on American Education.

The United States participates in two international assessments: the Program for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS). The latest scores for both tests, conducted in 2015, were released in December 2016. PISA is given to 15-year-olds every three years and TIMSS to fourth and eighth graders every four years, so the two tests coincide every 12 years. 2015 was such a year. In addition to allowing a comparison of U.S. students with students in other countries, PISA and TIMSS join the National Assessment of Educational Progress (NAEP) in providing the only valid estimates of U.S. national academic performance. The three tests use similar sampling designs, allowing results from tested samples to be generalized to an entire nation.

PISA was first administered in 2000. Reading literacy was the major subject assessed that year, followed by mathematics literacy in 2003 and science literacy in 2006. Table 1-1 displays U.S. scores for all PISA assessments between 2000 and 2015. Like TIMSS, PISA is scaled with an international mean of 500 and standard deviation of 100. U.S. scores have been flat over the 15 years of PISA. Reading literacy scores have hovered near the international mean, ranging between 495 (in 2003) and 504 (in 2000). Mathematics literacy scores have come in below the mean, with U.S. scores ranging between 470 and 487. The most recent mathematics score, 470, is the lowest for the U.S. in the test’s history. Science literacy scores have also fluctuated within a tight range, 489-502. The lack of an asterisk next to the 2015 scores means that none of the three PISA subjects registered a statistically significant change between the year when they were introduced and 2015. The 2015 math score is statistically significantly lower than the scores of 2009 and 2012, however.

Source: PISA 2015, Table I.4.4a (Reading); Table I.2.4a (Science); Table I.5.4a (Math).
Note: Technical difficulties rendered the U.S. 2006 scores in reading unreportable.

TIMSS fourth grade scores are shown in Table 1-2. Compared to the PISA scores, the U.S. performs better on TIMSS, both in terms of absolute levels and in gains over time. Math scores have stayed solidly above the international mean of 500 for the entire 20-year period of 1995-2015, and the latest score of 539 represents a statistically significant gain from the score of 518 in 1995. Science scores have held in a narrow channel, from 536 to 546. The gain of four points from 1995-2015 is not statistically significant.

Source: Highlights From TIMSS and TIMSS Advanced 2015, NCES, Figure 2a. (Math); Figure 6a. (Science)
Note: A “*” indicates a statistically significant change between the 1995 score and the 2015 score (p<0.05).

On the eighth grade TIMSS, the U.S. notched statistically significant gains in both math and science over the course of TIMSS history (see Table 1-3). Math scores rose from 492 to 518. Science scores rose from 513 to 530. In contrast to PISA, the U.S. performs significantly above the international mean in math and science, with scores on the upswing. The PISA-TIMSS difference is especially surprising when one considers that the 15-year-olds taking the PISA exam and the eighth graders taking TIMSS are not far apart in their school careers. About 70% of students in the PISA sample are in the fall semester of their sophomore year (10th grade) of high school. The TIMSS eighth grade sample is tested in the spring. At least seven out of 10 examinees in the PISA sample, then, have had the entire ninth grade, a couple of months each of eighth and 10th grade, and two intervening summers since they were eligible for the TIMSS sample. That is not a lot of schooling to differentiate the two groups.

WHY THE HANDWRINGING? INTERNATIONAL COMPARISONS.

According to PISA, U.S. school performance has been flat for 15 years. TIMSS paints a rosier picture, with significant gains in fourth and eighth grade mathematics and eighth grade science. A flat to positive trend does not seem to justify handwringing over U.S. performance. Yet handwringing about how the U.S. does on international tests contends with baseball as a national pastime. Former Secretary of Education Arne Duncan called the 2009 PISA results “an absolute wakeup call,”¹ a comment that seemed strangely ahistorical at the time considering the U.S. scored much worse—11th out of 12 countries—in the First International Mathematics Study (FIMS), administered more than four decades earlier in 1964.²

The despair arises from how the U.S. compares to economically developed countries in Europe and Asia. Despite gains on TIMSS, the U.S. still scores far below the top performers. Singapore provides a good comparison because it scored the highest on the 2015 TIMSS math and science assessments at both grade levels (see Table 1-4). The TIMSS scale theoretically runs from 0-1,000, but as an empirical matter, scores range from the 300s to the 600s. As shown in Table 1-4, the U.S. lags Singapore by at least 44 points (fourth grade science)—and by 103 points in eighth grade math! That is a full standard deviation. The good news is that the U.S.-Singapore eighth grade math gap has narrowed since 1995 (when it was 117 points); the bad news is that it will take, at this pace, more than 140 years to close it completely.

Source: Highlights From TIMSS and TIMSS Advanced 2015, NCES, Figure 1a (4th grade math); Figure 1b (8th grade math); Figure 5a (4th grade science); Figure 5b (8th grade science).
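The "more than 140 years" figure follows from simple arithmetic on the eighth grade math gap quoted above. The sketch below reproduces it under one loud assumption: that the gap keeps shrinking linearly at the 1995-2015 rate. The function name `years_to_close` is illustrative only and does not come from the report.

```python
# Figures quoted in the text: U.S.-Singapore gap in TIMSS eighth grade math.
GAP_1995 = 117                 # points
GAP_2015 = 103                 # points
YEARS_ELAPSED = 2015 - 1995    # 20 years between the two assessments
SCALE_SD = 100                 # standard deviation of the TIMSS scale


def years_to_close(gap_start, gap_end, years_elapsed):
    """Years needed to erase gap_end, ASSUMING the gap keeps
    narrowing linearly at the rate observed over years_elapsed."""
    closed_per_year = (gap_start - gap_end) / years_elapsed
    if closed_per_year <= 0:
        raise ValueError("the gap is not narrowing")
    return gap_end / closed_per_year


years = years_to_close(GAP_1995, GAP_2015, YEARS_ELAPSED)
gap_in_sd = GAP_2015 / SCALE_SD

print(f"Years to close the gap at this pace: {years:.0f}")   # prints 147
print(f"2015 gap in standard deviations: {gap_in_sd:.2f}")   # prints 1.03
```

The gap closed by 14 points in 20 years, or 0.7 points per year, so erasing the remaining 103 points would take roughly 147 years at that pace.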

Researchers shy away from using rankings in serious statistical analyses of test scores, but they are frequently used in political advocacy, most visibly in media headlines or sound bites. Rankings are simple to understand and conjure up the image of team standings in a sports league. They also can mislead. National scores on TIMSS and PISA are estimates, bounded by confidence intervals that reflect sampling error. Sampling error is not really “error” in the common sense of the word, but statistical noise introduced by inferring national scores from a random sample of test takers. Because every nation’s score is estimated in this way, it cannot be said with confidence that the rankings of participants with overlapping confidence intervals actually differ; they are considered statistically indistinguishable.
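One way to make "statistically indistinguishable" concrete is to test whether the difference between two national scores exceeds what sampling error alone could plausibly produce. The sketch below is a minimal illustration, not the procedure NCES or the testing organizations actually use: it assumes independent samples, applies a two-sided z-test on the difference at the 5% level (rather than checking whether two intervals overlap), and uses made-up scores and standard errors. All system names and numbers are hypothetical.

```python
import math


def compare(ref_score, ref_se, other_score, other_se, z_crit=1.96):
    """Classify another system relative to a reference system using a
    z-test on the score difference (assumes independent samples)."""
    diff = other_score - ref_score
    se_diff = math.sqrt(ref_se ** 2 + other_se ** 2)
    z = diff / se_diff
    if z > z_crit:
        return "higher"
    if z < -z_crit:
        return "lower"
    return "indistinguishable"


# HYPOTHETICAL (score, standard error) pairs, not real PISA or TIMSS results.
reference = (500.0, 3.0)
others = {
    "System A": (520.0, 3.0),
    "System B": (504.0, 3.0),   # ranks above the reference on raw score
    "System C": (497.0, 3.0),   # ranks below the reference on raw score
    "System D": (480.0, 3.0),
}

tally = {"higher": 0, "indistinguishable": 0, "lower": 0}
for name, (score, se) in others.items():
    tally[compare(reference[0], reference[1], score, se)] += 1

print(tally)
```

In this toy example, Systems B and C sit on opposite sides of the reference in a raw ranking, yet neither differs from it by more than sampling noise, which is why a single rank number can overstate how much is known.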

When new scores are released, the National Center for Education Statistics (NCES) does its best to provide an accurate summary of U.S. rankings on PISA and TIMSS. It does so by describing the relative performance of the U.S. while taking statistical significance into consideration. Table 1-5 presents the 2015 PISA data in a similar fashion. The PISA scores are still disappointing, but not as dramatically so as they initially seem. The reading scores, in particular, illustrate the nebulousness of rankings. The U.S. score in reading is tied for 23rd place, but its true ranking is more complicated than that. When statistical significance is taken into account, 14 systems scored higher than the U.S., 13 scored about the same, and 42 scored lower.

Note: The number in parentheses represents the official ranking of the U.S. on the assessment; a “T” indicates that the U.S. tied for that ranking.

The U.S. also looks better on TIMSS (see Table 1-6) when scores are considered in this context. On the fourth grade TIMSS test in mathematics, the U.S. score is reported as tied for 13th place. More precisely, it scores below 10 systems, is statistically indistinguishable from the scores of nine systems, and is higher than the scores of 34 systems.³ In eighth grade math, the contrast with PISA’s math scores is provocative. Only eight systems outscore the U.S. on TIMSS, compared to 36 countries outscoring the U.S. on PISA math. Five countries that scored significantly lower than the U.S. on TIMSS—Australia, Sweden, Italy, Malta, and New Zealand—scored significantly higher than the U.S. on PISA.

Note: The number in parentheses represents the official ranking of the U.S. on the assessment; a “T” indicates that the U.S. tied for that ranking.

NATIONAL TEST SCORE CORRELATIONS: TIMSS AND PISA

Previous Brown Center Reports have discussed key differences between TIMSS and PISA. TIMSS is grade-based, and PISA is age-based.⁴ TIMSS tests fourth and eighth graders, while PISA tests 15-year-olds. TIMSS is curriculum-based, meaning that it measures how well students have learned mathematics and science as presented in the school curriculum. PISA is a test of how well students can apply what they have learned to solve real world problems (hence “literacy” appended to the common labels for school subjects) and reflects what PISA’s expert committees believe students should know or need to know.⁵

Despite these differences, TIMSS and PISA test scores are highly correlated. Table 1-7 displays the correlation coefficients for 2015 TIMSS and PISA scores. Not surprisingly, all three PISA tests are strongly correlated. The surprise is the magnitude of the correlation of PISA’s reading test with both math (0.91) and science (0.96). The two TIMSS tests are also highly correlated (0.92). And, as expected, PISA’s math scores are highly correlated with TIMSS math scores (0.93), as are PISA science scores with TIMSS science scores (0.94).

Note: N = 27 countries participating in both TIMSS 2015 (eighth grade) and PISA 2015.

Researchers have drawn different implications from these correlations. Economists Eric Hanushek and Ludger Woessmann concluded that the two tests measure “a common dimension of skills,” and that the scores can be aggregated to form a single national-level indicator of cognitive ability predicting economic growth.⁶ Psychologist Heiner Rindermann referred to that common dimension as a “g-factor,” standing for general intelligence. The term touches upon a longstanding debate in psychology. Simply put, the argument is about the extent to which human intelligence is general (smart people are smart about most things) or specific (smart people in math are not necessarily smart in interpreting poetry).⁷

Eckhard Klieme, an educational researcher with intimate knowledge of TIMSS and PISA, examined 2015 data for both TIMSS and PISA math assessments and analyzed the tests’ correlations. Klieme acknowledges that the tests’ cross-sectional scores are highly correlated but he also explores differences. He shows, for example, that the small differences between scores from the two tests can be explained by content coverage, the topics that math teachers reported being taught. Countries in which teachers reported teaching more of the TIMSS content scored higher on the TIMSS test than would be predicted from their PISA score. He also found that gain scores from the two tests were not as strongly correlated, with a 0.61 correlation of PISA and TIMSS gains from 2003 to 2015. That is strong but substantially weaker than the cross-sectional correlations for 2015.⁸

In the current study, a total of 22 systems participated in 2011 and 2015 TIMSS (eighth grade math) and 2012 and 2015 PISA. The correlation coefficient for their TIMSS and PISA math gains is 0.52. That, too, is much weaker than the cross-sectional correlations reported in Table 1-7.
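The contrast between cross-sectional and gain correlations can be illustrated with a short calculation. The sketch below computes a Pearson correlation coefficient twice: once on same-year score levels and once on gains between two rounds. The scores are hypothetical, invented only to show the mechanics, and they are not the 22 systems analyzed here; the helper `pearson_r` is likewise an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL national math scores for six systems (not real data).
timss_2011 = [611, 570, 509, 505, 467, 416]
timss_2015 = [621, 586, 518, 505, 465, 420]
pisa_2012 = [562, 527, 481, 497, 454, 409]
pisa_2015 = [564, 532, 470, 494, 450, 415]

# Cross-sectional correlation: score levels in the same year.
r_levels = pearson_r(timss_2015, pisa_2015)

# Gain correlation: change in each system's score between rounds.
timss_gains = [later - earlier for earlier, later in zip(timss_2011, timss_2015)]
pisa_gains = [later - earlier for earlier, later in zip(pisa_2012, pisa_2015)]
r_gains = pearson_r(timss_gains, pisa_gains)

print(f"levels r = {r_levels:.2f}, gains r = {r_gains:.2f}")
```

Levels are dominated by large, stable differences between systems, so they line up closely across tests. Gains are small by comparison and carry proportionally more noise, so their correlation can be much weaker even when the levels agree.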

CONCLUSION

On the 2015 PISA, the U.S. continued to register mediocre scores, as it has done since PISA began in 2000. The mathematics literacy score of 470 represented a statistically significant decline of 11 scale score points from the 481 scored in 2012, but U.S. performance in all three subjects—math, reading, and science—was not statistically significantly different from how the nation performed when each subject was first administered. TIMSS scores were more encouraging for the U.S., especially at the eighth grade level, where statistically significant gains have been made in both math and science since 1995. Significant gains on TIMSS have also been made in fourth grade math since 1995.

PISA and TIMSS scores are highly correlated. Cross-sectional test scores are often highly correlated when aggregated to the state or national level. It is important to note what these high correlations do not mean. They do not mean that the tests assess the same knowledge or skills; otherwise, countries are wasting a lot of time giving three PISA tests when the PISA reading literacy test is a good tool for measuring achievement in science (r = 0.96) and math (r = 0.91). Tenth graders in the U.S. take an advanced algebra course in mathematics. Imagine administering a reading test to see how well they learned algebra!

Casual observers of international tests should pay close attention to the trends on both tests. As shown above, PISA and TIMSS trend data are not as strongly correlated as the cross-sectional scores. The U.S. is showing steady progress on TIMSS but scores are flat on PISA—even declining in mathematics on the last two rounds.

Comparing the U.S. with other countries must be done with caution. Finland scored among the top countries on PISA in the early 2000s and became a famous destination for American “edutourists” eager to visit Finland’s schools. Since 2006, Finland’s PISA scores have declined dramatically. On TIMSS, fourth grade math scores for Finland (535) and the U.S. (539) are statistically indistinguishable. Speaking in Washington, D.C., in 2010, Angel Gurría, Secretary-General of the Organisation for Economic Co-operation and Development (OECD), called New Zealand a “top flier” and one of the “strongest overall performers.”⁹ And yet, since 1995, New Zealand has consistently scored either at comparable levels or below the U.S. on TIMSS, in both math and science and at both the fourth and eighth grade levels. More importantly, New Zealand’s TIMSS scores have been falling during the last several rounds of TIMSS, while the U.S. scores have been climbing. To get the most value from U.S. participation in PISA and TIMSS, policymakers and the public alike should pay close attention to the trends on both tests.