A note on Simpson’s Paradox and NAEP

A couple of weeks ago, before yours truly joined the blogosphere, the results for NAEP history, geography, and civics were released. Journalists and advocates around the nation reacted with their usual swift condemnation, noting the “flatlining”, “stagnant” performance. And it’s true: overall average scores on the newly released tests had not changed since the previous administration.

A few wise individuals, however, noticed that the scores had continued to increase when broken down by subgroup. Chad Aldeman penned the best defense, invoking Simpson’s Paradox to conclude that achievement is rising, and not by a trivial amount. In this case, Simpson’s Paradox means that the gains by individual subgroups (every subgroup is gaining in these subjects, and the largest gains are going to the historically most underserved groups) are masked when calculating overall averages because the typically lower-performing subgroups are increasing in numbers.
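To see how the masking works mechanically, here is a minimal Python sketch. The two groups, their population shares, and their scores are all made up for illustration; they are not NAEP figures. The only point is that every group can gain while the overall average stays flat, if the lower-scoring group grows as a share of the population.

```python
from fractions import Fraction

# HYPOTHETICAL numbers, chosen only to illustrate the mechanism (not NAEP data).
# Each entry maps group -> (share of test-takers in percent, average score).
earlier = {"group_a": (80, 290), "group_b": (20, 260)}
later = {"group_a": (60, 294), "group_b": (40, 269)}

def overall_average(groups):
    """Share-weighted average score across all groups."""
    total_share = sum(share for share, _ in groups.values())
    weighted = sum(share * score for share, score in groups.values())
    return Fraction(weighted, total_share)

# Gain within each group between the two (hypothetical) administrations.
gains = {g: later[g][1] - earlier[g][1] for g in earlier}

overall_earlier = overall_average(earlier)
overall_later = overall_average(later)

print("gain by group:", gains)
print("overall average:", overall_earlier, "->", overall_later)
```

In this made-up case both groups gain, the lower-scoring group gains more, and the overall average does not move, because the lower-scoring group has gone from 20% to 40% of the population.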

Jay Greene shot back in the comments section of Chad’s piece, arguing that Simpson’s Paradox was not an appropriate excuse here, because minority students are less difficult to educate now than minority students were 30 or 40 years ago, so making comparisons within groups is not necessarily appropriate.

I will actually take a middle ground here and say there is an element of truth to both arguments. This is because, in evaluating whether it’s better to focus on individual subgroups or the overall average in a case of Simpson’s Paradox, I find it useful to consider what the question of interest is.

As an example, consider the case of two airlines (American and United) operating at two airports (O’Hare and LAX). United flies 100 flights out of each airport, with a 55% on-time rating from O’Hare and an 85% rating from LAX (thus, 140 on-time flights out of 200, or 70% overall). American flies 200 flights out of O’Hare with a 60% on-time rating and 50 flights out of LAX with a 90% rating (thus, 120 + 45 = 165 on-time flights out of 250, or 66% overall). Now, if you were buying a ticket based on the aggregate statistics, you would choose United, because it has a higher overall on-time rate. But the overall average in this case is completely useless; it only applies to you if you pick your flights (including your departing airports) completely at random. If, instead, you pick your flights like a normal person by first choosing a departing airport and then choosing an airline, you are always better off choosing American, which has the higher on-time rating at both airports. So in this case, the “subgroup” question is by far the more interesting one, and the “average” question is misleading and worthless.
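For readers who want to check the arithmetic, here is a small Python sketch of the airline example. The flight counts and on-time ratings come straight from the paragraph above; the variable and function names are just my own choices for illustration.

```python
from fractions import Fraction

# airline -> airport -> (number of flights, on-time rate), as given in the example.
flights = {
    "United": {"O'Hare": (100, Fraction(55, 100)), "LAX": (100, Fraction(85, 100))},
    "American": {"O'Hare": (200, Fraction(60, 100)), "LAX": (50, Fraction(90, 100))},
}

def overall_rate(by_airport):
    """Flight-weighted on-time rate across all airports for one airline."""
    total = sum(n for n, _ in by_airport.values())
    on_time = sum(n * rate for n, rate in by_airport.values())
    return on_time / total

# The "average" question: which airline is better over all of its flights?
overall = {airline: overall_rate(a) for airline, a in flights.items()}
best_overall = max(overall, key=overall.get)

# The "subgroup" question: which airline is better at the airport I am using?
best_by_airport = {
    airport: max(flights, key=lambda airline: flights[airline][airport][1])
    for airport in ("O'Hare", "LAX")
}

print({airline: float(rate) for airline, rate in overall.items()})
print("best overall:", best_overall)
print("best by airport:", best_by_airport)
```

The reversal comes entirely from the mix of flights: American flies most of its flights out of the airport where both airlines do worse, which drags its overall rate down even though it beats United at each airport.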

To me, the primary question of interest with respect to NAEP is whether a given kid is likely to be better off now than he or she would have been 20 years ago. This is a subgroup question: we want to compare each kid to the kid he or she would have been if born 20 years earlier. Here, the answer is very clearly yes (with the possible exception of kids in extreme poverty). For all subgroups, NAEP achievement in all subjects continues to increase, as do high school graduation rates.

However, I can see the argument that the main question of interest is how the nation as a whole is doing, in which case it’s not overly relevant if the subgroups are making gains but the national average is not. The argument here basically says “the population is what it is, and we have to deal with that.”

Regardless of one’s view on Simpson’s Paradox in this particular case, I actually remain stunned and impressed by our students’ performance in subjects like geography and civics. Given that these are non-tested NCLB subjects (and thus have certainly seen reduced emphasis in classrooms), I find it remarkable that performance has not only not decreased, but actually has continued to tick up for all kinds of kids. This story, the nuanced version that includes attention to subgroups, is one that certainly needs to be told more often.

2 thoughts on “A note on Simpson’s Paradox and NAEP”

  1. Interesting post. Have the definitions of each of these subgroups changed over time? That is, is an ELL or special needs student classified in the same way? Also, is there consistency in the way race is recorded over time?

  2. Some of the definitions have changed over time, but in general the NAEP data explorer doesn’t let you do comparisons if the definitions have changed. So all of the comparisons discussed here are based on definitions that have not changed.
