My quick thoughts on NAEP

By now you’ve heard the headlines: NAEP scores are down almost across the board, with the exception of 4th grade reading. Many states saw drops, some of them large. Here are my thoughts.

  1. These results are quite disappointing and shouldn’t be sugar-coated. Especially in mathematics, where we’ve seen literally two decades of uninterrupted progress, it’s (frankly) shocking to see declines like this. We’ve come to expect slow-but-steady increases in NAEP scores, and this release should shake that complacency. That said, we should not forget the two decades of progress when thinking about this two-year dip (nor should we forget that we still have yawning opportunity and achievement gaps and vast swaths of the population unprepared for success in college or career).
  2. Some folks are out with screeds blaming the results on particular policies that they don’t like (test-based accountability, unions, charters, Arne Duncan, etc.). Regardless of what the actual results were, they’d have made these same points. So these folks should be ignored in favor of actual research.[1] In general, people offering misNAEPery can only be of two types: (1) people who don’t know any better, or (2) people who know better but are shameless/irresponsible. Generally I would say anyone affiliated with a university who is blaming these results on particular policies at this point is highly likely to be in the latter camp.
  3. To a large extent, what actually caused these results (Common Core? Implementation? Teacher evaluation? Waivers? The economy? Opt-out? Something else?) is irrelevant in the court of public opinion. Perception is what matters. And the perception, fueled by charlatans and naifs, will be that Common Core is to blame. I wouldn’t be surprised if these results led to renewed repeal efforts for both the standards and the assessments in a number of states, even if there is, as yet, no evidence that these policies are harmful.

Overall, it’s a sad turn of events. And what makes it all the more sad is the knowledge that the results will be used in all kinds of perverse ways to score cheap political points and make policy decisions that may or may not help kids. We can’t do anything about the scores at this point. But we can do something about the misuse of the results. So, let’s.


[1] For instance, here are a few summaries I’ve written on testing and accountability, and here’s a nice review chapter. These all conclude, rightly, that accountability has produced meaningful positive effects on student outcomes.


Friends don’t let friends misuse NAEP data

At some point in the next few weeks, the results from the 2015 administration of the National Assessment of Educational Progress (NAEP) will be released. I can all but guarantee you that the results will be misused and abused in ways that scream misNAEPery. My warning in advance is twofold. First, do not misuse these results yourself. Second, do not share or promote the misuse of these results by others who happen to agree with your policy predilections. This warning applies of course to academics, but also to policy advocates and, perhaps most importantly of all, to education journalists.

Here are some common types of misused or unhelpful NAEP analyses to look out for and avoid. I think this is pretty comprehensive, but let me know in the comments or on Twitter if I’ve forgotten anything.

  • Pre-post comparisons involving the whole nation or a handful of individual states to claim causal evidence for particular policies. This approach is used by both proponents and opponents of current reforms (including, sadly, our very own outgoing Secretary of Education). Simply put, while it’s possible to approach causal inference using NAEP data, that’s not accomplished by taking pre-post differences in a couple of states and calling it a day. You need to have sophisticated designs that look at changes in trends and levels and that attempt to poke as many holes as possible in their results before claiming a causal effect.
  • Cherry-picked analyses that focus only on certain subjects or grades rather than presenting the complete picture across subjects and grades. This is most often employed by folks with ideological agendas (using 12th grade data, typically), but it’s also used by prominent presidential candidates who want to argue their reforms worked. Simply put, if you’re going to present only some subjects and grades and not others, you need to offer a compelling rationale for why.
  • Correlational results that look at levels of NAEP scores and particular policies (e.g., states that have unions have higher NAEP scores, states that score better on some reformy charter school index have lower NAEP scores). It should be obvious why correlations of test score levels are not indicative of any kind of causal effect, given the tremendous demographic and structural differences across states that can’t be controlled in these naïve analyses.
  • Analyses that simply point to low proficiency levels on NAEP (spoiler alert: the results will show that, in every subject and grade, many kids are not proficient) to say that we’re a disaster zone and a) the whole system needs to be blown up or b) our recent policies clearly aren’t working.
  • (Edit, suggested by Ed Fuller) Analyses that primarily rely on percentages of students at various performance levels, instead of using the scale scores, which are readily available and provide much more information.
  • More generally, “research” that doesn’t even attempt to account for things like demographic changes in states over time (hint: these data are readily available, and analyses that account for demographic changes will almost certainly show more positive results than those that do not). A sketch of the simplest such adjustment follows this list.

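To make that last point concrete, here is a minimal sketch (with entirely made-up numbers) of the simplest possible demographic adjustment: recompute the later year’s average while holding subgroup shares fixed at the earlier year’s composition. A real analysis would use the actual NAEP subgroup shares and scale scores, and would do much more besides, but the mechanics are the same.

```python
# A toy direct-standardization example. All numbers are hypothetical.
shares_2013 = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}
shares_2015 = {"group_a": 0.52, "group_b": 0.30, "group_c": 0.18}

means_2013 = {"group_a": 250, "group_b": 230, "group_c": 225}
means_2015 = {"group_a": 252, "group_b": 233, "group_c": 228}  # every group gains

def weighted_mean(means, shares):
    return sum(means[g] * shares[g] for g in means)

raw_2013 = weighted_mean(means_2013, shares_2013)
raw_2015 = weighted_mean(means_2015, shares_2015)   # uses the 2015 composition
adj_2015 = weighted_mean(means_2015, shares_2013)   # holds composition at 2013

print(f"Unadjusted change:            {raw_2015 - raw_2013:+.1f}")  # about +0.7
print(f"Composition-adjusted change:  {adj_2015 - raw_2013:+.1f}")  # about +2.4
```

Every subgroup gains two to three points, but because the lower-scoring groups grow as a share of the population, the unadjusted average barely moves. Holding composition fixed recovers the underlying improvement.
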
Having ruled out all of your favorite kinds of NAEP-related fun, what kind of NAEP reporting and analysis would I say is appropriate immediately after the results come out?

  • Descriptive summaries of trends in state average NAEP scores, not just across two NAEP waves but across multiple waves, grades, and subjects. These might be used to generate hypotheses for future investigation but should not (ever (no really, never)) be used naively to claim some policies work and others don’t.
  • Analyses that look at trends for different subgroups and the narrowing or closing of gaps (while noting that some of the category definitions change over time).
  • Analyses that specifically point out that it’s probably too early to examine the impact of particular policies we’d like to evaluate and that even if we could, it’s more complicated than taking 2015 scores and subtracting 2013 scores and calling it a day.

The long and the short of it is that any stories that come out in the weeks after NAEP scores are released should be, at best, tentative and hypothesis-generating (as opposed to definitive and causal effect-claiming). And smart people should know better than to promote inappropriate uses of these data, because folks have been writing about this kind of misuse for quite a while now.

Rather, the kind of NAEP analysis that we should be promoting is the kind that’s carefully done, that’s vetted by researchers, and that’s designed in a way that brings us much closer to the causal inferences we all want to make. It’s my hope that our work in the C-SAIL center will be of this type. But you can bet our results won’t be out the day the NAEP scores hit. That kind of thoughtful research designed to inform rather than mislead takes more than a day to put together (but hopefully not so much time that the results cannot inform subsequent policy decisions). It’s a delicate balance, for sure. But everyone’s goal, first and foremost, should be to get the answer right.

A (quick, direct, 2000 word) response to Tucker on testing

There’s been a bit of a kerfuffle recently in the edu-Twittersphere, since Marc Tucker suggested that civil rights leaders ought to reconsider their support for annual testing [1]. Kati Haycock and Jonah Edelman wrote impassioned responses, which Tucker has just dismissed as not responding to his substantive arguments. He ends with this paragraph:

The facts ought to count for something. What both of these critiques come down to is an assertion that I don’t have any business urging established leaders of the civil rights community to reconsider the issue, that I simply don’t understand the obvious—that annual accountability testing is essential to justice for poor and minority students, that anyone who thinks otherwise must be in the pocket of the teachers unions.  Well, it is not obvious. Indeed, all the evidence says it is not true. And anyone who knows me knows that I am in no one’s pocket. I know the leaders of the civil rights community to be people of great integrity.  They aren’t in anyone’s pocket, either. I think they want what is best for the people they represent. And I do not think that is annual testing.

I think Mr. Tucker greatly overstates the evidence in his initial post, so I’m going to do my best to give a very brief and direct response to the substantive arguments he makes there. I do this not to defend Haycock and Edelman (whom I do not really know), but to defend the policy, which I believe is unfairly maligned in Tucker’s posts.

Let me start by saying that I am generally in favor of annual testing, though I am probably not as fervid in that support as some others in the “reform” camp. I do not believe that annual accountability testing is essential to justice for poor and minority students, but I do think high-quality tests at reasonable intervals would almost certainly be beneficial to them.

Okay, here goes.

1) In his initial post, Marc Tucker says,

First of all, the data show that, although the performance of poor and minority students improved after passage of the No Child Left Behind Act, it was actually improving at a faster rate before the passage of the No Child Left Behind Act.

That link is to a NAEP report that indeed provides descriptive evidence supporting Tucker’s point. However, there are at least two peer-reviewed articles using NAEP data that show positive causal impacts of NCLB using high-quality quasi-experimental design, one on fourth grade math achievement only and the other on fourth and eighth grade math achievement and (suggestively) fourth grade reading. The latter is, to my eye, the most rigorous analysis that yet exists on this topic. There is a third article that uses cross-state NAEP data and does not find an impact, but again the most recent analysis by Wong seems to me to be the most methodologically sophisticated of the lot and, therefore, the most trustworthy. I think if Tucker wants to talk NAEP data, he has to provide evidence of this quality that supports his position of “no effect” (or even “harm,” as he appears to be suggesting). Is there a quality analysis using a strong design that shows a negative impact on the slope of achievement gains caused by NCLB? I do not know of one.

I should also note that there are beaucoup within-state studies of the impacts of accountability policies that use regression discontinuity designs and find causal impacts. For instance: in North Carolina, in Florida, and in Wisconsin. In short: I don’t see any way to read the causal literature on school accountability and conclude that it has negative impacts on student achievement. I don’t even see a way to conclude it has neutral impacts, given the large number of studies finding positive impacts relative to the few with strong designs that find no impacts.

2) Next, Tucker says:

Over the 15-year history of the No Child Left Behind Act, there is no data to show that it contributed to improved student performance for poor and minority students at the high school level, which is where it counts.

Here I think Marc is moving the goalposts a bit. Is high school performance of poor and minority students the target? Then I guess we may as well throw out all the above-cited studies. I know of no causal studies that directly investigate the impact on this particular outcome, so I think the best he’s got is the NAEP trends. And sure, trends in high school performance are relatively flat.

I’m not one to engage in misNAEPery, however, so I wouldn’t make too much of this. Nor would I make too much of the fact that high school graduation rates have increased for all groups (meaning tons more low-performing students who in days gone by would have dropped out are still around to take the NAEP in 12th grade, among other things). But I would make quite a bit of the fact that the above-cited causal studies obviously also apply to historically underserved groups (that is, while they rarely directly test the impact of accountability on achievement gaps, they very often test the impacts for different groups and find that all groups see the positive effects). And I would also note some evidence from North Carolina of direct narrowing effects on black-white gaps.

3) Next, we have:

Many nations that have no annual accountability testing requirements have higher average performance for poor and minority students and smaller gaps between their performance and the performance of majority students than we do here in the United States.  How can annual testing be a civil right if that is so?

There’s not much to say about this. It’s not based on any study I know of, certainly none that would suggest a causal impact one way or the other. But he’s right that we’re relatively unusual in requiring annual testing, and therefore that many higher-achieving nations don’t have it. They also don’t have many other policies that we have, so I’m not sure what’s to be learned from this observation.

4) Now he moves on to claim:

It is not just that annual accountability testing with separate scores for poor and minority students does not help those students.  The reality is that it actually hurts them. All that testing forces schools to buy cheap tests, because they have to administer so many of them.  Cheap tests measure low-level basic skills, not the kind of high-level, complex skills most employers are looking for these days.  Though students in wealthy communities are forced to take these tests, no one in those communities pays much attention to them.  They expect much more from their students. It is the schools serving poor and minority students that feed the students an endless diet of drill and practice keyed to these low-level tests.  The teachers are feeding these kids a dumbed down curriculum to match the dumbed down tests, a dumbed down curriculum the kids in the wealthier communities do not get.

This paragraph doesn’t have links, probably because it’s not well supported by the existing evidence. Certainly you hear this argument all the time, and I believe it may well be true that schools serving poor kids have worse curricula or more perverse responses to tests (even some of my own work suggests different kinds of instructional responses in different kinds of schools). But even if we grant that this impact is real, the literature on achievement effects certainly does not suggest harm. And the fact that graduation rates are skyrocketing certainly does not suggest harm. If he’s going to claim harm, he has to provide clear, compelling evidence of harm. This ain’t it. And finally here, a small point. I hate when people say schools are “forced” to do anything. States, districts, and schools were not forced to buy bad tests before. They have priorities, and they have prioritized cheap and fast. That’s a choice, not a matter of force.

5) Next, Tucker claims:

Second, the teachers in the schools serving mainly poor and minority kids have figured out that, from an accountability standpoint, it does them no good to focus on the kids who are likely to pass the tests, because the school will get no credit for it. At the same time, it does them no good to focus on the kids who are not likely to pass no matter what the teacher does, because the school will get no credit for that either. As a result, the faculty has a big incentive to focus mainly on the kids who are just below the pass point, leaving the others to twist in the wind.

I am certainly familiar with the literature cited here, and I don’t dispute any of it. Quite the contrary, I acknowledge the conclusion that the students who are targeted by the accountability system see the greatest gains. This has been shown in many well-designed studies, such as here, here, here, and here. But this is an argument about accountability policy design, not about annual testing. It simply speaks to the need for better accountability policies. For instance, suppose we thought the “bubble kids” problem was a bad one that needed solving. We could solve it tomorrow: simply create a system where all that matters is growth. Voila, no bubble kids! Of course there would be tradeoffs to that decision, so probably some mixture is better.

6) Then Tucker moves on to discuss the teaching force:

Not only is it true that annual accountability testing does not improve the performance of poor and minority students, as I just explained, but it is also true that annual accountability testing is making a major contribution to the destruction of the quality of our teaching force.

There’s no evidence for this. I know of not a single study that suggests that there is even a descriptive decrease in the quality of our teaching force in recent years. Certainly not one with a causal design of any kind that implicates annual accountability testing. And there is recent evidence that suggests improvements in the quality of the workforce, at least in certain areas such as New York and Washington.

7) Next, he takes on the distribution of teacher quality:

One of the most important features of these accountability systems is that they operate in such a way as to make teachers of poor and minority students most vulnerable.  And the result of that is that more and more capable teachers are much less likely to teach in schools serving poor and minority students.

It is absolutely true that the lowest quality teachers are disproportionately likely to serve the most disadvantaged students. But I know of not a single piece of evidence that this is caused by (or even made worse by) annual testing and accountability policies. My hunch is that this has always been true, but that’s just a hunch. If Tucker has evidence, he should provide it.

8) The final point is one that hits close to home:

Applications to our schools of education are plummeting and deans of education are reporting that one of the reasons is that high school graduates who have alternatives are not selecting teaching because it looks like a battleground, a battleground created by the heavy-handed accountability systems promoted by the U.S. Department of Education and sustained by annual accountability testing.

As someone employed at a school of education, I can say the first clause here is completely true. And we’re quite worried about it. But again, I know of not a single piece of even descriptive evidence that suggests this is due to annual accountability testing. Annual accountability testing has been around for well over a decade. Why would the impact be happening right now?

I think these are the main arguments in Tucker’s piece, and I have provided evidence or argumentation here that suggests that not one of them is supported by the best academic research that exists today. Perhaps the strongest argument of the eight is the second one, but again I know of no quality research that attributes our relative stagnation on 12th grade NAEP to annual accountability testing. That does not mean Tucker is wrong. But it does mean that he is the one who should bear the burden of providing evidence to support his positions, not Haycock and Edelman. I don’t believe he can produce such evidence, because I don’t believe it exists.


[1] I think it’s almost universally a bad idea to tell civil rights leaders what to do.

Research you should read – on the impact of NCLB

This is the first in what will be a mainstay of this blog: a discussion of a recent publication (peer-reviewed or not) that I think more folks should be reading and citing. Today’s article is both technically impressive and substantively important. It has the extremely un-thrilling name “Adding Design Elements to Improve Time Series Designs: No Child Left Behind as an Example of Causal Pattern-Matching,” and it appears in the most recent issue of the Journal of Research on Educational Effectiveness (the journal of the excellent SREE organization) [1].

The methodological purpose of this article is to add “design elements” to the Comparative Interrupted Time Series design (a common quasi-experimental design used to evaluate the causal impact of all manner of district- or state-level policies). The substantive purpose of this article is to identify the causal impact of NCLB on student achievement using NAEP data. While the latter has already been done (see for instance Dee and Jacob), this article strengthens Dee and Jacob’s findings through its design-elements analysis.

In essence, what design elements bring to the CITS design for evaluating NCLB is a greater degree of confidence in the causal conclusions. Wong and colleagues, in particular, demonstrate NCLB’s impacts in multiple ways:

  • By comparing public and Catholic schools.
  • By comparing public and non-Catholic private schools.
  • By comparing states with high proficiency bars and low ones.
  • By using tests in 4th and 8th grade math and 4th grade reading.
  • By using Main NAEP and long-term trend NAEP.
  • By comparing changes in mean scores and time-trends.
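
For readers who want a feel for the machinery, here is a bare-bones sketch of the core comparative interrupted time series regression that this kind of analysis builds on. The file and column names are placeholders I’ve made up for illustration; the actual paper layers the design elements listed above on top of a far more careful specification.

```python
# Minimal CITS sketch: public schools (subject to NCLB) vs. a comparison
# sector (e.g., Catholic schools), with NCLB's effect identified from the
# treated sector's extra shift in level and slope at the policy onset.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per sector per NAEP wave, with columns
#   score   - mean scale score
#   year    - NAEP wave, centered at the NCLB onset
#   post    - 1 for waves after NCLB, else 0
#   treated - 1 for public schools, 0 for the comparison sector
df = pd.read_csv("naep_sector_by_wave.csv")

model = smf.ols(
    "score ~ year + post + post:year"          # trend, level shift, slope shift
    " + treated + treated:year"                # baseline sector differences
    " + treated:post + treated:post:year",     # NCLB effects on level and slope
    data=df,
).fit()
print(model.summary())
```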

The substantive findings are as follows:

1. We now have national estimates of the effects of NCLB by 2011.

2. We now know that NCLB affected eighth-grade math, something not statistically confirmed in either Wong, Cook, Barnett, and Jung (2008) or Dee and Jacob (2011) where positive findings were limited to fourth-grade math.

3. We now have consistent but statistically weak evidence of a possible, but distinctly smaller, fourth-grade reading effect.

4. Although it is not clear why NCLB affected achievement, some possibilities are now indicated.

These possibilities include a) consequential accountability, b) higher standards, and c) the combination of the two.

So why do I like this article so much? Well, of course, one reason is because it supports what I believe to be the truth about consequential standards-based accountability–that it has real, meaningfully large impacts on student outcomes [2][3]. But I also think this article is terrific because of its incredibly thoughtful design and execution and its clever use of freely available data. Regardless of one’s views on NCLB, this should be an article for policy researchers to emulate. And that’s why you should read it.


[1] This article, like many articles I’ll review on this blog, is paywalled. If you want a PDF and don’t have access through your library, send me an email.

[2] See this post for a concise summary of my views on this issue.

[3] Edited to add: I figured it would be controversial to say that I liked an article because it agreed with my priors. Two points. First, I think virtually everyone prefers research that agrees with their priors, so I’m merely being honest; deal with it. Second, as Sherman Dorn points out via Twitter, this is conjunctive: I like it because it’s a very strong analysis AND it agrees with my priors. If it were a shitty analysis that agreed with my priors, I wouldn’t have blogged about it.

A note on Simpson’s Paradox and NAEP

A couple of weeks ago, before yours truly joined the blogosphere, the results for NAEP history, geography, and civics were released. Journalists and advocates around the nation reacted with their usual swift condemnation, noting the “flatlining”, “stagnant” performance. And it’s true, overall average scores on the newly released tests had not changed since their previous administration.

A few wise individuals, however, noticed that the scores had continued to increase when broken down by subgroup. Chad Aldeman penned the best defense, invoking Simpson’s Paradox to conclude that achievement is rising, and not by a trivial amount. In this case, Simpson’s Paradox means that the gains by individual subgroups (every subgroup is gaining in these subjects, and the largest gains are going to the historically most underserved groups) are masked when calculating overall averages because the typically lower-performing subgroups are increasing in numbers.

Jay Greene shot back in the comments section of Chad’s piece, arguing that Simpson’s Paradox was not an appropriate excuse here, because minority students are less difficult to educate now than minority students were 30 or 40 years ago, so making comparisons within groups is not necessarily appropriate.

I will actually take a middle ground here and say there is an element of truth to both arguments. This is because, in evaluating whether it’s better to focus on individual subgroups or the overall average in a case of Simpson’s Paradox, I find it useful to consider what the question of interest is.

As an example, consider the case of two airlines (American and United) operating at two airports (O’Hare and LAX). United flies 100 flights out of each airport with a 55% on-time rating from O’Hare and an 85% rating from LAX (thus, 70% overall). American flies 200 flights out of O’Hare with a 60% on-time rating and 50 flights out of LAX with a 90% rating (thus, 66% overall). Now, if you were buying a ticket based on the aggregate statistics, you would choose United, because it has a higher overall on-time rate. But the overall average in this case is completely useless; it only applies to you if you pick your flights (including your departing airports) completely at random. If, instead, you pick your flights like a normal person by first choosing a departing airport and then choosing an airline, you are always better off choosing American. So in this case, the “subgroup” question is by far the more interesting one, and the “average” question is misleading and worthless.
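
If it helps to see the arithmetic, here is the example worked out in a few lines of Python; the figures are exactly the ones in the paragraph above.

```python
# (airline, airport) -> (number of flights, on-time rate)
flights = {
    ("United", "O'Hare"): (100, 0.55),
    ("United", "LAX"): (100, 0.85),
    ("American", "O'Hare"): (200, 0.60),
    ("American", "LAX"): (50, 0.90),
}

# Aggregate on-time rates: United 70%, American 66%, so United "wins"...
for airline in ("United", "American"):
    n_total = sum(n for (a, _), (n, _) in flights.items() if a == airline)
    n_on_time = sum(n * r for (a, _), (n, r) in flights.items() if a == airline)
    print(f"{airline}: overall on-time rate = {n_on_time / n_total:.0%}")

# ...but at each individual airport American is better (60% vs 55%, 90% vs 85%).
for airport in ("O'Hare", "LAX"):
    rates = {a: r for (a, ap), (_, r) in flights.items() if ap == airport}
    print(f"{airport}: {rates}")
```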

To me, the primary question of interest with respect to NAEP is whether a given kid is likely to be better off now than he or she would have been 20 years ago. This is a subgroup question–we want to compare each kid to himself if he’d only been born 20 years earlier. Here, the answer is very clearly yes (with the possible exception of kids in extreme poverty). For all subgroups, NAEP achievement in all subjects continues to increase, as do high school graduation rates.

However, I can see the argument that the main question of interest is how the nation as a whole is doing, in which case it’s not overly relevant if the subgroups are making gains but the national average is not. The argument here basically says “the population is what it is, and we have to deal with that.”

Regardless of one’s view on Simpson’s Paradox in this particular case, I actually remain stunned and impressed by our students’ performance in subjects like geography and civics. Given that these are non-tested NCLB subjects (and thus have certainly seen reduced emphasis in classrooms), I find it remarkable that performance has not only not declined but has actually continued to tick up for all kinds of kids. This story, the nuanced version that includes attention to subgroups, is one that certainly needs to be told more often.