Friends don’t let friends misuse NAEP data

At some point in the next few weeks, the results from the 2015 administration of the National Assessment of Educational Progress (NAEP) will be released. I can all but guarantee you that the results will be misused and abused in ways that scream misNAEPery. My warning in advance is twofold. First, do not misuse these results yourself. Second, do not share or promote the misuse of these results by others who happen to agree with your policy predilections. This warning applies of course to academics, but also to policy advocates and, perhaps most importantly of all, to education journalists.

Here are some common types of misused or unhelpful NAEP analyses to look out for and avoid. I think this is pretty comprehensive, but let me know in the comments or on Twitter if I’ve forgotten anything.

  • Pre-post comparisons involving the whole nation or a handful of individual states to claim causal evidence for particular policies. This approach is used by both proponents and opponents of current reforms (including, sadly, our very own outgoing Secretary of Education). Simply put, while it’s possible to approach causal inference using NAEP data, that’s not accomplished by taking pre-post differences in a couple of states and calling it a day. You need to have sophisticated designs that look at changes in trends and levels and that attempt to poke as many holes as possible in their results before claiming a causal effect.
  • Cherry-picked analyses that focus only on certain subjects or grades rather than presenting the complete picture across subjects and grades. This is most often employed by folks with ideological agendas (using 12th grade data, typically), but it’s also used by prominent presidential candidates who want to argue their reforms worked. Simply put, if you’re going to present only some subjects and grades and not others, you need to offer a compelling rationale for why.
  • Correlational results that look at levels of NAEP scores and particular policies (e.g., states that have unions have higher NAEP scores, states that score better on some reformy charter school index have lower NAEP scores). It should be obvious why correlations of test score levels are not indicative of any kinds of causal effects given the tremendous demographic and structural differences across states that can’t be controlled in these naïve analyses.
  • Analyses that simply point to low proficiency levels on NAEP (spoiler alert: the results will show many kids are not proficient in all subjects and grades) to say that we’re a disaster zone and a) the whole system needs to be blown up or b) our recent policies clearly aren’t working.
  • (Edit, suggested by Ed Fuller) Analyses that primarily rely on percentages of students at various performance levels, instead of using the scale scores, which are readily available and provide much more information.
  • More generally, “research” that doesn’t even attempt to account for things like demographic changes in states over time (hint: these data are readily available, and analyses that account for demographic changes will almost certainly show more positive results than those that do not).
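To see why that last point matters, here is a minimal sketch with entirely made-up numbers (the subgroup names, shares, and scores are hypothetical, not NAEP data). Every subgroup improves between the two years, yet the raw state average falls because the mix of students changes. Holding the subgroup shares fixed at their first-year values (a simple direct standardization, one of several possible adjustments) recovers the improvement.

```python
# Hypothetical subgroup data for one state: (share of test takers, mean scale score).
# None of these numbers come from NAEP; they are invented to show the mechanics.
year_a = {"group_1": (0.70, 290.0), "group_2": (0.30, 260.0)}
year_b = {"group_1": (0.50, 292.0), "group_2": (0.50, 263.0)}


def overall_mean(groups):
    """Weighted mean using that year's own subgroup shares."""
    return sum(share * score for share, score in groups.values())


def adjusted_mean(groups, reference):
    """Weighted mean using the reference year's shares (direct standardization)."""
    return sum(reference[name][0] * score for name, (_, score) in groups.items())


# Raw change mixes real score changes with changes in who is being tested.
raw_change = overall_mean(year_b) - overall_mean(year_a)

# Adjusted change holds the demographic mix fixed at its year_a values.
adjusted_change = adjusted_mean(year_b, year_a) - overall_mean(year_a)

print(f"raw change:      {raw_change:+.1f}")
print(f"adjusted change: {adjusted_change:+.1f}")
```

With these invented inputs the raw change is negative and the adjusted change is positive, even though both subgroups scored higher in the second year. Real adjustments would use more categories and student-level data, but the logic is the same.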

Having ruled out all of your favorite kinds of NAEP-related fun, what kind of NAEP reporting and analysis would I say is appropriate immediately after the results come out?

  • Descriptive summaries of trends in state average NAEP scores, not just across two NAEP waves but across multiple waves, grades, and subjects. These might be used to generate hypotheses for future investigation but should not (ever (no really, never)) be used naively to claim some policies work and others don’t.
  • Analyses that look at trends for different subgroups and the narrowing or closing of gaps (while noting that some of the category definitions change over time).
  • Analyses that specifically point out that it’s probably too early to examine the impact of particular policies we’d like to evaluate and that even if we could, it’s more complicated than taking 2015 scores and subtracting 2013 scores and calling it a day.
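As a rough illustration of why the subtract-2013-from-2015 approach is not enough, the sketch below uses a hypothetical series of state average scores (the values are invented) and compares the naive two-wave difference with the deviation from the state's own prior trend. The helper `fit_line` is an assumption of mine for illustration, plain least squares on the pre-2015 waves. Neither number is a causal estimate; the point is only that the two descriptive summaries can tell opposite stories.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical state average scale scores by NAEP wave (not real data).
scores = {2007: 278.0, 2009: 280.0, 2011: 282.0, 2013: 284.0, 2015: 285.0}

# Naive pre-post difference: the latest wave minus the one before it.
naive_change = scores[2015] - scores[2013]

# Trend-based view: project the 2007-2013 trend forward to 2015.
pre_years = [2007, 2009, 2011, 2013]
slope, intercept = fit_line(pre_years, [scores[y] for y in pre_years])
projected_2015 = slope * 2015 + intercept
deviation_from_trend = scores[2015] - projected_2015

print(f"naive 2015 minus 2013:     {naive_change:+.1f}")
print(f"2015 minus projected 2015: {deviation_from_trend:+.1f}")
```

In this made-up series the score went up between the last two waves, but it came in below where the prior trend was heading. A story built only on the first number and a story built only on the second would reach opposite conclusions, which is exactly why either one alone should be treated as hypothesis-generating.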

The long and the short of it is that any stories that come out in the weeks after NAEP scores are released should be, at best, tentative and hypothesis-generating (as opposed to definitive and causal effect-claiming). And smart people should know better than to promote inappropriate uses of these data, because folks have been writing about this kind of misuse for quite a while now.

Rather, the kind of NAEP analysis that we should be promoting is the kind that’s carefully done, that’s vetted by researchers, and that’s designed in a way that brings us much closer to the causal inferences we all want to make. It’s my hope that our work in the C-SAIL center will be of this type. But you can bet our results won’t be out the day the NAEP scores hit. That kind of thoughtful research designed to inform rather than mislead takes more than a day to put together (but hopefully not so much time that the results cannot inform subsequent policy decisions). It’s a delicate balance, for sure. But everyone’s goal, first and foremost, should be to get the answer right.


Monday Morning Alignment Critiques

As I’ve written about already, one of my main research interests these days is the quality and alignment of textbooks to standards. My recent work on this issue is among the first peer-reviewed studies (if not the first) to employ a widely used alignment technique to rate the alignment of textbooks with standards. While I think the approach I use is great (or else I wouldn’t do it), it’s certainly not perfect. There are many ways to determine alignment; all of them are flawed.

Of course, there are others in this space as well. The two biggest players, by far, are Bill Schmidt and EdReports [1]. Both are well funded and have released ratings of textbook alignment. EdReports’ ratings have recently come under fire from many directions, including both publishers and, now, the National Council of Teachers of Mathematics. NCTM released a pretty scathing open letter, which was covered by Liana Heitin over at EdWeek, accusing EdReports of errors and methodological flaws.

I have three general comments about this response by NCTM.

The first is that there is no one right way to do an alignment analysis. While the EdReports “gateway” approach might not have been the method I’d have chosen, it seems to me to be a perfectly reasonable way to constrain the (very arduous) task of reading and rating a huge pile of textbooks. Perhaps they’d have gotten somewhat different results with a different method; who knows? But their results are generally in line with mine and Bill’s, so I highly doubt that their overall finding of mediocre alignment is driven by the method.

The second is that we need to always consider the other options when we’re evaluating criticisms like this. What kind of alignment information is out there currently? Basically you’ve got my piddly study of 7 books, Bill’s larger database, and EdReports [2]. Otherwise you have to either trust what the publisher says or come up with your own ratings. In that context, it’s not clear to me that EdReports is any worse than what else is available. And EdReports is almost certainly better than districts doing their own home-cooked analyses. The more information the better, I say.

The third point, and by far the most important, is that this kind of criticism is really not helpful in a time when schools and districts are desperate for quality information about curriculum materials. Schools and districts have been making decisions about these materials for years with virtually no information. Now we finally have some information (imperfect though it may be) and we’re nit-picking the methodological details? This completely misses the forest for the trees. If NCTM wants to be a leader here, they should be out in front on this issue offering their own evaluations to schools and districts. Otherwise it’s left to folks like EdReports or me to do what we can to fill this yawning gap by providing information that was needed years ago. Monday morning alignment critiques aren’t helpful. Actually getting in the game and giving educators information–that’d be a useful contribution.

[1] For the record, I participated in the webinar where EdReports’ results were released, but I have not been paid by them and don’t currently do any work with them.

[2] There’s probably other stuff out there I don’t know about.

The Impact of Common Core

It’s pretty much always a good idea to read Matt Di Carlo over at the Shankerblog. His posts are always thoughtful and middle-of-the-road, a refreshing antidote to the usual advocacy blather. His recent post about the purpose and potential impact of the Common Core is no exception.

Here’s where I agree with Matt:

  • That standards alone are probably unlikely to have large impacts on student achievement.
  • That advocates of the standards do a disservice when they project such claims.
  • That making definitive statements about the impact of Common Core on student outcomes will be hard (and, I would say, causal research is almost certainly not worth doing at this point in the implementation process).

Here’s where I don’t agree with Matt. I don’t agree that standards are not meant to boost achievement. I believe that they most certainly are meant to boost achievement. Standards are intended to improve the likelihood that students will have access to a quality curriculum and, through that, learn more and better stuff. It’s a pretty straightforward theory of action, actually. Something like:

Standards (+ other policies) –> Improved, aligned instruction –> Student achievement

And I think we have pretty decent evidence on this theory of action. For instance, my work and the work of others make it reasonably clear that standards can affect what and how teachers teach (albeit imperfectly). There’s a great deal of research on the very commonsense notion that what and how teachers teach affects what students learn (my study from last year notwithstanding). We don’t have studies that I’m aware of that draw the causal arrow directly from standards to achievement, but given the evidence on the indirect paths I believe this may well be due to the weaknesses of the data and designs more than the lack of an effect.

That said, I fully echo Matt’s concerns about overstating the case for quality standards, and I hope advocates take this warning to heart. What we need is not over-hyped claims and shoddy analyses designed to show positive impacts [1]. What we need at this point is thoughtful studies of implementation and cautious, tentative investigations of early effects. These are just the kind of studies that we are seeking in the “special issue” of AERA Open that I’m curating. My hope is that this issue will provide some of the first quality evidence about implementation and effects, in order to inform course corrections and begin building the evidence base about this reform.

[1] Edited to add: We also don’t need garbage studies by Common Core opponents using equally shoddy methods to conclude the standards aren’t working.

Research you should read – on the impact of NCLB

This is the first in what will be a mainstay of this blog–a discussion of a recent publication (peer-reviewed or not) that I think more folks should be reading and citing. Today’s article is both technically impressive and substantively important. It has the extremely un-thrilling name “Adding Design Elements to Improve Time Series Designs: No Child Left Behind as an Example of Causal Pattern-Matching,” and it appears in the most recent issue of the Journal of Research on Educational Effectiveness (the journal of the excellent SREE organization) [1].

The methodological purpose of this article is to add “design elements” to the Comparative Interrupted Time Series (CITS) design, a common quasi-experimental design used to evaluate the causal impact of all manner of district- or state-level policies. The substantive purpose of this article is to identify the causal impact of NCLB on student achievement using NAEP data. While the latter has already been done (see for instance Dee and Jacob), this article strengthens Dee and Jacob’s findings through its design-elements analysis.

In essence, what design elements bring to the CITS design for evaluating NCLB is a greater degree of confidence in the causal conclusions. Wong and colleagues, in particular, demonstrate NCLB’s impacts in multiple ways:

  • By comparing public and Catholic schools.
  • By comparing public and non-Catholic private schools.
  • By comparing states with high proficiency bars and low ones.
  • By using tests in 4th and 8th grade math and 4th grade reading.
  • By using Main NAEP and long-term trend NAEP.
  • By comparing changes in mean scores and time-trends.
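For readers who haven't seen a CITS before, here is a bare-bones sketch of the core logic using invented numbers. It is not the model Wong and colleagues estimate, which is far more careful. The series values, the time index (periods at or before 0 are pre-policy), and the helper names `fit_line` and `deviation_from_trend` are all assumptions for illustration. The idea is to project each group's pre-policy trend forward, measure how far the post-policy score departs from that projection, and then difference the two departures.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviation_from_trend(series, target_period):
    """Observed score minus the score projected from the pre-policy trend."""
    pre = sorted((t, score) for t, score in series.items() if t <= 0)
    slope, intercept = fit_line([t for t, _ in pre], [s for _, s in pre])
    return series[target_period] - (slope * target_period + intercept)


# Hypothetical mean scores by period (not real data); the policy starts after period 0.
treated = {-3: 270.0, -2: 271.0, -1: 272.0, 0: 273.0, 1: 277.0, 2: 279.0}
comparison = {-3: 280.0, -2: 281.5, -1: 283.0, 0: 284.5, 1: 286.5, 2: 288.0}

treated_dev = deviation_from_trend(treated, target_period=2)
comparison_dev = deviation_from_trend(comparison, target_period=2)

# The CITS-style estimate: how much more the treated group departed from its own
# trend than the comparison group departed from its trend.
cits_estimate = treated_dev - comparison_dev

print(f"treated deviation:    {treated_dev:+.1f}")
print(f"comparison deviation: {comparison_dev:+.1f}")
print(f"difference:           {cits_estimate:+.1f}")
```

Each item in the list above amounts to rerunning this kind of comparison with a different comparison group, outcome, or test. When the answers agree across all of them, it becomes much harder to explain the pattern away as an artifact of any one choice.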

The substantive findings are as follows:

1. We now have national estimates of the effects of NCLB by 2011.

2. We now know that NCLB affected eighth-grade math, something not statistically confirmed in either Wong, Cook, Barnett, and Jung (2008) or Dee and Jacob (2011) where positive findings were limited to fourth-grade math.

3. We now have consistent but statistically weak evidence of a possible, but distinctly smaller, fourth-grade reading effect.

4. Although it is not clear why NCLB affected achievement, some possibilities are now indicated.

These possibilities include a) consequential accountability, b) higher standards, and c) the combination of the two.

So why do I like this article so much? Well, of course, one reason is that it supports what I believe to be the truth about consequential standards-based accountability–that it has real, meaningfully large impacts on student outcomes [2][3]. But I also think this article is terrific because of its incredibly thoughtful design and execution and its clever use of freely available data. Regardless of one’s views on NCLB, this should be an article for policy researchers to emulate. And that’s why you should read it.

[1] This article, like many articles I’ll review on this blog, is paywalled. If you want a PDF and don’t have access through your library, send me an email.

[2] See this post for a concise summary of my views on this issue.

[3] Edited to add: I figured it would be controversial to say that I liked an article because it agreed with my priors. Two points. First, I think virtually everyone prefers research that agrees with their priors, so I’m merely being honest; deal with it. Second, as Sherman Dorn points out via Twitter, this is conjunctional–I like it because it’s a very strong analysis AND it agrees with my priors. If it was a shitty analysis that agreed with my priors, I wouldn’t have blogged about it.

Everyone’s got an opinion about everything

Okay, not everyone. And not everything. But surprisingly many people about surprisingly many things. This will be the first of many posts about public opinion polling data, something in which I have increasing interest (even if little technical expertise).

Today’s interesting nugget comes via NPR, which reports on a recent little exercise done by Public Policy Polling. It seems that after a random tweet from a TCU professor, PPP polled voters and found that they had stunningly negative views of this person (whom they could not possibly have heard of)–3% favorable to 20% unfavorable. The money quote:

The big lesson for Farris, who is already thinking about how she’ll work this experiment into her next political science class, is in “pseudo-opinions.”

“People will offer an opinion when they don’t actually have one,” she said. “There is social pressure to answer, and give some type of opinion, whether it’s right or wrong.”

There is a recent boom in public opinion polls on education, and I am willing to bet many of the same trends come into play. Despite their general lack of knowledge about education issues, Americans want to give their opinions. In particular, for example, polls suggest that Americans pretty strongly support local control and teachers while also supporting weakened labor protections and testing. I’m sure some of this support is real. But I’ll bet a good chunk of it is just pseudo-opinions. Hopefully well-crafted polling and research can be used to help discern the difference.