Do the content and quality of state tests matter?

Over at Ahead of the Heard, Chad Aldeman has written about the recent Mathematica study, which found that PARCC and MCAS were equally predictive of early college success. He essentially argues that if all tests are equally predictive, states should just choose the cheapest bargain-basement test, content and quality be damned. He offers a list of reasons, which you’re welcome to read.

As you’d guess, I disagree with this argument. I’ll offer a list of reasons of my own here.

  1. The most obvious point is that we have reasonable evidence that testing drives instructional responses to standards. Thus, if the tests used to measure and hold folks/schools accountable are lousy and contain poor-quality tasks, we’ll get poor-quality instruction as well. This is why many folks argue these days that better tests should include tasks much closer to the kinds of things we actually want kids to be doing. In that case, “teaching to the test” becomes “good teaching.” May be a pipe dream, but that’s something I commonly hear.
  2. A second fairly obvious point is that switching to a completely unaligned test would end any possible notion that the tests could provide feedback to teachers about what they should be doing differently/better. Certainly we can all argue that current test results are provided too late to be useful–though smart testing vendors ought to be working on this issue as hard as possible–but if the test is in no way related to what teachers are supposed to be teaching, it’s definitely useless to them as a formative measure.
  3. Chad’s analysis seems to prioritize predictive validity–how well results from the test predict other desired outcomes–over all the other types of validity evidence. It’s not clear to me why we should prefer predictive validity (especially when we already have evidence that GPAs do better at that than most standardized tests, though SAT/ACT adds a little; see the sketch after this list) over, say, content-related validity. Don’t we first and foremost want the test to be a good measure of what students were supposed to have learned in the grade? More generally, I think it makes more sense to have different tests for different purposes, rather than piling all the purposes into a single test.
  4. Certainly if the tests are going to have stakes attached to them, the courts require a certain level of content validity (or what they’ve called instructional validity). See Debra P. v. Turlington. If a kid’s going to be held accountable, they need to have had the opportunity to learn what was on the test. If the test is the SAT, that’s probably not going to happen.
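
Here is the sketch promised in item 3: a minimal illustration (all data simulated; none of these numbers come from the Mathematica study or any real dataset) of how incremental predictive validity is typically checked–regress the college outcome on HS GPA alone, then add the test score and see how much the fit improves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Entirely hypothetical data: a latent "preparation" factor drives
# HS GPA, a standardized test score, and first-year college GPA.
prep = rng.normal(size=n)
hs_gpa = 3.0 + 0.4 * prep + rng.normal(scale=0.3, size=n)
test = 500 + 80 * prep + rng.normal(scale=60, size=n)
college_gpa = 2.8 + 0.5 * prep + rng.normal(scale=0.5, size=n)

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (intercept included)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_gpa = r_squared(hs_gpa, college_gpa)
r2_both = r_squared(np.column_stack([hs_gpa, test]), college_gpa)

print(f"R^2, HS GPA alone:       {r2_gpa:.3f}")
print(f"R^2, HS GPA + test:      {r2_both:.3f}")
print(f"Incremental R^2 of test: {r2_both - r2_gpa:.3f}")
```

The typical empirical pattern–GPA carries most of the predictive load and the test adds a modest increment–is exactly why predictive validity alone is a weak criterion for choosing among tests.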

Anyway, take a look at the Mathematica report (you should anyway!) and Chad’s post and let me know what you think.

Some quick thoughts on opt out

In general, I have not opined much on the subject of “opt out,” for a number of reasons. First, there’s little/no good data or research on the topic, so my opinions can’t be as informed as I would typically like them to be. Second, I don’t know that I have much to add on the issue (and yet I’m about to give my two cents). Third, it’s a trend that actively worries me as someone who believes research clearly shows that tests and accountability have been beneficial overall. I don’t really see much policymakers can do to stop this trend short of requiring public school students to test [1].

Despite my best efforts to avoid the subject, over on Twitter, former MCPS Superintendent Joshua Starr asked me what I think of this EdWeek commentary on opt out. Here are some excerpts of their argument and my reactions.

First, the title is “Test-taking ‘compliance’ does not ensure equity.” The authors probably did not write this title, but it’s a very weak straw man. I know of few (if any) folks who believe that test-taking compliance ensures equity. I certainly don’t believe that. I do believe having good data can help equity, but it certainly doesn’t ensure it.

Some parents have elected to opt their children out of the annual tests as a message of protest, signaling that a test score is not enough to ensure excellence and equity in the education of their children. Parents, they insist, have a right to demand an enriched curriculum that includes the arts, civics, and lab sciences, and high-quality schools in their neighborhoods.

I don’t have good evidence on this (I don’t think anyone yet does, but hopefully several savvy doctoral students are working on this topic), but my very strong sense is that the folks opting out of tests are not typically doing it as an equity protest. Everything I’ve seen and heard so far says this is largely, but not exclusively, a white, upper-middle-class, suburban/rural phenomenon [EDITED TO ADD: Matt Chingos has done a preliminary analysis of this issue and largely agrees with this characterization: http://www.brookings.edu/research/papers/2015/06/18-chalkboard-who-opts-out-chingos]. My conversations with educators in California, for instance, suggest that the high rates of opt-out in high schools in some affluent areas are because the exam was seen as meaningless and as interfering with students’ ability to prepare for other exams that actually matter to them (e.g., APs, SAT).

Since it was signed into law in 2002, No Child Left Behind has done little to advance the educational interests of our most disadvantaged students. What’s more, the high-stakes-testing climate that NCLB created has also been connected to increased discipline rates for students of color and students with disabilities.

I think the first sentence there is not correct–as I showed in the previous post, there’s evidence that achievement has increased due to NCLB for all groups, including the most disadvantaged (though not much evidence that it has narrowed gaps). I’m not aware of well-designed research supporting the latter claim, but that’s not my area. Regardless, as I also discussed in the last post, sweeping claims of harm to disadvantaged students are hard to square with empirical evidence on outcomes such as test scores and graduation rates.

And even after these tests reveal large outcome gaps, schools serving poor children of color remain underfunded and are more likely to be labeled failing. Most states have done nothing to intervene effectively in these schools, even when state officials have taken over school districts. Moreover, despite NCLB’s stated goal of closing the achievement gap, wide disparities in academic outcomes persist.

I think this is mostly true, though of course it depends on the state (some states are much more adequate and equitable in their funding than others). And the lack of intervention in low-performing schools is really about a lack of effective intervention, though I’d be very curious what interventions these authors would recommend. It’s true that achievement gaps persist, though I believe racial (but not income) gaps are about as small now as they’ve ever been.

We are not opposed to assessments, especially when they are used for diagnostic purposes to support learning. But the data produced by annual standardized tests are typically not made available to teachers until after the school year is over, thereby making it impossible to use the information to respond to student needs.

Some of the new state tests get data back faster. For instance, some California results were made available to teachers before the end of the year. In general, I think it’s a bad idea to heap too many different goals onto a single test. It’s not clear to me that we always want our accountability test to also be our formative, immediate-feedback test–those probably should be different tests. But that doesn’t necessarily obviate the need for an external accountability test.

Thus, students of color are susceptible to all of the negative effects of the annual assessments, without any of the positive supports to address the learning gaps. When testing is used merely to measure and document inequities in outcomes, without providing necessary supports, parents have a right to demand more.

Again, I think the intention of both the original NCLB and the waivers was that, in the early years of school “failure,” students would be provided with additional supports and options (e.g., through supplemental education services and public school choice) to improve. Those turn out not to have worked, and perhaps future supports will not either, but it’s not necessarily for lack of effort. I’m curious what specific supports these authors would advocate, bearing in mind the intense hostility among half our nation to raising any additional funds for schools or anything else.

The civil rights movement has never supported compliance with unjust laws and policies. Rather, it has always worked to challenge them and support the courageous actions of those willing to resist. As young people and their allies protest throughout the country against police brutality, demanding that “black lives matter,” we are reminded that the struggle for justice often forces us to hold governments and public officials accountable to reject the status quo. Today’s status quo in education is annual assessments that provide no true path toward equity or excellence.

This strikes me as a stretch, though I agree with the first half of it. I’m not sure the “black lives matter” movement was really about holding the government accountable to reject the status quo, as much as it was about holding both government and individuals accountable for centuries of unjust laws and actions (but this is not remotely my area).

The anti-testing movement will not be intimidated, nor is it going away.

I think that’s right, though reducing or eliminating teacher accountability based on state tests would probably at least reduce the extent to which the unions are actively encouraging opt-outs.

Some may choose to force districts to adopt a more comprehensive “dashboard” accountability system with multiple measures. Others may push districts to engage in biennial or grade-span testing, and still others may choose to opt out. What remains clear is that parents want more than tests to assess their children’s academic standing and, as a result, are choosing to opt out of an unjust, ineffective policy.

With respect to the first sentence, some states did this (and all states had the opportunity to do this in their waivers). With respect to the second sentence, it’s not clear to me how biennial or grade-span testing is any more “just” than yearly testing. Perhaps if these authors stated what they think is the optimal testing regimen from a “justice” perspective, that would help.

So, I don’t think it’s an especially convincing argument. But I don’t know that the pro-opt-out movement really needs convincing arguments. If parents have the right to opt their kids out of tests, at least some of them will do so. I suspect this will lead to increased inequity, but that’s an empirical question for another day.


[1] Were I omnipotent, I would enact that rule, and I’d also require private and homeschool kids to test.

A (quick, direct, 2000 word) response to Tucker on testing

There’s been a bit of a kerfuffle recently in the edu-Twittersphere, since Marc Tucker suggested that civil rights leaders ought to reconsider their support for annual testing [1]. Kati Haycock and Jonah Edelman wrote impassioned responses, which Tucker has just dismissed as not responding to his substantive arguments. He ends with this paragraph:

The facts ought to count for something. What both of these critiques come down to is an assertion that I don’t have any business urging established leaders of the civil rights community to reconsider the issue, that I simply don’t understand the obvious—that annual accountability testing is essential to justice for poor and minority students, that anyone who thinks otherwise must be in the pocket of the teachers unions.  Well, it is not obvious. Indeed, all the evidence says it is not true. And anyone who knows me knows that I am in no one’s pocket. I know the leaders of the civil rights community to be people of great integrity.  They aren’t in anyone’s pocket, either. I think they want what is best for the people they represent. And I do not think that is annual testing.

I think Mr. Tucker greatly overstates the evidence in his initial post, so I’m going to do my best to give a very brief and direct response to the substantive arguments he makes there. I do this not to defend Haycock and Edelman (whom I do not really know), but to defend the policy, which I believe is unfairly maligned in Tucker’s posts.

Let me start by saying that I am generally in favor of annual testing, though I am probably not as fervid in that support as some others in the “reform” camp. I do not believe that annual accountability testing is essential to justice for poor and minority students, but I do think high-quality tests at reasonable intervals would almost certainly be beneficial to them.

Okay, here goes.

1) In his initial post, Marc Tucker says,

First of all, the data show that, although the performance of poor and minority students improved after passage of the No Child Left Behind Act, it was actually improving at a faster rate before the passage of the No Child Left Behind Act.

That link is to a NAEP report that indeed provides descriptive evidence supporting Tucker’s point. However, there are at least two peer-reviewed articles using NAEP data that show positive causal impacts of NCLB using high-quality quasi-experimental designs, one on fourth grade math achievement only and the other on fourth and eighth grade math achievement and (suggestively) fourth grade reading. The latter is, to my eye, the most rigorous analysis that yet exists on this topic. There is a third article that uses cross-state NAEP data and does not find an impact, but again, the most recent analysis by Wong seems to me the most methodologically sophisticated of the lot and, therefore, the most trustworthy. I think if Tucker wants to talk NAEP data, he has to provide evidence of this quality that supports his position of “no effect” (or even “harm,” as he appears to be suggesting). Is there a quality analysis using a strong design that shows a negative impact on the slope of achievement gains caused by NCLB? I do not know of one.

I should also note that there are beaucoup within-state studies of the impacts of accountability policies that use regression discontinuity designs and find causal impacts. For instance: in North Carolina, in Florida, and in Wisconsin. In short: I don’t see any way to read the causal literature on school accountability and conclude that it has negative impacts on student achievement. I don’t even see any way to conclude it has neutral impacts, given how many strong-design studies find positive impacts relative to the few that find none.

2) Next, Tucker says:

Over the 15-year history of the No Child Left Behind Act, there is no data to show that it contributed to improved student performance for poor and minority students at the high school level, which is where it counts.

Here I think Marc is moving the goalposts a bit. Is high school performance of poor and minority students the target? Then I guess we may as well throw out all the above-cited studies. I know of no causal studies that directly investigate the impact on this particular outcome, so I think the best he’s got is the NAEP trends. And sure, trends in high school performance are relatively flat.

I’m not one to engage in misNAEPery, however, so I wouldn’t make too much of this. Nor would I make too much of the fact that high school graduation rates have increased for all groups (meaning tons more low-performing students who in days gone by would have dropped out are still around to take the NAEP in 12th grade, among other things). But I would make quite a bit of the fact that the above-cited causal studies obviously also apply to historically underserved groups (that is, while they rarely directly test the impact of accountability on achievement gaps, they very often test the impacts for different groups and find that all groups see the positive effects). And I would also note some evidence from North Carolina of direct narrowing effects on black-white gaps.

3) Next, we have:

Many nations that have no annual accountability testing requirements have higher average performance for poor and minority students and smaller gaps between their performance and the performance of majority students than we do here in the United States.  How can annual testing be a civil right if that is so?

There’s not much to say about this. It’s not based on any study I know of, certainly none that would suggest a causal impact one way or the other. But he’s right that we’re relatively alone in our use of annual testing, and therefore that many higher-achieving nations don’t have annual testing. They also don’t have many other policies that we have, so I’m not sure what’s to be learned from this observation.

4) Now he moves on to claim:

It is not just that annual accountability testing with separate scores for poor and minority students does not help those students.  The reality is that it actually hurts them. All that testing forces schools to buy cheap tests, because they have to administer so many of them.  Cheap tests measure low-level basic skills, not the kind of high-level, complex skills most employers are looking for these days.  Though students in wealthy communities are forced to take these tests, no one in those communities pays much attention to them.  They expect much more from their students. It is the schools serving poor and minority students that feed the students an endless diet of drill and practice keyed to these low-level tests.  The teachers are feeding these kids a dumbed down curriculum to match the dumbed down tests, a dumbed down curriculum the kids in the wealthier communities do not get.

This paragraph doesn’t have links, probably because it’s not well supported by the existing evidence. Certainly you hear this argument all the time, and I believe it may well be true that schools serving poor kids have worse curricula or more perverse responses to tests (even some of my own work suggests different kinds of instructional responses in different kinds of schools). But even if we grant that this impact is real, the literature on achievement effects certainly does not suggest harm. And the fact that graduation rates are skyrocketing certainly does not suggest harm. If he’s going to claim harm, he has to provide clear, compelling evidence of harm. This ain’t it. And finally here, a small point. I hate when people say schools are “forced” to do anything. States, districts, and schools were not forced to buy bad tests before. They have priorities, and they have prioritized cheap and fast. That’s a choice, not a matter of force.

5) Next, Tucker claims:

Second, the teachers in the schools serving mainly poor and minority kids have figured out that, from an accountability standpoint, it does them no good to focus on the kids who are likely to pass the tests, because the school will get no credit for it. At the same time, it does them no good to focus on the kids who are not likely to pass no matter what the teacher does, because the school will get no credit for that either. As a result, the faculty has a big incentive to focus mainly on the kids who are just below the pass point, leaving the others to twist in the wind.

I am certainly familiar with the literature cited here, and I don’t dispute any of it. Quite the contrary, I acknowledge the conclusion that the students who are targeted by the accountability system see the greatest gains. This has been shown in many well-designed studies, such as here, here, here, and here. But this is an argument about accountability policy design, not about annual testing. It simply speaks to the need for better accountability policies. For instance, suppose we thought the “bubble kids” problem was a bad one that needed solving. We could solve it tomorrow–simply create a system where all that matters is growth. Voila, no bubble kids! Of course there would be tradeoffs to that decision, so probably some mixture is better.
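
To make that design point concrete, here is a toy sketch (the cutoff, score scale, and all numbers are invented for illustration; this is not any state’s actual formula) showing how the marginal payoff to a school of raising one student’s score differs under a proficiency-rate metric versus a mean-growth metric:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
CUTOFF = 300  # hypothetical proficiency cut score

prior = rng.normal(300, 40, size=n)           # last year's scores
current = prior + rng.normal(5, 15, size=n)   # this year's scores

def proficiency_rate(scores):
    return (scores >= CUTOFF).mean()

def mean_growth(prior, current):
    return (current - prior).mean()

# Marginal payoff to the school of raising one student's score
# by 5 points, under each accountability metric.
for label, student in [("far below cutoff", current.argmin()),
                       ("just below cutoff", np.abs(current - (CUTOFF - 2)).argmin()),
                       ("far above cutoff", current.argmax())]:
    bumped = current.copy()
    bumped[student] += 5
    d_prof = proficiency_rate(bumped) - proficiency_rate(current)
    d_growth = mean_growth(prior, bumped) - mean_growth(prior, current)
    print(f"{label:18s}  d(prof. rate): {d_prof:+.4f}   d(mean growth): {d_growth:+.4f}")
```

Under the proficiency metric, only the student just below the cutoff moves the school’s rating at all–that is precisely the bubble-kid incentive–while under the growth metric, every student’s five points count identically.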

6) Then Tucker moves on to discuss the teaching force:

Not only is it true that annual accountability testing does not improve the performance of poor and minority students, as I just explained, but it is also true that annual accountability testing is making a major contribution to the destruction of the quality of our teaching force.

There’s no evidence for this. I know of not a single study that suggests that there is even a descriptive decrease in the quality of our teaching force in recent years. Certainly not one with a causal design of any kind that implicates annual accountability testing. And there is recent evidence that suggests improvements in the quality of the workforce, at least in certain areas such as New York and Washington.

7) Next, he takes on the distribution of teacher quality:

One of the most important features of these accountability systems is that they operate in such a way as to make teachers of poor and minority students most vulnerable.  And the result of that is that more and more capable teachers are much less likely to teach in schools serving poor and minority students.

It is absolutely true that the lowest quality teachers are disproportionately likely to serve the most disadvantaged students. But I know of not a single piece of evidence that this is caused by (or even made worse by) annual testing and accountability policies. My hunch is that this has always been true, but that’s just a hunch. If Tucker has evidence, he should provide it.

8) The final point is one that hits close to home:

Applications to our schools of education are plummeting and deans of education are reporting that one of the reasons is that high school graduates who have alternatives are not selecting teaching because it looks like a battleground, a battleground created by the heavy-handed accountability systems promoted by the U.S. Department of Education and sustained by annual accountability testing.

As someone employed at a school of education, I can say the first clause here is completely true. And we’re quite worried about it. But again, I know of not a single piece of even descriptive evidence that suggests this is due to annual accountability testing. Annual accountability testing has been around for well over a decade. Why would the impact be happening right now?

I think these are the main arguments in Tucker’s piece, and I have provided evidence or argumentation here that suggests that not one of them is supported by the best academic research that exists today. Perhaps the strongest argument of the eight is the second one, but again I know of no quality research that attributes our relative stagnation on 12th grade NAEP to annual accountability testing. That does not mean Tucker is wrong. But it does mean that he is the one who should bear the burden of providing evidence to support his positions, not Haycock and Edelman. I don’t believe he can produce such evidence, because I don’t believe it exists.


[1] I think it’s almost universally a bad idea to tell civil rights leaders what to do.

Research you should read – on the impact of NCLB

This is the first in what will be a mainstay of this blog–a discussion of a recent publication (peer-reviewed or not) that I think more folks should be reading and citing. Today’s article is both technically impressive and substantively important. It has the extremely un-thrilling name “Adding Design Elements to Improve Time Series Designs: No Child Left Behind as an Example of Causal Pattern-Matching,” and it appears in the most recent issue of the Journal of Research on Educational Effectiveness (the journal of the excellent SREE organization) [1].

The methodological purpose of this article is to add “design elements” to the Comparative Interrupted Time Series (CITS) design, a common quasi-experimental design used to evaluate the causal impact of all manner of district- or state-level policies. The substantive purpose is to identify the causal impact of NCLB on student achievement using NAEP data. While the latter has already been done (see, for instance, Dee and Jacob), this article strengthens Dee and Jacob’s findings through its design-elements analysis.
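
For readers who don’t live in this literature, a generic CITS specification looks something like the following (my notation, a textbook version rather than necessarily the exact model estimated in the paper):

$$
\begin{aligned}
Y_{gt} = \beta_0 &+ \beta_1 (t - t_0) + \beta_2 \, \text{Post}_t + \beta_3 \, \text{Post}_t (t - t_0) \\
&+ \beta_4 \, \text{Treat}_g + \beta_5 \, \text{Treat}_g (t - t_0) + \beta_6 \, \text{Treat}_g \text{Post}_t \\
&+ \beta_7 \, \text{Treat}_g \text{Post}_t (t - t_0) + \varepsilon_{gt}
\end{aligned}
$$

Here $Y_{gt}$ is mean achievement for group $g$ in year $t$, $t_0$ is the year NCLB took effect, $\text{Treat}_g$ flags the group subject to the policy (public schools), and $\text{Post}_t$ flags post-NCLB years. The policy effect shows up in $\beta_6$ (a level shift) and $\beta_7$ (a trend shift) for the treated group, over and above any shift in the comparison group. The “design elements” amount to estimating this same contrast across the multiple comparisons listed below.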

In essence, what design elements bring to the CITS design for evaluating NCLB is a greater degree of confidence in the causal conclusions. Wong and colleagues, in particular, demonstrate NCLB’s impacts in multiple ways:

  • By comparing public and Catholic schools.
  • By comparing public and non-Catholic private schools.
  • By comparing states with high proficiency bars and low ones.
  • By using tests in 4th and 8th grade math and 4th grade reading.
  • By using Main NAEP and long-term trend NAEP.
  • By comparing changes in mean scores and time-trends.

The substantive findings are as follows:

1. We now have national estimates of the effects of NCLB by 2011.

2. We now know that NCLB affected eighth-grade math, something not statistically confirmed in either Wong, Cook, Barnett, and Jung (2008) or Dee and Jacob (2011) where positive findings were limited to fourth-grade math.

3. We now have consistent but statistically weak evidence of a possible, but distinctly smaller, fourth-grade reading effect.

4. Although it is not clear why NCLB affected achievement, some possibilities are now indicated.

These possibilities include a) consequential accountability, b) higher standards, and c) the combination of the two.

So why do I like this article so much? Well, of course, one reason is that it supports what I believe to be the truth about consequential standards-based accountability–that it has real, meaningfully large impacts on student outcomes [2][3]. But I also think this article is terrific because of its incredibly thoughtful design and execution and its clever use of freely available data. Regardless of one’s views on NCLB, this should be an article for policy researchers to emulate. And that’s why you should read it.


[1] This article, like many articles I’ll review on this blog, is paywalled. If you want a PDF and don’t have access through your library, send me an email.

[2] See this post for a concise summary of my views on this issue.

[3] Edited to add: I figured it would be controversial to say that I liked an article because it agreed with my priors. Two points. First, I think virtually everyone prefers research that agrees with their priors, so I’m merely being honest; deal with it. Second, as Sherman Dorn points out via Twitter, this is conjunctional–I like it because it’s a very strong analysis AND it agrees with my priors. If it was a shitty analysis that agreed with my priors, I wouldn’t have blogged about it.