Do the content and quality of state tests matter?

Over at Ahead of the Heard, Chad Aldeman has written about the recent Mathematica study, which found that PARCC and MCAS were equally predictive of early college success. He essentially argues that if all tests are equally predictive, states should just choose the cheapest bargain-basement test, content and quality be damned. He offers a list of reasons, which you’re welcome to read.

As you’d guess, I disagree with this argument. I’ll offer a list of reasons of my own here.

  1. The most obvious point is that we have reasonable evidence that testing drives instructional responses to standards. Thus, if the tests used to measure and hold folks/schools accountable are lousy and contain poor quality tasks, we’ll get poor quality instruction as well. This is why many folks are thinking these days that better tests should include tasks that are much closer to the kinds of things we want kids to actually be doing. In that case, “teaching to the test” becomes “good teaching.” May be a pipe dream, but that’s something I commonly hear.
  2. A second fairly obvious point is that switching to a completely unaligned test would end any possible notion that the tests could provide feedback to teachers about what they should be doing differently/better. Certainly we can all argue that current test results are provided too late to be useful–though smart testing vendors ought to be working on this issue as hard as possible–but if the test is in no way related to what teachers are supposed to be teaching, it’s definitely useless to them as a formative measure.
  3. Chad’s analysis seems to prioritize predictive validity–how well do results from the test predict other desired outcomes–over all the other types of validity evidence (a stylized version of what that means appears just after this list). It’s not clear to me why we should prefer predictive validity (especially when we already have evidence that GPAs do better at that than most standardized tests, though SAT/ACT adds a little) over, say, content-related validity. Don’t we first and foremost want the test to be a good measure of what students were supposed to have learned in the grade? More generally, I think it makes more sense to have different tests for different purposes, rather than piling all the purposes into a single test.
  4. Certainly if the tests are going to have stakes attached to them, the courts require a certain level of content validity (or what they’ve called instructional validity). See Debra P. v. Turlington. If a kid’s going to be held accountable, they need to have had the opportunity to learn what was on the test. If the test is the SAT, that’s probably not going to happen.
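
To make point 3 concrete: predictive validity is usually quantified as how well a test score predicts some later outcome, often as the extra variance the test explains once high school GPA is already in the model. Here is a stylized sketch (the outcome, predictors, and setup are illustrative, not the Mathematica study's actual specification):

\[
\mathrm{GPA}_{\text{college}} = \beta_0 + \beta_1\,\mathrm{GPA}_{HS} + \beta_2\,\mathrm{Test} + \varepsilon,
\qquad
\Delta R^2 = R^2(\mathrm{GPA}_{HS}, \mathrm{Test}) - R^2(\mathrm{GPA}_{HS}).
\]

On this reading, "GPAs do better than tests, and SAT/ACT adds a little" just means that most of the explained variance comes from high school GPA alone, and the incremental \(\Delta R^2\) from adding the test score is small.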

Anyway, take a look at the Mathematica report (you should anyway!) and Chad’s post and let me know what you think.

Research you should read: On the distribution of teachers

Today’s installment of “Research you should read” comes to us from Educational Researcher. The paper is “Uneven playing field? Assessing the teacher quality gap between advantaged and disadvantaged students,” and it’s by Dan Goldhaber and colleagues. This is a beautifully done analysis that accomplishes several goals:

  1. It quantifies the degree of teacher sorting based on multiple teacher characteristics, including both input (e.g., credentials) and output (e.g., estimates of effectiveness) measures.
  2. It examines that sorting across multiple indicators of student disadvantage.
  3. It does (1) and (2) for an entire state.
  4. It identifies the sources of the inequitable distribution (e.g., is it mostly due to between-school or within-school sorting?).

The results are intensely sobering, though not at all surprising:

We demonstrate that in Washington state elementary school, middle school, and high school classrooms, virtually every measure of teacher quality—experience, licensure exam score, and value-added estimates of effectiveness—is inequitably distributed across every indicator of student disadvantage—free/reduced-price lunch status (FRL), underrepresented minority (URM), and low prior academic performance (the sole exception being licensure exam scores in high school math classrooms).

In short, poor kids, kids of color, and low-achieving kids systematically get access to lower quality teachers, any way you define “quality” [1].

The authors also note that most of the sorting is between schools and between districts, rather than within schools, at least for most of these measures. This is also not surprising, but it of course makes addressing this problem all the more difficult. It’s one thing to reassign teachers within schools (though even that is probably much easier said than done). It’s an entirely different thing to find ways to redistribute teachers across schools or districts without raising the hackles of the broad swath of the electorate that wants government to keep its hands off the public education system.
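
For the curious, the between/within distinction can be made precise with a standard decomposition of the gap: write each student's teacher quality as a within-school deviation plus the school's deviation from its district mean plus the district mean, and the disadvantaged-versus-advantaged gap splits into three pieces. This is a generic sketch of that kind of decomposition, not necessarily the authors' exact specification:

\[
\bar{Q}_D - \bar{Q}_A =
\underbrace{\big(\mathbb{E}_D[Q_i - \bar{Q}_{s(i)}] - \mathbb{E}_A[Q_i - \bar{Q}_{s(i)}]\big)}_{\text{within school}}
+ \underbrace{\big(\mathbb{E}_D[\bar{Q}_{s(i)} - \bar{Q}_{d(i)}] - \mathbb{E}_A[\bar{Q}_{s(i)} - \bar{Q}_{d(i)}]\big)}_{\text{between schools, within district}}
+ \underbrace{\big(\mathbb{E}_D[\bar{Q}_{d(i)}] - \mathbb{E}_A[\bar{Q}_{d(i)}]\big)}_{\text{between districts}}
\]

where \(Q_i\) is the quality measure for student \(i\)'s teacher, \(\bar{Q}_{s(i)}\) and \(\bar{Q}_{d(i)}\) are the student's school and district means, and \(D\) and \(A\) index disadvantaged and advantaged students. Saying most of the sorting happens between schools and districts is saying the first term is small relative to the other two.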

There are undoubtedly many causes of this (frankly, abhorrent) set of findings. The authors list or suggest several:

  • Higher-quality teachers are more likely to leave districts serving more disadvantaged kids, likely because of both pay and working conditions.
  • Existing pay structures create little incentive to work in more disadvantaged settings (often it’s the opposite–the more disadvantaged districts pay less than the tonier suburban districts).
  • Student teaching may contribute to sorting, with the most advantaged districts snatching up the most qualified candidates.
  • Collective bargaining agreements often give more senior teachers preference in terms of teaching assignments, which they use to make within-district transfers from more to less disadvantaged schools.
  • School leaders may give their best or most experienced teachers within-school preferences in terms of teaching assignments.

These are not easily remedied, but there are certainly policy innovations that might help. The most obvious is that we should pay teachers who teach in more disadvantaged settings more, not less. This is certainly true between districts, but it ought to be true within districts as well. The authors cite evidence that these bonuses can induce desirable behaviors. Another is that we need to address the underlying challenges of teaching in more disadvantaged schools, particularly working conditions. Several recent studies have shown the powerful influence of working conditions on teachers’ employment decisions and on their improvement as professionals.

I do not know whether state or federal policymakers should get involved in this issue. As a big government guy who is concerned about the way our school system treats those who are most disadvantaged, my inclination is to say yes. My hope is that some states can lead the way, creating new laws and systems that, at a minimum, make it equally likely that a poor kid and a rich one in a public school can get access to a good teacher. The status quo on this issue clearly is not working for our most disadvantaged kids.


[1] Of course there could be some other undefined measure of quality that’s not distributed this way, but I’ve not seen any evidence of that.

Research you should read: On the impact of NCLB

This is the first in what will be a mainstay of this blog–a discussion of a recent publication (peer-reviewed or not) that I think more folks should be reading and citing. Today’s article is both technically impressive and substantively important. It has the extremely un-thrilling name “Adding Design Elements to Improve Time Series Designs: No Child Left Behind as an Example of Causal Pattern-Matching,” and it appears in the most recent issue of the Journal of Research on Educational Effectiveness (the journal of the excellent SREE organization) [1].

The methodological purpose of this article is to add “design elements” to the Comparative Interrupted Time Series design (a common quasi-experimental design used to evaluate the causal impact of all manner of district- or state-level policies). The substantive purpose is to identify the causal impact of NCLB on student achievement using NAEP data. While the latter has already been done (see for instance Dee and Jacob), this article strengthens Dee and Jacob’s findings through its design elements analysis.
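
If you haven't seen a CITS before, the basic specification looks something like the model below. This is a generic sketch, not the exact model Wong and colleagues estimate:

\[
Y_{st} = \beta_0 + \beta_1 t + \beta_2 \mathrm{Post}_t + \beta_3 (t \times \mathrm{Post}_t)
+ \beta_4 \mathrm{Pub}_s + \beta_5 (\mathrm{Pub}_s \times t)
+ \beta_6 (\mathrm{Pub}_s \times \mathrm{Post}_t) + \beta_7 (\mathrm{Pub}_s \times t \times \mathrm{Post}_t) + \varepsilon_{st}
\]

where \(Y_{st}\) is mean NAEP achievement for sector \(s\) in year \(t\), \(\mathrm{Post}_t\) flags years after NCLB took effect, and \(\mathrm{Pub}_s\) distinguishes public schools (which faced NCLB accountability) from a comparison sector such as Catholic schools (which did not). The interrupted time series part is \(\beta_2\) and \(\beta_3\); the comparative part is \(\beta_6\) (the post-NCLB shift in level for public schools relative to the comparison) and \(\beta_7\) (the relative change in trend), which is where a causal effect of NCLB would show up. The design elements are essentially replications of this same contrast across the different comparison groups, subjects, grades, and NAEP assessments listed below.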

In essence, what design elements bring to the CITS design for evaluating NCLB is a greater degree of confidence in the causal conclusions. Wong and colleagues, in particular, demonstrate NCLB’s impacts in multiple ways:

  • By comparing public and Catholic schools.
  • By comparing public and non-Catholic private schools.
  • By comparing states with high proficiency bars and low ones.
  • By using tests in 4th and 8th grade math and 4th grade reading.
  • By using Main NAEP and long-term trend NAEP.
  • By comparing changes in mean scores and time-trends.

The substantive findings are as follows:

1. We now have national estimates of the effects of NCLB by 2011.

2. We now know that NCLB affected eighth-grade math, something not statistically confirmed in either Wong, Cook, Barnett, and Jung (2008) or Dee and Jacob (2011) where positive findings were limited to fourth-grade math.

3. We now have consistent but statistically weak evidence of a possible, but distinctly smaller, fourth-grade reading effect.

4. Although it is not clear why NCLB affected achievement, some possibilities are now indicated.

These possibilities include a) consequential accountability, b) higher standards, and c) the combination of the two.

So why do I like this article so much? Well, of course, one reason is that it supports what I believe to be the truth about consequential standards-based accountability–that it has real, meaningfully large impacts on student outcomes [2][3]. But I also think this article is terrific because of its incredibly thoughtful design and execution and its clever use of freely available data. Regardless of one’s views on NCLB, this should be an article for policy researchers to emulate. And that’s why you should read it.


[1] This article, like many articles I’ll review on this blog, is paywalled. If you want a PDF and don’t have access through your library, send me an email.

[2] See this post for a concise summary of my views on this issue.

[3] Edited to add: I figured it would be controversial to say that I liked an article because it agreed with my priors. Two points. First, I think virtually everyone prefers research that agrees with their priors, so I’m merely being honest; deal with it. Second, as Sherman Dorn points out via Twitter, this is conjunctional–I like it because it’s a very strong analysis AND it agrees with my priors. If it was a shitty analysis that agreed with my priors, I wouldn’t have blogged about it.