Testing my patience

PBS is out with a truly awful report on testing/opt out/Common Core. You can watch it here and read one takedown here.

I’m not going to do a full takedown, but I’ll highlight a few points that weren’t made by Will Ragland.

  1. Hagopian says testing is a multi-billion dollar industry. That’s true but overwrought and misleading. We have 50 million kids in school: spend $20 a kid per year and you’re at a billion dollars (see the quick arithmetic sketch after this list). Yes, we spend billions evaluating how well kids are learning, but that’s far less than 1% of our total education dollars, in exchange for some evaluation of how our system is doing. Seems like a perfectly reasonable amount to me (if anything, it’s too little, and our limited spending on assessment has resulted in some of the poor-quality tests we’ve seen over the years). Saving that sliver of a percent wouldn’t do anything meaningful to reduce class sizes or boost teacher salaries or whatever else Hagopian would like us to do, even if we cut testing expenses to zero.
  2. There’s an almost farcically absurd analogy in which testing proponents supposedly think a kid with hypothermia just needs to have his temperature taken over and over again, whereas teachers just know to wrap the kid in a blanket. First, given horrendous outcomes for many kids, it seems like at least a handful of educators (or perhaps more accurately, the system as a whole) have neglected their blanketing duties more often than we’d care to note. Second, these test data are used in dozens of ways to help support and improve schools, especially in states that have waivers (which, admittedly, Washington is not).
  3. Complaining about a test-and-punish philosophy in Washington State is pretty laughable, since there’s no exit exam for kids [CORRECTION: there appear to be some new exit-exam requirements being rolled out in the state, though students did not opt out of those exams; apologies that I did not catch this earlier; I was referring to old data], no high-stakes teacher evaluation, and less accountability for schools than there was during the NCLB era (though parents did get a letter about their school’s performance …). Who, exactly, is being punished, and how?
  4. Finally, the report lumps together Common Core with all kinds of things that are not related to Common Core, such as the “100+ standardized tests” argument and the MAP test. Common Core says literally nothing at all about testing, and it certainly has nothing to do with a district-level benchmark test.
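For what it’s worth, here’s the back-of-the-envelope arithmetic behind point 1. The roughly $600 billion figure for total annual U.S. K-12 spending is my own ballpark assumption, not a number from the PBS report:

\[
50{,}000{,}000 \text{ students} \times \$20 \text{ per student per year} \approx \$1 \text{ billion}, \qquad
\frac{\$1 \text{ billion}}{\sim\$600 \text{ billion total K-12 spending}} \approx 0.2\%.
\]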

It shouldn’t be too much to ask for a respected news organization to get very basic details about major education policies that have existed for 4+ years correct. Instead, we get misleading, unbalanced nonsense that will contribute to the tremendous levels of misinformation we see among voters about education policy.


Testing tradeoffs

Life is a series of tradeoffs. Perhaps nowhere in education is that clearer than in assessment policy.

What brings this to mind are Motoko Rich’s and Catherine Gewertz’s recent articles about scoring Common Core tests. I think both of these articles are good, and they both illustrate some of the challenges of doing what we’re trying to do at scale. But it’s also clear that some anti-test folks are using these very complicated issues as fodder for their agendas, and that’s disappointing (if totally expected). Here are some of the key quotes from Motoko’s article, and the tradeoffs they illustrate.

On Friday, in an unobtrusive office park northeast of downtown here, about 100 temporary employees of the testing giant Pearson worked in diligent silence scoring thousands of short essays written by third- and fifth-grade students from across the country. There was a onetime wedding planner, a retired medical technologist and a former Pearson saleswoman with a master’s degree in marital counseling. To get the job, like other scorers nationwide, they needed a four-year college degree with relevant coursework, but no teaching experience. They earned $12 to $14 an hour, with the possibility of small bonuses if they hit daily quality and volume targets.

Tradeoff: We think we want teachers to be involved in the scoring of these tests (presumably because we believe teachers possess some special expertise) [1]. But teachers cost more than $12 to $14 an hour, and we’re in an era where every dollar spent on testing is endlessly scrutinized, so we instead use educated people who are not teachers.

At times, the scoring process can evoke the way a restaurant chain monitors the work of its employees and the quality of its products. “From the standpoint of comparing us to a Starbucks or McDonald’s, where you go into those places you know exactly what you’re going to get,” said Bob Sanders, vice president of content and scoring management at Pearson North America, when asked whether such an analogy was apt.

Tradeoff: We have a huge system in this country, and we want results that are comparable across schools. But comparability in a large system requires some degree of standardization, and standardization at that level of scale requires processes that look, well, standardized and corporate.

For exams like the Advanced Placement tests given by the College Board, scorers must be current college professors or high school teachers who have at least three years of experience teaching the subject they are scoring.

Tradeoff: We want to test everyone. This means the volume of scoring is tremendously larger than for the AP exams (about 12 million test takers vs. about 1 million), which again means we may not be able to find enough teachers to do the work.

“You’re asking people still, even with the best of rubrics and evidence and training, to make judgments about complex forms of cognition,” Mr. Pellegrino said. “The more we go towards the kinds of interesting thinking and problems and situations that tend to be more about open-ended answers, the harder it is to get objective agreement in scoring.”

Tradeoff: We want more challenging, open-ended, complex tasks. But scoring those tasks at scale is harder to do reliably.

There are of course other big tradeoffs that aren’t highlighted in these articles. For instance:

  • The tradeoff between test cost and transparency: building items is very expensive, so releasing items and having to create new ones every year would add to test costs while enhancing transparency.
  • The tradeoff between testing time and the nature of the task: multiple-choice items are quicker to complete, but they may not fully tap the kinds of skills we want to measure.
  • The tradeoff between testing time and the comprehensiveness of the assessment: shorter tests can probably give us a reasonable estimate of overall math and reading proficiency, but they will not give us the fine-grained, actionable data we might want to inform instructional responses (and they might contribute to “narrowing the curriculum” if they repeatedly sample the same content).
  • The tradeoff between open-response items and fast scoring: multiple-choice items, especially on computers, can be scored virtually instantaneously, whereas open-response items take time to score. So faster feedback may butt up against our desire for better items.
  • The tradeoffs associated with testing on computers: e.g., spending money on computers vs. other things, and the advantages of adaptive testing vs. the need to teach kids how to take tests on computers.

I will also note that this kind of reporting could, in my mind, be strengthened with more empirical evidence. For instance,

“Even as teachers, we’re still learning what the Common Core state standards are asking,” Ms. Siemens said. “So to take somebody who is not in the field and ask them to assess student progress or success seems a little iffy.”

Are teachers better scorers than non-teachers, or not? That’s an empirical question. I would be reasonably confident that Pearson has in place a good process for determining who the best scorers are from the standpoint of reliability. My guess is that some of the best scorers are teachers, and some are not.
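For readers who wonder what answering that kind of empirical question might look like in practice, here is a minimal sketch, assuming a validation sample of essays with trusted “anchor” scores from an expert committee. Everything in it (the sample, the slip rates, the scorer labels) is simulated and purely illustrative; it is not Pearson’s actual process, and the numbers it prints mean nothing beyond the simulation.

```python
# Hypothetical illustration only: NOT Pearson's process. It sketches how one
# could compare teacher and non-teacher scorers empirically, assuming each
# essay in a validation sample has a trusted "anchor" score (0-4) set by an
# expert committee. All data below are simulated.
import numpy as np

rng = np.random.default_rng(0)

# Pretend validation sample: 500 essays with expert anchor scores of 0-4.
anchor = rng.integers(0, 5, size=500)

def simulate_scorer(anchor_scores, slip_rate, rng):
    """Synthetic scorer: matches the anchor except for occasional off-by-one slips."""
    scores = anchor_scores.copy()
    slips = rng.random(scores.size) < slip_rate
    scores[slips] = np.clip(scores[slips] + rng.choice([-1, 1], size=slips.sum()), 0, 4)
    return scores

def exact_agreement(a, b):
    """Share of essays on which the two score sets match exactly."""
    return float(np.mean(a == b))

def quadratic_weighted_kappa(a, b, n_cats=5):
    """Chance-corrected agreement that penalizes larger score gaps more heavily;
    a statistic commonly reported for essay-scoring reliability."""
    obs = np.zeros((n_cats, n_cats))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    exp = np.outer(np.bincount(a, minlength=n_cats),
                   np.bincount(b, minlength=n_cats)).astype(float)
    exp /= exp.sum()
    w = np.fromfunction(lambda i, j: (i - j) ** 2, (n_cats, n_cats))
    w /= w.max()
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Compare two made-up scorer groups against the anchors.
for label, slip_rate in [("teacher", 0.10), ("non-teacher", 0.15)]:
    scored = simulate_scorer(anchor, slip_rate, rng)
    print(f"{label:12s} exact agreement = {exact_agreement(anchor, scored):.2f}, "
          f"quadratic-weighted kappa = {quadratic_weighted_kappa(anchor, scored):.2f}")
```

In a real comparison, the anchor scores would come from expert-scored validation papers and the two groups would be actual scorer pools rather than simulated slip rates; the point is only that “are teachers better scorers?” is a measurable question, not a matter of intuition.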

Some teachers question whether scorers can grade fairly without knowing whether a student has struggled with learning difficulties or speaks English as a second language.

Is there evidence that the test scoring is biased against students with disabilities or ELLs, or not? That’s also an empirical question. Again I would guess that Pearson has in place a process to weed out construct-irrelevant variance to the maximum extent possible.

Overall, I think it’s great that writers like Motoko and Catherine are tackling these challenging issues. But I hope it’s not lost on readers that, like everything in life, testing requires tradeoffs that are not easily navigated.


[1] It’s not obvious to me this is true, though it may well be. Regardless, scoring items would likely be a good professional development opportunity for teachers.

Some quick thoughts on opt out

In general, I have not opined much on the subject of “opt out,” for a number of reasons. First, there’s little to no good data or research on the topic, so my opinions can’t be as informed as I would typically like them to be. Second, I don’t know that I have much to add on the issue (and yet I’m about to give my two cents). Third, it’s a trend that actively worries me as someone who believes research clearly shows that tests and accountability have been beneficial overall. I don’t really see much that policymakers can do to stop this trend short of requiring public school students to test [1].

Despite my best efforts to avoid the subject, over on Twitter, former MCPS Superintendent Joshua Starr asked me what I think of this EdWeek commentary on opt out. Here are some excerpts of their argument and my reactions.

First, the title is “Test-taking ‘compliance’ does not ensure equity.” The authors probably did not write this title, but it’s a very weak straw man. I know of few, if any, folks who believe that test-taking compliance ensures equity. I certainly don’t believe that. I do believe having good data can help equity, but it certainly doesn’t ensure it.

Some parents have elected to opt their children out of the annual tests as a message of protest, signaling that a test score is not enough to ensure excellence and equity in the education of their children. Parents, they insist, have a right to demand an enriched curriculum that includes the arts, civics, and lab sciences, and high-quality schools in their neighborhoods.

I don’t have good evidence on this (I don’t think anyone yet does, but hopefully several savvy doctoral students are working on the topic), but my very strong sense is that the folks opting out of tests are not typically doing it as an equity protest. Everything I’ve seen and heard so far says this is largely, but not exclusively, a white, upper-middle-class, suburban/rural phenomenon [EDITED TO ADD: Matt Chingos has done a preliminary analysis of this issue and largely agrees with this characterization: http://www.brookings.edu/research/papers/2015/06/18-chalkboard-who-opts-out-chingos]. My conversations with educators in California, for instance, suggest that the high rates of opt-out in some affluent high schools are because the exam was seen as meaningless and as interfering with students’ ability to prepare for other exams that actually matter to them (e.g., AP exams, the SAT).

Since it was signed into law in 2002, No Child Left Behind has done little to advance the educational interests of our most disadvantaged students. What’s more, the high-stakes-testing climate that NCLB created has also been connected to increased discipline rates for students of color and students with disabilities.

I think the first sentence there is not correct: as I showed in the previous post, there’s evidence that achievement has increased due to NCLB for all groups, including the most disadvantaged (though not much evidence that it has narrowed gaps). I’m not aware of well-designed research supporting the latter claim about discipline rates, but that’s not my area. Regardless, as I also discussed in the last post, sweeping claims of harm to disadvantaged students are hard to square with empirical evidence on outcomes such as test scores and graduation rates.

And even after these tests reveal large outcome gaps, schools serving poor children of color remain underfunded and are more likely to be labeled failing. Most states have done nothing to intervene effectively in these schools, even when state officials have taken over school districts. Moreover, despite NCLB’s stated goal of closing the achievement gap, wide disparities in academic outcomes persist.

I think this is mostly true, though of course it depends on the state (some states are much more adequate and equitable in their funding than others). And the lack of intervention in low-performing schools is really about a lack of effective intervention, though I’d be very curious what interventions these authors would recommend. It’s true that achievement gaps persist, though I believe racial (but not income) gaps are about as small now as they’ve ever been.

We are not opposed to assessments, especially when they are used for diagnostic purposes to support learning. But the data produced by annual standardized tests are typically not made available to teachers until after the school year is over, thereby making it impossible to use the information to respond to student needs.

Some of the new state tests get data back faster. For instance, some California results were made available to teachers before the end of the year. In general, I think it’s a bad idea to heap too many different goals onto a single test. It’s not clear to me that we always want our accountability test to also be our formative, immediate-feedback test; those probably should be different tests. But that doesn’t necessarily obviate the need for an external accountability test.

Thus, students of color are susceptible to all of the negative effects of the annual assessments, without any of the positive supports to address the learning gaps. When testing is used merely to measure and document inequities in outcomes, without providing necessary supports, parents have a right to demand more.

Again, I think the intention of both the original NCLB and the waivers was that, in the early years of school “failure,” students would be provided with additional supports and options (e.g., through supplemental education services and public school choice) to improve. Those turned out not to work, and perhaps future supports won’t either, but it’s not necessarily for lack of effort. I’m curious what specific supports these authors would advocate, bearing in mind the intense hostility among half our nation to raising any additional funds for schools or anything else.

The civil rights movement has never supported compliance with unjust laws and policies. Rather, it has always worked to challenge them and support the courageous actions of those willing to resist. As young people and their allies protest throughout the country against police brutality, demanding that “black lives matter,” we are reminded that the struggle for justice often forces us to hold governments and public officials accountable to reject the status quo. Today’s status quo in education is annual assessments that provide no true path toward equity or excellence.

This strikes me as a stretch, though I agree with the first half of it. I’m not sure the “black lives matter” movement was really about holding the government accountable to reject the status quo, as much as it was about holding both government and individuals accountable for centuries of unjust laws and actions (but this is not remotely my area).

The anti-testing movement will not be intimidated, nor is it going away.

I think that’s right. Though reducing or eliminating teacher accountability based on state tests would probably at least reduce the extent to which the unions are actively encouraging opt-outs.

Some may choose to force districts to adopt a more comprehensive “dashboard” accountability system with multiple measures. Others may push districts to engage in biennial or grade-span testing, and still others may choose to opt out. What remains clear is that parents want more than tests to assess their children’s academic standing and, as a result, are choosing to opt out of an unjust, ineffective policy.

With respect to the first sentence, some states did this (and all states had the opportunity to do so in their waivers). With respect to the second sentence, it’s not clear to me how biennial or grade-span testing is any more “just” than yearly testing. Perhaps if these authors stated what they think is the optimal testing regimen from a “justice” perspective, that would help.

So, I don’t think it’s an especially convincing argument. But I don’t know that the pro-opt-out movement really needs convincing arguments. If parents have the right to opt their kids out of tests, at least some of them will do so. I suspect this will lead to increased inequity, but that’s an empirical question for another day.


[1] Were I omnipotent, I would enact that rule, and I’d also require private and homeschool kids to test.

A (quick, direct, 2000 word) response to Tucker on testing

There’s been a bit of a kerfuffle recently in the edu-Twittersphere, since Marc Tucker suggested that civil rights leaders ought to reconsider their support for annual testing [1]. Kati Haycock and Jonah Edelman wrote impassioned responses, which Tucker has just dismissed as not responding to his substantive arguments. He ends with this paragraph:

The facts ought to count for something. What both of these critiques come down to is an assertion that I don’t have any business urging established leaders of the civil rights community to reconsider the issue, that I simply don’t understand the obvious—that annual accountability testing is essential to justice for poor and minority students, that anyone who thinks otherwise must be in the pocket of the teachers unions.  Well, it is not obvious. Indeed, all the evidence says it is not true. And anyone who knows me knows that I am in no one’s pocket. I know the leaders of the civil rights community to be people of great integrity.  They aren’t in anyone’s pocket, either. I think they want what is best for the people they represent. And I do not think that is annual testing.

I think Mr. Tucker greatly overstates the evidence in his initial post, so I’m going to do my best to give a very brief and direct response to the substantive arguments he makes there. I do this not to defend Haycock and Edelman (whom I do not really know), but to defend the policy, which I believe is unfairly maligned in Tucker’s posts.

Let me start by saying that I am generally in favor of annual testing, though I am probably not as fervid in that support as some others in the “reform” camp. I do not believe that annual accountability testing is essential to justice for poor and minority students, but I do think high-quality tests at reasonable intervals would almost certainly be beneficial to them.

Okay, here goes.

1) In his initial post, Marc Tucker says,

First of all, the data show that, although the performance of poor and minority students improved after passage of the No Child Left Behind Act, it was actually improving at a faster rate before the passage of the No Child Left Behind Act.

That link is to a NAEP report that indeed provides descriptive evidence supporting Tucker’s point. However, there are at least two peer-reviewed articles using NAEP data that show positive causal impacts of NCLB using high-quality quasi-experimental designs, one on fourth-grade math achievement only and the other on fourth- and eighth-grade math achievement and (suggestively) fourth-grade reading. The latter is, to my eye, the most rigorous analysis that yet exists on this topic. There is a third article that uses cross-state NAEP data and does not find an impact, but again the most recent analysis by Wong seems to me to be the most methodologically sophisticated of the lot and, therefore, the most trustworthy. I think if Tucker wants to talk NAEP data, he has to provide evidence of this quality that supports his position of “no effect” (or even “harm,” as he appears to be suggesting). Is there a quality analysis using a strong design that shows a negative impact of NCLB on the slope of achievement gains? I do not know of one.

I should also note that there are beaucoup within-state studies of the impacts of accountability policies that use regression discontinuity designs and find causal impacts. For instance: in North Carolina, in Florida, and in Wisconsin. In short: I don’t see any way to read the causal literature on school accountability and conclude that it has negative impacts on student achievement. I don’t even see any way to conclude it has neutral impacts, given the large number of studies finding positive impacts relative to the handful with strong designs that find none.

2) Next, Tucker says:

Over the 15-year history of the No Child Left Behind Act, there is no data to show that it contributed to improved student performance for poor and minority students at the high school level, which is where it counts.

Here I think Marc is moving the goalposts a bit. Is high school performance of poor and minority students the target? Then I guess we may as well throw out all the above-cited studies. I know of no causal studies that directly investigate the impact on this particular outcome, so I think the best he’s got is the NAEP trends. And sure, trends in high school performance are relatively flat.

I’m not one to engage in misNAEPery, however, so I wouldn’t make too much of this. Nor would I make too much of the fact that high school graduation rates have increased for all groups (meaning tons more low-performing students who in days gone by would have dropped out are still around to take the NAEP in 12th grade, among other things). But I would make quite a bit of the fact that the above-cited causal studies obviously also apply to historically underserved groups (that is, while they rarely directly test the impact of accountability on achievement gaps, they very often test the impacts for different groups and find that all groups see the positive effects). And I would also note some evidence from North Carolina of direct narrowing effects on black-white gaps.

3) Next, we have:

Many nations that have no annual accountability testing requirements have higher average performance for poor and minority students and smaller gaps between their performance and the performance of majority students than we do here in the United States.  How can annual testing be a civil right if that is so?

There’s not much to say about this. It’s not based on any study I know of, certainly none that would suggest a causal impact one way or the other. But he’s right that we’re relatively alone in our use of annual testing, and therefore that many higher-achieving nations don’t have annual testing. They also don’t have many other policies that we have, so I’m not sure what’s to be learned from this observation.

4) Now he moves on to claim:

It is not just that annual accountability testing with separate scores for poor and minority students does not help those students.  The reality is that it actually hurts them. All that testing forces schools to buy cheap tests, because they have to administer so many of them.  Cheap tests measure low-level basic skills, not the kind of high-level, complex skills most employers are looking for these days.  Though students in wealthy communities are forced to take these tests, no one in those communities pays much attention to them.  They expect much more from their students. It is the schools serving poor and minority students that feed the students an endless diet of drill and practice keyed to these low-level tests.  The teachers are feeding these kids a dumbed down curriculum to match the dumbed down tests, a dumbed down curriculum the kids in the wealthier communities do not get.

This paragraph doesn’t have links, probably because it’s not well supported by the existing evidence. Certainly you hear this argument all the time, and I believe it may well be true that schools serving poor kids have worse curricula or more perverse responses to tests (even some of my own work suggests different kinds of instructional responses in different kinds of schools). But even if we grant that this impact is real, the literature on achievement effects certainly does not suggest harm. And the fact that graduation rates are skyrocketing certainly does not suggest harm. If he’s going to claim harm, he has to provide clear, compelling evidence of harm. This ain’t it. And finally here, a small point. I hate when people say schools are “forced” to do anything. States, districts, and schools were not forced to buy bad tests before. They have priorities, and they have prioritized cheap and fast. That’s a choice, not a matter of force.

5) Next, Tucker claims:

Second, the teachers in the schools serving mainly poor and minority kids have figured out that, from an accountability standpoint, it does them no good to focus on the kids who are likely to pass the tests, because the school will get no credit for it. At the same time, it does them no good to focus on the kids who are not likely to pass no matter what the teacher does, because the school will get no credit for that either. As a result, the faculty has a big incentive to focus mainly on the kids who are just below the pass point, leaving the others to twist in the wind.

I am certainly familiar with the literature cited here, and I don’t dispute any of it. Quite the contrary: I acknowledge the conclusion that the students who are targeted by the accountability system see the greatest gains. This has been shown in many well-designed studies, such as here, here, here, and here. But this is an argument about accountability policy design, not about annual testing. It simply speaks to the need for better accountability policies. For instance, suppose we thought the “bubble kids” problem was a bad one that needed solving. We could solve it tomorrow: simply create a system where all that matters is growth. Voila, no bubble kids! Of course there would be tradeoffs to that decision, so probably some mixture is better.

6) Then Tucker moves on to discuss the teaching force:

Not only is it true that annual accountability testing does not improve the performance of poor and minority students, as I just explained, but it is also true that annual accountability testing is making a major contribution to the destruction of the quality of our teaching force.

There’s no evidence for this. I know of not a single study that suggests that there is even a descriptive decrease in the quality of our teaching force in recent years. Certainly not one with a causal design of any kind that implicates annual accountability testing. And there is recent evidence that suggests improvements in the quality of the workforce, at least in certain areas such as New York and Washington.

7) Next, he takes on the distribution of teacher quality:

One of the most important features of these accountability systems is that they operate in such a way as to make teachers of poor and minority students most vulnerable.  And the result of that is that more and more capable teachers are much less likely to teach in schools serving poor and minority students.

It is absolutely true that the lowest quality teachers are disproportionately likely to serve the most disadvantaged students. But I know of not a single piece of evidence that this is caused by (or even made worse by) annual testing and accountability policies. My hunch is that this has always been true, but that’s just a hunch. If Tucker has evidence, he should provide it.

8) The final point is one that hits close to home:

Applications to our schools of education are plummeting and deans of education are reporting that one of the reasons is that high school graduates who have alternatives are not selecting teaching because it looks like a battleground, a battleground created by the heavy-handed accountability systems promoted by the U.S. Department of Education and sustained by annual accountability testing.

As someone employed at a school of education, I can say the first clause here is completely true. And we’re quite worried about it. But again, I know of not a single piece of even descriptive evidence that suggests this is due to annual accountability testing. Annual accountability testing has been around for well over a decade. Why would the impact be happening right now?

I think these are the main arguments in Tucker’s piece, and I have provided evidence or argumentation here that suggests that not one of them is supported by the best academic research that exists today. Perhaps the strongest argument of the eight is the second one, but again I know of no quality research that attributes our relative stagnation on 12th grade NAEP to annual accountability testing. That does not mean Tucker is wrong. But it does mean that he is the one who should bear the burden of providing evidence to support his positions, not Haycock and Edelman. I don’t believe he can produce such evidence, because I don’t believe it exists.


[1] I think it’s almost universally a bad idea to tell civil rights leaders what to do.