Life is a series of tradeoffs. Perhaps nowhere in education is that clearer than in assessment policy.
What brings this to mind are Motoko Rich’s and Catherine Gewertz’s recent articles about scoring Common Core tests. I think both of these articles are good, and they both illustrate some of the challenges of doing what we’re trying to do at scale. But it’s also clear that some anti-test folks are using these very complicated issues as fodder for their agendas, and that’s disappointing (if totally expected). Here are some of the key quotes from Motoko’s article, and the tradeoffs they illustrate.
On Friday, in an unobtrusive office park northeast of downtown here, about 100 temporary employees of the testing giant Pearson worked in diligent silence scoring thousands of short essays written by third- and fifth-grade students from across the country. There was a onetime wedding planner, a retired medical technologist and a former Pearson saleswoman with a master’s degree in marital counseling. To get the job, like other scorers nationwide, they needed a four-year college degree with relevant coursework, but no teaching experience. They earned $12 to $14 an hour, with the possibility of small bonuses if they hit daily quality and volume targets.
Tradeoff: We think we want teachers to be involved in the scoring of these tests (presumably because we believe there is some special expertise that teachers possess). But teachers cost more than $12 to $14 an hour, and we’re in an era when every dollar spent on testing is endlessly scrutinized, so we instead have to use educated people who are not teachers.
At times, the scoring process can evoke the way a restaurant chain monitors the work of its employees and the quality of its products. “From the standpoint of comparing us to a Starbucks or McDonald’s, where you go into those places you know exactly what you’re going to get,” said Bob Sanders, vice president of content and scoring management at Pearson North America, when asked whether such an analogy was apt.
Tradeoff: We have a huge system in this country, and we want results that are comparable across schools. But comparability in a large system requires some degree of standardization, and standardization at that level of scale requires processes that look, well, standardized and corporate.
For exams like the Advanced Placement tests given by the College Board, scorers must be current college professors or high school teachers who have at least three years of experience teaching the subject they are scoring.
Tradeoff: We want to test everyone. This means the volume of scoring is tremendously larger than for the AP exam (about 12 million test takers vs. about 1 million), which again means we may not be able to find enough teachers to do the work.
“You’re asking people still, even with the best of rubrics and evidence and training, to make judgments about complex forms of cognition,” Mr. Pellegrino said. “The more we go towards the kinds of interesting thinking and problems and situations that tend to be more about open-ended answers, the harder it is to get objective agreement in scoring.”
Tradeoff: We want more challenging, open-ended, complex tasks. But scoring those tasks at scale is harder to do reliably.
There are of course other big tradeoffs that aren’t highlighted in these articles. For instance:
- The tradeoff between test cost and transparency–building items is very expensive, so releasing items and having to create new ones every year would add to test costs while enhancing transparency.
- The tradeoff between testing time and the nature of the task–multiple choice items are quicker to complete, but they may not fully tap the kinds of skills we want to measure.
- The tradeoff between testing time and the comprehensiveness of the assessment–shorter tests can probably give us a reasonable estimate of overall math and reading proficiency, but they will not give us the fine-grained, actionable data we might want to inform instructional responses (and they might contribute to “narrowing the curriculum” if they repeatedly sample the same content).
- The tradeoffs of open-response items with fast scoring–multiple choice items, especially on computers, can be scored virtually instantaneously, whereas open-response items take time to score. So faster feedback may butt up against our desire for better items.
- The tradeoffs associated with testing on computers–e.g., using money to purchase computers vs. other things, advantages of adaptive testing vs. needing to teach kids how to take tests on computers.
I will also note that this kind of reporting could, in my mind, be strengthened with more empirical evidence. For instance:
“Even as teachers, we’re still learning what the Common Core state standards are asking,” Ms. Siemens said. “So to take somebody who is not in the field and ask them to assess student progress or success seems a little iffy.”
Are teachers better scorers than non-teachers, or not? That’s an empirical question. I would be reasonably confident that Pearson has a good process in place for determining who the best scorers are from the standpoint of reliability. Some of the best scorers are teachers, and some are not.
Some teachers question whether scorers can grade fairly without knowing whether a student has struggled with learning difficulties or speaks English as a second language.
Is there evidence that the test scoring is biased against students with disabilities or ELLs, or not? That’s also an empirical question. Again I would guess that Pearson has in place a process to weed out construct-irrelevant variance to the maximum extent possible.
Overall, I think it’s great that writers like Motoko and Catherine are tackling these challenging issues. But I hope it’s not lost on readers that, like everything in life, testing requires tradeoffs that are not easily navigated.
A footnote on the assumption that teachers possess special scoring expertise: it’s not obvious to me this is true, though it may well be. Regardless, scoring items would likely be a good professional development opportunity for teachers.