We need a little patience

In the last year I’ve been doing a lot more blogging, and it’s sometimes hard for me to keep track of everything I’ve written. So I’m going to start reposting things here, in order to keep track. This is a repost of something I wrote for Fordham and for C-SAIL last week. So if you read it there, no need to read again!

It’s 2017, which means we’re in year six of the Common Core experiment. The big question that everyone wants the answer to is “Is Common Core working?” Many states seem poised to move in a new direction, especially with a new administration in Washington, and research evidence could play an instrumental role in helping states make the decision of whether to keep the standards, revise them, or replace them altogether. (Of course, it might also be that policymakers’ views on the standards are impervious to evidence.)

To my knowledge, there are two existing studies that try to assess Common Core’s impact on student achievement, both by Tom Loveless. They compare state NAEP gains between Common Core adopting and non-adopting states or compare states based on an index of the quality of their implementation of the standards. Both studies find, in essence, no effects of the standards, and the media have covered these studies using that angle. The C-SAIL project, on which I am co-principal investigator, is also considering a related question (in our case, we are asking about the impact of college- and career-readiness standards in general, including, but not limited to, the Common Core standards).

There are many challenges with doing this kind of research. A few of the most serious are:

  1. The need to use sophisticated quasi-experimental methods, since experimental methods are not available.
  2. The limited array of outcome variables available, since NAEP (which is not perfectly aligned to the Common Core) is really the only assessment that has the national comparability required and many college and career outcomes are difficult to measure.
  3. The fact that the timing of policy implementation is not clear when states varied so much in the timing of related policies like assessment and textbook adoptions.

Thus, it is not obvious when the right time to evaluate the policy will be, or which outcomes to use.

Policymakers want to effect positive change through policy, and they often need to make decisions on a short cycle—after all, they often make promises in their elections, and it behooves them to show evidence that their chosen policies are working in advance of the next round of elections. The consequence is that there is a high demand for rapid evidence about policy effects, and the early evidence often contributes overwhelmingly to shaping the narrative about whether policies are working or not.

Unfortunately, there are more than a handful of examples where the early evidence on a policy turned out to be misleading, or where a policy seemed to have delayed effects. For example, the Gates Foundation’s small school reforms were widely panned as a flop in early reviews relying on student test scores, but a number of later rigorous studies showed (sometimes substantial) positive effects on outcomes such as graduation and college enrollment. It was too late, however—the initiative had already been scrapped by the time the positive evidence started rolling in.

No Child Left Behind acquired quite a negative reputation over its first half dozen years of implementation. Its accountability policies were seen as poorly targeted (they were), and it was labeled as encouraging an array of negative unintended consequences. These views quickly became well established among both researchers and policymakers. And yet, a series of recent studies have shown meaningful effects of the law on student achievement, which has done precisely zero to change public perception.

There are all manner of policies that may fit into this category to a greater or lesser extent. A state capacity building and technical assistance policy implemented in California was shelved after a few years, but evaluations found the policy improved student learning. Several school choice studies have found null or modest effects on test scores only to turn up impacts on longer-term outcomes like graduation. Even School Improvement Grants and other turnaround strategies may qualify in this category—though the recent impact evaluation was neutral, several studies have found positive effects and many have found impacts that grow as the years progress (suggesting that longer-term evaluations may yet show effects).

How does this all relate back to Common Core and other college- and career-readiness standards? There are implications for both researchers and policymakers.

For researchers, these patterns suggest that great care needs to be taken in interpreting and presenting the results of research conducted early in the implementation of Common Core and other policies. This is not to say that researchers should not investigate the early effects of policies, but rather that they should be appropriately cautious in describing what their work means. Early impact studies will virtually never provide the “final answer” as to the effectiveness of any given policy, and researchers should explicitly caution against the interpretation of their work as such.

For policymakers, there are at least two implications. First, when creating new policies, policymakers should think about both short- and long-term outcomes that are desired. Then, they should build into the law ample time before such outcomes can be observed (i.e., ensuring that decisions are not made before the law can have its intended effects). Even if this time is not explicitly built into the policy cycle, policymakers should at least be aware of these issues and adopt a stance of patience toward policy revisions. Second, to the extent that policies build in funds or plans for evaluation, these plans should include both short- and long-term evaluations.

Clearly, these suggestions run counter to prevailing preferences for immediate gratification in policymaking, but they are essential if we are to see sustained improvement in education. At a minimum, this approach might keep us from declaring failure too soon on policies that may well turn out to be successful. Since improvement through policy is almost always a process of incremental progress, failing to learn all the lessons of new policies may hamstring our efforts to develop better policies later. Finally, jumping around from policy to policy likely contributes to reform fatigue among educators, which may even undermine the success of future unrelated policies. In short, regardless of your particular policy preferences, there is good reason to move on from the “shiny object” approach to education policy and focus instead on giving old and seemingly dull objects a chance to demonstrate their worth before throwing them in the policy landfill.

New evidence that textbooks matter

It’s been six months since I’ve written here. My apologies. In the meantime I’ve written a few pieces elsewhere, such as:

  • Here and here on the problems of “percent proficient” as a measure of school performance. The feds seem to have listened to our open letter, as they are allowing states to use performance indices (and perhaps some transformation of scale scores, though there seems to be disagreement on this point) in school accountability.
  • Here and here on public opinion on education policy and an agenda for the incoming administration (admittedly, written when I thought the incoming administration would be somewhat different than the one that’s shaping up).
  • Here describing just how “common” Common Core states’ standards are.
  • Here discussing challenges with state testing and a path forward.

The main project on which I continue to work, however, is the textbook research. We are out with our first working paper (a version of which was just recently accepted for publication in AERA Open), and a corresponding brief through Brookings’ Evidence Speaks series (on which I am now a contributor).

You should check out the brief and the paper, but the short version of the findings is that we once again identify one textbook–Houghton Mifflin California Math–as producing larger achievement gains than the other most commonly adopted textbooks in California during the period 2008-2013. These gains are in the range .05 to .10 standard deviations, and they persist across multiple grades and years (ours is the longest study we are aware of on this topic). The gains may seem modest, but it is important to remember that they accrue to all students in these grades. Thus, for another policy that focuses only on low-achieving students to achieve the same total achievement effect, the impact would have to be much larger. And of course, as we’ve written elsewhere, the marginal cost of choosing this particular textbook over any other is close to zero (though we actually could not find price lists for the books under study, we know this to be true).
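
To make that comparison concrete, here is a back-of-the-envelope sketch in Python. It assumes the total effect is simply the per-student effect times the share of students reached. The 25 percent share and the function name are illustrative assumptions of mine, not figures from the paper.

```python
def required_targeted_effect(universal_effect_sd, share_targeted):
    """Per-student effect (in SD) a targeted policy needs in order to match
    the total effect of a policy that reaches every student.

    Assumes total effect = per-student effect x share of students reached.
    """
    if not 0 < share_targeted <= 1:
        raise ValueError("share_targeted must be in (0, 1]")
    return universal_effect_sd / share_targeted


# Hypothetical comparison: the low end of the textbook effect (0.05 SD for
# all students) versus a policy serving only one quarter of students.
needed = required_targeted_effect(0.05, 0.25)
print(f"Targeted policy would need about {needed:.2f} SD per student served")
```

Under those assumptions, a policy serving a quarter of students would need an effect roughly four times as large per student to match the textbook effect in total.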

We are excited to have the paper out there after years (literally) of work just pulling the data together. I also presented the results in Sacramento and am optimistic that states may start to listen to the steadily growing drumbeat on the importance of collecting and analyzing data on textbook adoptions.



A letter to the U.S. Department of Education (final signatory list)

This is the final version of the letter, which I submitted today.


July 22, 2016


The Honorable John King

Secretary of the Education Department

400 Maryland Avenue, SW

Washington, D.C. 20202


Dear Mr. Secretary:

The Every Student Succeeds Act (ESSA) marks a great opportunity for states to advance accountability systems beyond those from the No Child Left Behind (NCLB) era. The Act (Section 1111(c)(4)(B)(i)(I)) requires states to use an indicator of academic achievement that “measures proficiency on the statewide assessments in reading/language arts and mathematics.” The proposed rulemaking (§ 200.14) would clarify this statutory provision to say that the academic achievement indicator must “equally measure grade-level proficiency on the reading/language arts and mathematics assessments.”

We write this letter to argue that the Department of Education should not mandate the use of proficiency rates as a metric of school performance under ESSA. That is, states should not be limited to measuring academic achievement using performance metrics that focus only on the proportion of students who are grade-level proficient—rather, they should be encouraged, or at a minimum allowed, to use performance metrics that account for student achievement at all levels, provided the state defines what performance level represents grade level proficiency on its reading/language arts and mathematics assessments.

Moving beyond proficiency rates as the sole or primary measure of school performance has many advantages. For example, a narrow focus on proficiency rates incentivizes schools to focus on those students near the proficiency cut score, while an approach that takes into account all levels of performance incentivizes a focus on all students. Furthermore, measuring performance using the full range of achievement provides additional and useful information for parents, practitioners, researchers, and policymakers for the purposes of decisionmaking and accountability, including more accurate information about the differences among schools.

Reporting performance in terms of the percentage above proficient is problematic in several important ways. Percent proficient:

  1. Incentivizes schools to focus only on students around the proficiency cutoff rather than all students in a school (Booher-Jennings, 2005; Neal & Schanzenbach, 2010). This can divert resources from students who are at lower or higher points in the achievement distribution, some of whom may need as much or more support than students just around the proficiency cut score (Schwartz, Hamilton, Stecher, & Steele, 2011). This has been shown to influence which students in a state benefit (i.e., experience gains in their academic achievement) from accountability regulations (Neal & Schanzenbach, 2010).
  2. Encourages teachers to focus on bringing students to a minimum level of proficiency rather than continuing to advance student learning to higher levels of performance beyond proficiency.
  3. Is not a reliable measure of school performance. For example, percent proficient is an inappropriate measure of progress over time because changes in proficiency rates are unstable and measured with error (Ho, 2008; Linn, 2003; Kane & Staiger, 2002). The percent proficient is also dependent upon the state-determined cut score for proficiency on annual assessments (Ho, 2008), which varies from state to state and over time. Percent proficient further depends on details of the testing program that shouldn’t matter, such as the composition of the items on the state test or the type of method used to set performance standards. These problems are compounded in small schools or in subgroups that are small in size.
  4. Is a very poor measure of performance gaps between subgroups, because percent proficient will be affected by how a proficiency cut score on the state assessments is chosen (Ho, 2008; Holland, 2002). Indeed, prior research suggests that using percent proficient can even reverse the sign of changes in achievement gaps over time relative to what a more accurate method would show (Linn, 2007).
  5. Penalizes schools that serve larger proportions of low-achieving students (Kober & Riddle, 2012) as schools are not given credit for improvements in performance other than the move to proficiency from not-proficient.
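
The cut-score dependence described in points 3 and 4 can be illustrated with a minimal simulation-free sketch. Everything in it is hypothetical: two subgroups with normally distributed scores, a half-standard-deviation gap in means, and an identical 0.3 SD gain for both groups. None of these numbers come from the studies cited above. Because both groups gain the same amount, the gap in average scores does not change, yet the gap in percent proficient widens or narrows depending on where the cut score sits.

```python
from statistics import NormalDist


def share_proficient(mean, cut, sd=1.0):
    """Share of a normally distributed group scoring above the cut score."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(cut)


def proficiency_gap(mean_high, mean_low, cut):
    """Gap in percent proficient (as a proportion) between two groups."""
    return share_proficient(mean_high, cut) - share_proficient(mean_low, cut)


def change_in_gap(cut, gain=0.3, mean_high=0.0, mean_low=-0.5):
    """Change in the percent-proficient gap after both groups gain `gain` SD.

    The gap in mean scores is identical before and after by construction.
    """
    before = proficiency_gap(mean_high, mean_low, cut)
    after = proficiency_gap(mean_high + gain, mean_low + gain, cut)
    return after - before


# Hypothetical low and high cut scores, in SD units of the score scale.
for cut in (-1.0, 1.0):
    change = 100 * change_in_gap(cut)
    print(f"cut score {cut:+.1f} SD: gap changes by {change:+.1f} points")
```

With the low cut score the percent-proficient gap appears to narrow, and with the high cut score it appears to widen, even though nothing about the relative standing of the two groups has changed.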

We suggest two practices for measuring achievement that lessen or avoid these problems. Importantly, some of these practices were utilized by states in ESEA Flexibility Waivers and are improvements to NCLB practices (Polikoff, McEachin, Wrabel, & Duque, 2014).

Average Scale Scores

The best approach for measuring student achievement levels for accountability purposes under ESSA is to use average scale scores. Rather than presenting performance as the proportion of students who have met the minimum-proficiency cut score, states could present the average (mean) score of students within the school and the average performance of each subgroup of students. If the Department believes percent proficient is also important for reporting purposes, these values could be reported alongside the average scale scores.

The use of mean scores places the focus on improving the academic achievement of all students within a school and not just those whose performance is around the state proficiency cut score (Center on Education Policy, 2011). Such a practice also increases the amount of variation in school performance measures each year, providing for improved differentiation between schools that may have otherwise similar proficiency rates. In fact, Ho (2008) argues that if a single rating is going to be used for reporting on performance, it should be a measure of the average performance because such measures incorporate the value of every score (student) into the calculation and the average can be used for more advanced analyses. The measurement of gaps between key demographic groups of students, a key goal of the ESSA law, is dramatically improved with the use of average scores rather than the proportion of proficient students (Holland, 2002; Linn, 2007).

Proficiency Indexes

If average scale scores cannot be used, a weaker alternative that is still superior to percent proficient would be to allow states to use proficiency indexes. Schools under this policy would be allocated points based on multiple levels of performance. For example, a state could identify four levels of performance on annual assessments: Well Below Proficient, Below Proficient, Proficient, and Advanced Proficient. Schools receive no credit for students Well Below Proficient, partial credit for students who are Below Proficient, full credit for students reaching Proficiency, and additional credit for students reaching Advanced Proficiency. Here we present an example using School A and School B.

Proficiency Index Example

                         School A                                           School B
Proficiency Category     Points Per Student  # of Students  Index Points    Points Per Student  # of Students  Index Points
Well Below Proficient    0.0                 27             0.0             0.0                 18             0.0
Below Proficient         0.5                 18             9.0             0.5                 27             13.5
Proficient               1.0                 33             33.0            1.0                 26             26.0
Advanced Proficient      1.5                 22             33.0            1.5                 29             43.5
Total                                        100            75.0                                100            83.0

School A: NCLB Proficiency Rate 55%; ESSA Proficiency Index 75
School B: NCLB Proficiency Rate 55%; ESSA Proficiency Index 83

Under NCLB proficiency rate regulations, both School A and School B would have received a 55% proficiency rate score. Using a Proficiency Index, the performance of these schools would no longer be identical. A state would be able to compare the two schools while simultaneously identifying annual meaningful differentiation in the performance of School A from that of School B. The hypothetical case presented here is not the only way a proficiency index can be used. Massachusetts is one example of a state that has used a proficiency index for the purposes of identifying low-performing schools and gaps between subgroups of students (see: ESEA Flexibility Request: Massachusetts, page 32). These indexes are understandable for practitioners, family members, and administrators while also providing additional information regarding the performance of students who are not grade-level proficient.
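
The arithmetic behind the School A and School B example can be sketched in a few lines of Python. The point values and student counts are those in the example above. Expressing the index per 100 students is an assumption made here so that schools of different sizes can be compared, and it coincides with the example because each school has exactly 100 students. The function and variable names are illustrative only.

```python
# Points awarded per student at each performance level (from the example).
POINTS = {
    "Well Below Proficient": 0.0,
    "Below Proficient": 0.5,
    "Proficient": 1.0,
    "Advanced Proficient": 1.5,
}
# Levels that count as proficient under a simple proficiency rate.
PROFICIENT_LEVELS = {"Proficient", "Advanced Proficient"}

# Number of students at each level in the two hypothetical schools.
SCHOOL_A = {"Well Below Proficient": 27, "Below Proficient": 18,
            "Proficient": 33, "Advanced Proficient": 22}
SCHOOL_B = {"Well Below Proficient": 18, "Below Proficient": 27,
            "Proficient": 26, "Advanced Proficient": 29}


def proficiency_rate(counts):
    """Percent of students at or above the proficient level."""
    total = sum(counts.values())
    proficient = sum(n for level, n in counts.items()
                     if level in PROFICIENT_LEVELS)
    return 100.0 * proficient / total


def proficiency_index(counts):
    """Index points per 100 students, crediting every performance level."""
    total = sum(counts.values())
    points = sum(POINTS[level] * n for level, n in counts.items())
    return 100.0 * points / total


for name, counts in (("School A", SCHOOL_A), ("School B", SCHOOL_B)):
    print(f"{name}: rate = {proficiency_rate(counts):.0f}%, "
          f"index = {proficiency_index(counts):.1f}")
```

Both schools come out at a 55% proficiency rate, while the index separates them at 75.0 and 83.0, matching the example.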

The benefit of using such an index, relative to using the proportion of proficient students in a school, is that it incentivizes a focus on all students, not just those around an assessment’s proficiency cut score (Linn, Baker, & Betebenner, 2002). Moreover, schools with large proportions of students well below the proficiency cut score are given credit for moving students to higher levels of performance even if still below the cut score (Linn, 2003). The use of a proficiency index or providing schools credit for students at different points in the achievement distribution improves the construct validity of the accountability measures over the NCLB proficiency rate measures (Polikoff et al., 2014). In other words, the inferences made about schools (e.g., low-performing or bottom 5%) using the proposed measures are more appropriate than those made using proficiency rates alone.

What We Recommend

Given the findings cited above, we believe the Department of Education should revise its regulations to one of two positions:

  • Explicitly endorsing or encouraging states to use one of the two above-mentioned approaches as an alternative to proficiency rates as the primary measure of school performance. Average scale scores are the superior method.
  • Failing that, clarifying that the law is neutral about the use of proficiency rates versus one of the two above-mentioned alternatives to proficiency rates as the primary measure of school performance.

With the preponderance of evidence showing that schools and teachers respond to incentives embedded in accountability systems, we believe the first option is the best choice. This option leaves states the authority to determine school performance how they see fit but encourages them to incorporate what we have learned through research about the most accurate and appropriate way to measure school performance levels.

Our Recommendation is Consistent with ESSA

Section 1111(c)(4)(A) of ESEA, as amended by ESSA, requires each state to establish long-term goals:

“(i) for all students and separately for each sub- group of students in the State—

(I) for, at a minimum, improved—

(aa) academic achievement, as measured by proficiency on the annual assessments required under subsection (b)(2)(B)(v)(I);”

And Section 1111(c)(4)(B) of ESEA requires the State accountability system to have indicators that are used to differentiate all public schools in the State, including—(i) “academic achievement—(I) as measured by proficiency on the annual assessments required [under other provisions of ESSA].”

Our suggested approach is supportable under these provisions based on the following analysis. The above-quoted provisions in the law that mandate long-term goals and indicators of student achievement based on proficiency on annual assessments do not prescribe how a state specifically uses the concept of proficient performance on the state assessments. The statute does not prescribe that “proficiency” be interpreted to compel differentiation of schools based exclusively on “proficiency rates.” Proficiency is commonly taken to mean “knowledge” or “skill” (Merriam-Webster defines it as “advancement in knowledge or skill” or “the quality or state of being proficient”, where “proficient” is defined as “well advanced in an art, occupation, or branch of knowledge”). Under either of these definitions, an aggregate performance measure such as the two options described above would clearly qualify as involving a measure of proficiency. Both of the above-mentioned options provide more information about the average proficiency level of a school than an aggregate proficiency rate. Moreover, they address far more effectively than proficiency rates the core purposes of ESSA, including incentivizing more effective efforts to educate all children and providing broad discretion to states in designing their accountability systems.

We would be happy to provide more information on these recommendations at your pleasure.


Morgan Polikoff, Ph.D., Associate Professor of Education, USC Rossier School of Education


Educational Researchers and Experts

Alice Huguet, Ph.D., Postdoctoral Fellow, School of Education and Social Policy, Northwestern University

Andrew Ho, Ph.D., Professor of Education, Harvard Graduate School of Education

Andrew Saultz, Ph.D., Assistant Professor, Miami University (Ohio)

Andrew Schaper, Ph.D., Senior Associate, Basis Policy Research

Anna Egalite, Ph.D., Assistant Professor of Education, North Carolina State University

Arie van der Ploeg, Ph.D., retired Principal Researcher, American Institutes for Research

Cara Jackson, Ph.D., Assistant Director of Research & Evaluation, Urban Teachers

Christopher A. Candelaria, Ph.D., Assistant Professor of Public Policy and Education, Vanderbilt University

Cory Koedel, Ph.D., Associate Professor of Economics and Public Policy, University of Missouri

Dan Goldhaber, Ph.D., Director, Center for Education Data & Research, University of Washington Bothell

Danielle Dennis, Ph.D., Associate Professor of Literacy Studies, University of South Florida

Daniel Koretz, Ph.D., Henry Lee Shattuck Professor of Education, Harvard Graduate School of Education

David Hersh, Ph.D. Candidate, Rutgers University Bloustein School of Planning and Public Policy

David M. Rochman, Research and Program Analyst, Moose Analytics

Edward J. Fuller, Ph.D., Associate Professor of Education Policy, The Pennsylvania State University

Eric A. Houck, Associate Professor of Educational Leadership and Policy, University of North Carolina at Chapel Hill

Eric Parsons, Ph.D., Assistant Research Professor, University of Missouri

Erin O’Hara, former Assistant Commissioner for Data & Research, Tennessee Department of Education

Ethan Hutt, Ph.D., Assistant Professor of Education, University of Maryland College Park

Eva Baker, Ed.D., Distinguished Research Professor, UCLA Graduate School of Education and Information Studies, Director, Center for Research on Evaluation, Standards, and Student Testing, Past President, American Educational Research Association

Greg Palardy, Ph.D., Associate Professor, University of California, Riverside

Heather J. Hough, Ph.D., Executive Director, CORE-PACE Research Partnership

Jason A. Grissom, Ph.D., Associate Professor of Public Policy and Education, Vanderbilt University

Jeffrey Nellhaus, Ed.M., Chief of Assessment, Parcc Inc., former Deputy Commissioner, Massachusetts Department of Elementary and Secondary Education

Jeffrey W. Snyder, Ph.D., Assistant Professor, Cleveland State University

Jennifer Vranek, Founding Partner, Education First

John A. Epstein, Ed.D., Education Associate Mathematics, Delaware Department of Education

John Q. Easton, Ph.D., Vice President, Programs, Spencer Foundation, former Director, Institute of Education Sciences

John Ritzler, Ph.D., Executive Director, Research & Evaluation Services, South Bend Community School Corporation

Jonathan Plucker, Ph.D., Julian C. Stanley Professor of Talent Development, Johns Hopkins University

Joshua Cowen, Ph.D., Associate Professor of Education Policy, Michigan State University

Katherine Glenn-Applegate, Ph.D., Assistant Professor of Education, Ohio Wesleyan University

Linda Darling-Hammond, Ed.D., President, Learning Policy Institute, Charles E. Ducommun Professor of Education Emeritus, Stanford University, Past President, American Educational Research Association

Lindsay Bell Weixler, Ph.D., Senior Research Fellow, Education Research Alliance for New Orleans

Madeline Mavrogordato, Ph.D., Assistant Professor, K-12 Educational Administration, Michigan State University

Martin R. West, Ph.D., Associate Professor, Harvard Graduate School of Education

Matt Chingos, Ph.D., Senior Fellow, Urban Institute

Matthew Di Carlo, Ph.D., Senior Fellow, Albert Shanker Institute

Matthew Duque, Ph.D., Data Strategist, Baltimore County Public Schools

Matthew A. Kraft, Ed.D., Assistant Professor of Education and Economics, Brown University

Michael H. Little, Royster Fellow and Doctoral Student, University of North Carolina at Chapel Hill

Michael Hansen, Ph.D., Senior Fellow and Director, Brown Center on Education Policy, Brookings Institution

Michael J. Petrilli, President, Thomas B. Fordham Institute

Nathan Trenholm, Director of Accountability and Research, Clark County (NV) School District

  1. Tiên Lê, Doctoral Fellow, USC Rossier School of Education

Raegen T. Miller, Ed.D., Research Fellow, Georgetown University

Russell Brown, Ph.D., Chief Accountability Officer, Baltimore County Public Schools

Russell Clement, Ph.D., Research Specialist, Broward County Public Schools

Sarah Reckhow, Ph.D., Assistant Professor of Political Science, Michigan State University

Sean P. “Jack” Buckley, Ph.D., Senior Vice President, Research, The College Board, former Commissioner of the National Center for Education Statistics

Sherman Dorn, Ph.D., Professor, Mary Lou Fulton Teachers College, Arizona State University

Stephani L. Wrabel, Ph.D., USC Rossier School of Education

Thomas Toch, Georgetown University

Tom Loveless, Ph.D., Non-resident Senior Fellow, Brookings Institution


K-12 Educators

Alexander McNaughton, History Teacher, YES Prep Charter School, Houston, TX

Andrea Wood Reynolds, District Testing Coordinator, Northside ISD, TX

Angela Atkinson Duina, Ed.D., Title I School Improvement Coordinator, Portland Public Schools, ME

Ashley Baquero, J.D., English/Language Arts Teacher, Durham, NC

Brett Coffman, Ed.S., Assistant Principal, Liberty High School, MO

Callie Lowenstein, Bilingual Teacher, Washington Heights Expeditionary Learning School, NY

Candace Burckhardt, Special Education Coordinator, Indigo Education

Daniel Gohl, Chief Academic Officer, Broward County Public Schools, FL

Danielle Blue, M.Ed., Director of Preschool Programming, South Kingstown Parks and Recreation, RI

Jacquline D. Price, M.Ed., County School Superintendent, La Paz County, AZ

Jennifer Taubenheim, Elementary Special Education Teacher, Idaho Falls, ID

Jillian Haring, Staff Assistant, Broward County Public Schools, FL

Juan Gomez, Middle School Math Instructional Coach, Carmel High School, Carmel, CA

Mahnaz R. Charania, Ph.D., GA

Mary F. Johnson, MLS, Ed.D., Retired school librarian

MaryEllen Falvey, M.Ed, NBCT, Office of Academics, Broward County Public Schools, FL

Meredith Heikes, 6th grade STEM teacher, Quincy School District, WA

Mike Musialowski, M.S., Math/Science Teacher, Taos, NM

Misty Pier, Special Education Teacher, Eagle Mountain Saginaw ISD, TX

Nell L. Forgacs, Ed.M., Educator, MA

Oscar Garcia, Social Studies Teacher, El Paso Academy East, TX

Patricia K. Hadley, Elementary School Teacher, Retired, Twin Falls, ID

Samantha Arce, Elementary Teacher, Phoenix, AZ

Theodore A. Hadley, High School/Middle School Teacher, Retired, Twin Falls, ID

Tim Larrabee, M.Ed., MAT, Upper Elementary Teacher, American International School of Utah

Troy Frystak, 5/6 Teacher, Springwater Environmental Sciences School, OR


Other Interested Parties

Arnold F. Shober, Ph.D., Associate Professor of Government, Lawrence University

Celine Coggins, Ph.D., Founder and CEO, Teach Plus

David Weingartner, Co-Chair Minneapolis Public Schools 2020 Advisory Committee

Joanne Weiss, former chief of staff to U.S. Secretary of Education Arne Duncan

Justin Reich, EdD, Executive Director, Teaching Systems Lab, MIT

Karl Rectanus, CEO, Lea(R)n, Inc.

Kenneth R. DeNisco, Ph.D., Associate Professor, Physics & Astronomy, Harrisburg Area Community College

Kimberly L. Glass, Ph.D., Pediatric Neuropsychologist, The Stixrud Group

Mark Otter, COO, VIF International Education

Patrick Dunn, Ph.D., Biomedical Research Curator, Northrop Grumman TS

Robert Rothman, Education Writer, Washington, DC

Steven Gorman, Ph.D., Program Manager, Academy for Lifelong Learning, LSC-Montgomery

Torrance Robinson, CEO, trovvit


Booher-Jennings, J. (2005). Below the bubble: “Educational triage” and the Texas accountability system. American Educational Research Journal, 42(1), 231–268.

Center on Education Policy. (2011, May 3). An open letter from the Center on Education Policy to the SMARTER Balanced Assessment Consortium and the Partnership for Assessment of Readiness for College and Career. Retrieved from http://cep-dc.org/displayDocument.cfm?DocumentID=359

Ho, A. D. (2008). The problem with “proficiency”: Limitations of statistics and policy under No Child Left Behind. Educational Researcher, 37(6), 351–360.

Holland, P. W. (2002). Two measures of change in the gaps between the CDFs of test-score distributions. Journal of Educational and Behavioral Statistics, 27(1), 3–17.

Kober, N., & Riddle, W. (2012). Accountability issues to watch under NCLB waivers. Washington, DC: Center on Education Policy.

Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32(7), 3–13.

Linn, R. L. (2007). Educational accountability systems. Paper presented at the CRESST Conference: The Future of Test-Based Educational Accountability.

Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31(6), 3–16.

Neal, D., & Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and test-based accountability. Review of Economics and Statistics, 92, 263–283.

Ng, H. L., & Koretz, D. (2015). Sensitivity of school-performance ratings to scaling decisions. Applied Measurement in Education, 28(4), 330–349.

Polikoff, M. S., McEachin, A., Wrabel, S. L., & Duque, M. (2014). The waive of the future? School accountability in the waiver era. Educational Researcher, 43(1), 45–54. http://doi.org/10.3102/0013189X13517137

Schwartz, H. L., Hamilton, L. S., Stecher, B. M., & Steele, J. L. (2011). Expanded measures of school performance. Santa Monica, CA: The RAND Corporation.






More evidence that the test matters

Well, it’s been two months since my last post. In those two months, a lot has happened. I’ve continued digging into the textbook adoption data (this was covered on an EdWeek blog and I also wrote about it for Brookings). Fordham also released their study of the content and quality of next-generation assessments, on which I was a co-author (see my parting thoughts here). Finally, just last week I was granted tenure at USC. So I’ve been busy and haven’t written here as much as I should.

Today I’m writing about a new article of mine that’s just coming out in Educational Assessment (if you want a copy, shoot me an email). This is the last article I’ll write using the Measures of Effective Teaching data (I previously wrote here and here using these data). This paper asks a very simple question: looking across the states in the MET sample, is there evidence that the correlations of observational and student survey measures with teacher value-added vary systematically? In other words, are the tests used in these states differentially sensitive to these measures of instructional quality?

This is an important question for many reasons. Most obviously, we are using both value-added scores and instructional quality measures (observations, surveys) for an increasingly wide array of decisions, both high- and low-stakes. For any kind of decision we want to make, we want to be able to confidently say that the assessments used for value-added are sensitive to the kinds of instructional practices we think of as being “high quality.” Otherwise, for instance, it is hard to imagine how teachers could be expected to improve their value-added through professional development opportunities (i.e., if no observed instructional measures predict value-added, how can we expect teachers to improve their value-added?). The work is also important because, to the extent that we see a great deal of variation across states/tests in sensitivity to instruction, it may necessitate greater attention to the assessments themselves in both research and policy [1]. As I argue in the paper, the MET data are very well suited to this kind of analysis, because there were no stakes (and thus limited potential for gaming).

The methods for investigating the question are very straightforward–basically I just correlate or regress value-added estimates from the MET study on teacher observation scores and student survey scores separately by state. Where I find limited or no evidence of relationships, I dig in further by doing things like pulling out outliers, exploring nonlinear relationships, and determining relationships at the subscale or grade level.
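For readers who want to see the shape of that analysis, here is a minimal sketch of the "correlate separately by state" step. It is not the code used in the paper: the field names (`state`, `observation`, `value_added`), the helper functions, and the tiny data set are all made up for illustration, and the real MET data cannot be shared here.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlations_by_state(teachers, measure):
    """Correlate one instructional measure with value-added within each state."""
    grouped = defaultdict(lambda: ([], []))
    for teacher in teachers:
        xs, ys = grouped[teacher["state"]]
        xs.append(teacher[measure])
        ys.append(teacher["value_added"])
    return {state: pearson(xs, ys) for state, (xs, ys) in grouped.items()}


# Hypothetical teacher records; the values are invented for illustration only.
teachers = [
    {"state": "A", "observation": 1.0, "value_added": 0.1},
    {"state": "A", "observation": 2.0, "value_added": 0.2},
    {"state": "A", "observation": 3.0, "value_added": 0.3},
    {"state": "B", "observation": 1.0, "value_added": 0.3},
    {"state": "B", "observation": 2.0, "value_added": 0.2},
    {"state": "B", "observation": 3.0, "value_added": 0.1},
]

by_state = correlations_by_state(teachers, "observation")
for state, r in sorted(by_state.items()):
    print(state, round(r, 2))
```

In this toy example the two states show opposite relationships between the observation score and value-added, which is the kind of state-to-state difference the analysis is designed to surface (real differences are, of course, far less stark).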

What I find, and how that should be interpreted, probably depends on where you sit. I do find at least some correlations of value-added with observations and student surveys in each state and subject. However, there is a good deal of state-to-state variation. For instance, in some states, student surveys correlate with value-added as high as .28 [2], while in other states those correlations are negative (though not significantly different from zero).

Analyzing results at the subscale level–where observational and survey scores are probably most likely to be useful–does not help. Perhaps because subscales are much less reliable than total scores, there are very few statistically significant correlations of subscales with VAM scores, and these too differ by state. If this pattern were to hold in new teacher evaluation systems being implemented in the states, it would raise perplexing questions about what kinds of instruction these value-added scores were sensitive to.

Perhaps the worst offender in my data is state 4 in English language arts (I cannot name states due to data restrictions). For this state, there are no total score correlations of student surveys or any of the observational measures with teacher value-added. There is one statistically significant correlation at a single grade level, and there is also one statistically significant correlation for a single subscale on one observational instrument. But otherwise, the state ELA tests in this state seem to be totally insensitive to instructional quality as measured by the Framework for Teaching, the CLASS, and the ELA-specific PLATO (not to mention the Tripod student survey). Certainly it’s possible these tests could be sensitive to some other measures not included in MET, but it’s not obvious to me what those would be (nor is it obvious that new state systems will be implemented as carefully as MET was).

I conclude with extended implications for research and practice. I think this kind of work raises a number of questions, such as:

  1. What is it about the content of these tests that makes some sensitive and others not?
  2. What kind of instruction do we want our tests to be sensitive to?
  3. How sensitive is “sensitive enough?” That is, how big a correlation do we want or need between value-added and instructional measures?
  4. If we want to provide useful feedback to teachers, we need reliable subscores on observational measures. How can we achieve that?

I enjoyed writing this article, and I believe it may well be the paper that took me the longest from beginning to submission. I hope you find it useful and that it raises additional questions about teacher evaluation moving forward. And I welcome your reactions (though I’m done with MET data, so if you want more analysis, I’m not your man)!


[1] The oversimplified but not-too-far-off summary of most value-added research is that it is almost completely agnostic to the test that’s used to calculate the VAM.

[2] I did not correct the correlations for measurement error, in contrast to the main MET reports.
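For reference, the textbook (Spearman) correction for attenuation divides an observed correlation by the square root of the product of the two measures’ reliabilities. This is the general formula, shown only to indicate what such a correction does; it is not necessarily the exact procedure used in the MET reports.

```latex
r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
```

Here $r_{xy}$ is the observed correlation between the two measures, and $r_{xx}$ and $r_{yy}$ are their reliabilities. Because reliabilities are below 1, uncorrected correlations are smaller in magnitude than their corrected counterparts.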

An awful lot of districts don’t know what textbooks are used in their schools

That’s one of many takeaways of my textbook research so far. I guess to many people this is no surprise, but it seems crazy to me. Knowledge of what is going on inside schools strikes me as the most basic function of the district office. And yet I would estimate around 10% of the districts that have responded to my FOIA requests have said they have no documents listing the textbooks in use, and probably another 30-50% clearly have to invent such a document to satisfy my request [1]. Instead, I get a lot of letters like this:

Thank you for using the [district name] FOIA Center.

The FOIA office has been advised by the appropriate departments that the records you seek are not kept in the normal course of business. That is, a full and complete list of all mathematics and science textbooks currently in use by grade and the year the textbook was first used. As written, this request is categorical and unduly burdensome in nature and would require extensive resources to both search for information, which would most likely require a manual school by school search, and analysis to determine the other data points you are seeking. For these reasons, [district] is denying this request pursuant to [state statute] and invites you to narrow your request to manageable proportions. If [district] does not receive a revised request from you within five (5) business days of this response, this request will be closed.

Apparently to many folks this kind of arrangement is just fine–school sites should be able to decide all this stuff themselves. I can buy the argument that schools should have autonomy over curriculum materials (though I doubt that’s very efficient or good for kids), but even if you believe that’s the case, shouldn’t the district at least track how their money is being spent?

This is one of the research questions that’s emerged over time as I’ve gone through this textbook project, and it’s something I’ll investigate just as soon as I finish this round of FOIAs. My hypothesis? I suspect Ilana Horn is right about the consequences of this kind of non-leadership by districts:

I hope we’re wrong, but I doubt it.


[1] Districts don’t actually have to do this under the letter of FOIA law. So I very much appreciate the efforts.

This study is based upon work supported by the National Science Foundation under Grant No. 1445654 and the Smith Richardson Foundation. Any opinions, findings, and conclusions or recommendations expressed in this study are those of the author(s) and do not necessarily reflect the views of the funders.

Recruiting teachers!


I’m looking to recruit a few teachers (9, specifically) to participate in a study to test survey measures of teachers’ instruction for use in a large national study of standards implementation.
Teachers who participate in the work will be asked to do three things:
  1. Complete a bi-weekly (every other week) log survey describing their instruction in either mathematics or ELA over the course of the spring semester in a target class. The first log will be in either the last week of January or the first week of February. All logs will be online. We expect the first log will take 45 minutes to an hour, but that teachers will get more efficient at completing the logs as they go through the study.
  2. Complete a year-end online survey asking similar questions to the bi-weekly log. This will be done at the end of the school year, likely in June.
  3. Turn in 2 weeks’ worth of (blank) assignments and assessments that are given to students in the target class at any point during the spring semester.
For participating in this activity, we will provide each teacher with a $325 Amazon.com e-gift card. Please note that all work should be done outside teachers’ regular work hours.
The only eligibility requirements are:
  1. Teach a mathematics or ELA class in grades K-12.
  2. Work in a public (including charter) or private school that does not require separate research approval.
If you are interested in participating, please email me at my last name at usc.edu. Please share widely!

The Reports of Common Core’s Death Have Been Greatly Exaggerated

This evening, I happened upon an article from the Associated Press, noting that West Virginia’s State Board of Education had repealed Common Core in the state. (Note: Common Core had already been renamed the K-12 Next Generation Standards in the state.) The new standards, called the West Virginia College- and Career-Readiness Standards, are available here: (ELA and mathematics). Another article presents the major changes as follows (with my snark in parentheses):

· Simplify the presentation of standards for teachers and parents (I guess sequential numbering is more simplified…?)

· Increase prevalence of problem-solving skills with a connection to college, careers and life-needed skills

· Align standards for more grade level appropriateness for all standards at all grade levels (No clue what this refers to… maybe the insertion of “with prompting and support” in a few K ELA standards?)

· Include clarifying examples within each standard to make them more relevant to learning (Most already had examples. A few standards do now have additional “instructional notes”)

· Include an introduction of foundational skills in ELA and mathematics to ensure mastery of content in future grade levels

· Include handwriting in grades K-4, and explicit mention of cursive writing instruction in grades 2-3 (Handwriting is great! Mandatory cursive remains an absurd policy.)

· Include an explicit mention for students to learn multiplication (times) tables by the end of grade 3

· Add standards specific to Calculus with the expectation of Calculus being available to all students (Yeah, no one is taking calculus in high school since Common Core.)

Increased emphasis on handwriting is indeed an addition, as are cursive and calculus. These are changes that other states have made too. Adding multiplication tables is not an addition to Common Core (I don’t know where this myth came from that Common Core doesn’t require multiplication facts by the end of 3rd grade: “By the end of Grade 3, know from memory all products of two one-digit numbers.”).

But if you actually go and read the new standards, they are almost verbatim the same as Common Core in most cases. In 3rd grade math, aside from the addition of “speed” to the requirement for fluency with times tables, the standards are Common Core (with two exceptions that I saw: West Virginia’s new standards sometimes add clarifying instructional notes, and West Virginia’s new standards replace the words “for example” with “e.g.”). Oh, and the standards have been renumbered, thus making crosswalks with textbooks or websites more complicated.

I know, as someone who actually likes Common Core and wants it to stick around, that I probably shouldn’t even be writing about this. I should probably sit quietly while the state attempts to pull a fast one on its populace. But this is *so* dumb that I felt obliged to say something.

It’s *so* dumb to waste even one cent of taxpayer money on Common Core commissions in state after state, each resulting in virtually identical standards to the much-loathed Common Core.

It’s *so* dumb to keep the same standards but renumber them, making things needlessly more complicated for teachers and providing absolutely no benefit.

It’s *so* dumb to rename the standards twice but leave the content unchanged, all in an attempt to fool the hysterical masses.

It’s *so* dumb to report on these kinds of changes as if they are “repeals” when they are nothing of the sort.

Rather than doing dumb things like these, here are some suggestions for how these situations might be better handled (admittedly these are probably naïve, because I’m blessed not to have to deal with crazy people for my livelihood):

  • If your citizens believe nonsensical things about Common Core that aren’t true, you should correct their misunderstandings. You should not feed those nonsensical beliefs for political gain [1].
  • If you think the standards are good enough to keep almost verbatim, then defend the standards rather than running from them. 
  • If you don’t think the standards are good enough to keep, then don’t keep them! Get smart people together and do a legitimate rewrite.
  • If you leave the standards open for public comment for months and you get virtually no comments based on any discernible evidence, the standards are probably pretty good.
  • If your citizens are so gullible that they will fall for such transparently obvious ploys, you’ve got problems with the gullibility of your citizenry (which might be mitigated with better standards and instruction).

So, West Virginia’s kids will still be learning Common Core standards come 2016. They’ll also be learning cursive (and a few of them will be learning calculus (which they would have anyway, because obviously)). And what the citizens of the state will be learning–or would be if they paid any attention to what’s happening–is that their government would rather lie to them for the sake of appeasement than defend its policies as what’s best for the state. For many reasons, that’s the wrong kind of lesson to be teaching.


[1] The first part of this sentence applies mostly to the right. The second part applies to both extremes of the political spectrum.