Should California’s New Accountability Model Set the Bar for Other States?

This is a repost of something I published previously on the C-SAIL blog and at FutureEd.

California has released a pilot version of its long-awaited school and district performance dashboard under the federal Every Student Succeeds Act. The dashboard takes a dramatically different approach from prior accountability systems, signaling a sharp break with both the No Child Left Behind era and California’s past approach.

Not surprisingly, given the contentiousness of measuring school performance, it has drawn criticism (too many measures, a lack of clear goals and targets, the possibility for schools to receive high scores even with underperforming student groups) and praise (a fairer and more accurate summary of school performance, a reduced reliance on test scores).

I’m not exactly a neutral observer. Over the past year and a half, I played a role in the development of the dashboard as part of the state superintendent’s Accountability and Continuous Improvement Task Force, an advisory group that put forward many of the features in the new dashboard. In my view, both the dashboard’s supporters and its opponents are right.

The dashboard is clearly an intentional response to previous accountability systems’ perceived shortcomings in at least four ways:

  • California officials felt state accountability systems focused excessively on test scores under NCLB, to the neglect of other measures of school performance. In response, the new dashboard includes a wider array of non-test measures, such as chronic absenteeism and suspension rates.
  • There was a widespread, well-justified concern that prior accountability measures based primarily on achievement levels (proficiency rates) unfairly penalized schools serving more disadvantaged students and failed to reward schools for strong test score growth. (See a previous post for more on this.) In response, the new dashboard includes both achievement status and growth in its performance measures. And the state uses a more nuanced measure of status rather than merely the percent of students who are proficient.
  • California’s previous metric, the Academic Performance Index, boiled down a school’s performance to a single number on a 200-to-1000 scale. Officials creating the state’s new system believed this to be an unhelpful way to think about a school’s performance. In response, the new system offers dozens of scores but no summative rating.
  • There was near unanimity among the task force members (myself excluded), the State Board of Education, and the California Department of Education that NCLB-era accountability systems were excessively punitive, and that the focus should instead be on “continuous improvement,” rather than “test-and-punish.” As a result, the new California system is nearly silent on actual consequences for schools that don’t meet expectations.

For my money, the pilot dashboard has several valuable features. The most important of these is the focus on multiple measures of school performance. Test scores are important and should play a central role, but schools do much more than teach kids content, and we should start designing our measurement systems to be more in line with what we want schools to be doing. The pilot also rightly places increased emphasis on how much students learn over the course of a school year, regardless of where they start the year on the achievement spectrum. Finally, I appreciate that the state is laying out a theory of action for how California will improve its schools and aligning the various components of its policy systems with this theory.

Still, I have concerns about some of the choices made in the creation of the dashboard.

Most importantly, consequential accountability was left out of the task force conversation entirely. We were essentially silent on the important question of what to do when schools fail their students.

And while consequences for underperforming schools were a topic of discussion at the State Board of Education—and thus I am confident that the state will comply with federal and state law about identifying and intervening in the lowest-performing schools—I am skeptical that the state will truly hold persistently underperforming schools accountable in a meaningful way (e.g., through closure, staffing changes, charter conversion, or other consequences other than “more support”). The new dashboard does not even allow stakeholders to directly compare the performance of schools, diminishing any potential public accountability.

It was a poor decision to roll out a pilot system that makes it essentially impossible to compare schools. Parents want these tools specifically for their comparative purposes, so to not even allow this functionality is a mistake. Some organizations, such as EdSource, have stepped in to fill this gap, but the state should have led this from the start.

And while the state does have a tool that allows for some comparisons within school districts, it is clunky and cumbersome compared to information systems in other states and districts available today. The most appropriate comparison tool might be a sort of similar-schools index that compares each school to other schools with similar demographics (the state used to have this). I understand the state has plans to address this issue when it revises the system; making these comparison tools clear and usable is essential.

Also, while I understand the rationale for not offering a single summative score for a school, I think that some level of aggregation would improve on what’s available now. For example, overall scores for growth and performance level might be useful, in addition to an overall attendance/culture/climate score. The correct number of scores may not be one (though there is some research suggesting that is a more effective approach), but it is unlikely to be dozens.

Finally, the website (which, again, is a pilot) is simply missing the supporting documents needed for a parent to meaningfully engage with and make sense of the data. There is a short video and PDF guide, but these are not adequate. There is also a lack of quality translation (Google Translate is available, but the state should invest in high quality translations given the diversity of California’s citizens). Presumably the documentation will be improved for the final version.

These lessons focus primarily on the transparency of the systems, but this is just one of several principles that states should attend to (which I have offered previously): Accountability systems should actually measure school effectiveness, not just test scores. They should be transparent, allowing stakeholders to make sense of the metrics. And they should be fair, not penalizing schools for factors outside their control. As states, including California, work to create new accountability systems under ESSA, they should use these principles to guide their decisions.

It is critically important to give states some leeway as they make decisions about accountability under ESSA, to allow them to develop their own theory of action and to innovate within the confines of what’s allowed by law. I am pleased that California has put forward a clear theory of action and is employing a wide array of measures to gauge school effectiveness. However, when the dashboard tells us that a school is persistently failing to improve outcomes for its children, I hope the state is serious about addressing that failure. Otherwise, I am skeptical that the dashboard will meaningfully change California’s dismal track record of educational outcomes for its children.

We need a little patience

In the last year I’ve been doing a lot more blogging, and it’s sometimes hard for me to keep track of everything I’ve written. So I’m going to start reposting things here, in order to keep track. This is a repost of something I wrote for Fordham and for C-SAIL last week. So if you read it there, no need to read again!

It’s 2017, which means we’re in year six of the Common Core experiment. The big question that everyone wants the answer to is “Is Common Core working?” Many states seem poised to move in a new direction, especially with a new administration in Washington, and research evidence could play an instrumental role in helping states make the decision of whether to keep the standards, revise them, or replace them altogether. (Of course, it might also be that policymakers’ views on the standards are impervious to evidence.)

To my knowledge, there are two existing studies that try to assess Common Core’s impact on student achievement, both by Tom Loveless. They compare state NAEP gains between Common Core adopting and non-adopting states or compare states based on an index of the quality of their implementation of the standards. Both studies find, in essence, no effects of the standards, and the media have covered these studies using that angle. The C-SAIL project, on which I am co-principal investigator, is also considering a related question (in our case, we are asking about the impact of college- and career-readiness standards in general, including, but not limited to, the Common Core standards).

There are many challenges with doing this kind of research. A few of the most serious are:

  1. The need to use sophisticated quasi-experimental methods, since experimental methods are not available.
  2. The limited array of outcome variables available, since NAEP (which is not perfectly aligned to the Common Core) is really the only assessment that has the national comparability required and many college and career outcomes are difficult to measure.
  3. The fact that the timing of policy implementation is not clear when states varied so much in the timing of related policies like assessment and textbook adoptions.

Thus, it is not obvious when the right time to evaluate the policy will be, or which outcomes to use.

Policymakers want to effect positive change through policy, and they often need to make decisions on a short cycle—after all, they often make promises in their elections, and it behooves them to show evidence that their chosen policies are working in advance of the next round of elections. The consequence is that there is a high demand for rapid evidence about policy effects, and the early evidence often contributes overwhelmingly to shaping the narrative about whether policies are working or not.

Unfortunately, there are more than a handful of examples where the early evidence on a policy turned out to be misleading, or where a policy seemed to have delayed effects. For example, the Gates Foundation’s small school reforms were widely panned as a flop in early reviews relying on student test scores, but a number of later rigorous studies showed (sometimes substantial) positive effects on outcomes such as graduation and college enrollment. It was too late, however—the initiative had already been scrapped by the time the positive evidence started rolling in.

No Child Left Behind acquired quite a negative reputation over its first half dozen years of implementation. Its accountability policies were seen as poorly targeted (they were), and it was labeled as encouraging an array of negative unintended consequences. These views quickly became well established among both researchers and policymakers. And yet, a series of recent studies have shown meaningful effects of the law on student achievement, which has done precisely zero to change public perception.

There are all manner of policies that may fit into this category to a greater or lesser extent. A state capacity building and technical assistance policy implemented in California was shelved after a few years, but evaluations found the policy improved student learning. Several school choice studies have found null or modest effects on test scores only to turn up impacts on longer-term outcomes like graduation. Even School Improvement Grants and other turnaround strategies may qualify in this category—though the recent impact evaluation was neutral, several studies have found positive effects and many have found impacts that grow as the years progress (suggesting that longer-term evaluations may yet show effects).

How does this all relate back to Common Core and other college- and career-readiness standards? There are implications for both researchers and policymakers.

For researchers, these patterns suggest that great care needs to be taken in interpreting and presenting the results of research conducted early in the implementation of Common Core and other policies. This is not to say that researchers should not investigate the early effects of policies, but rather that they should be appropriately cautious in describing what their work means. Early impact studies will virtually never provide the “final answer” as to the effectiveness of any given policy, and researchers should explicitly caution against the interpretation of their work as such.

For policymakers, there are at least two implications. First, when creating new policies, policymakers should think about both short- and long-term outcomes that are desired. Then, they should build into the law ample time before such outcomes can be observed (i.e., ensuring that decisions are not made before the law can have its intended effects). Even if this time is not explicitly built into the policy cycle, policymakers should at least be aware of these issues and adopt a stance of patience toward policy revisions. Second, to the extent that policies build in funds or plans for evaluation, these plans should include both short- and long-term evaluations.

Clearly, these suggestions run counter to prevailing preferences for immediate gratification in policymaking, but they are essential if we are to see sustained improvement in education. At a minimum, this approach might keep us from declaring failure too soon on policies that may well turn out to be successful. Since improvement through policy is almost always a process of incremental progress, failing to learn all the lessons of new policies may hamstring our efforts to develop better policies later. Finally, jumping around from policy to policy likely contributes to reform fatigue among educators, which may even undermine the success of future unrelated policies. In short, regardless of your particular policy preferences, there is good reason to move on from the “shiny object” approach to education policy and focus instead on giving old and seemingly dull objects a chance to demonstrate their worth before throwing them in the policy landfill.

New evidence that textbooks matter

It’s been six months since I’ve written here. My apologies. In the meantime I’ve written a few pieces elsewhere, such as:

  • Here and here on the problems of “percent proficient” as a measure of school performance. The feds seem to have listened to our open letter, as they are allowing states to use performance indices (and perhaps some transformation of scale scores, though there seems to be disagreement on this point) in school accountability.
  • Here and here on public opinion on education policy and an agenda for the incoming administration (admittedly, written when I thought the incoming administration would be somewhat different than the one that’s shaping up).
  • Here describing just how “common” Common Core states’ standards are.
  • Here discussing challenges with state testing and a path forward.

The main project on which I continue to work, however, is the textbook research. We are out with our first working paper (a version of which was just recently accepted for publication in AERA Open), and a corresponding brief through Brookings’ Evidence Speaks series (on which I am now a contributor).

You should check out the brief and the paper, but the short version of the findings is that we once again identify one textbook, Houghton Mifflin California Math, as producing larger achievement gains than the other most commonly adopted textbooks in California during the period 2008-2013. These gains are in the range of 0.05 to 0.10 standard deviations, and they persist across multiple grades and years (ours is the longest study we are aware of on this topic). The gains may seem modest, but it is important to remember that they accrue to all students in these grades. Thus, for another policy that focuses only on low-achieving students to achieve the same total achievement effect, the impact would have to be much larger. And of course, as we’ve written elsewhere, the marginal cost of choosing this particular textbook over any other is close to zero (though we actually could not find price lists for the books under study, we know this to be true).
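To make the "much larger" point concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, a targeted policy that reaches 25% of students and produces no gains for anyone else; that share is a hypothetical number, not something from our paper. Under those assumptions the targeted policy would need effects of roughly 0.20 to 0.40 standard deviations on its target group to match a universal 0.05 to 0.10.

```python
def required_targeted_effect(universal_effect_sd, share_targeted):
    """Effect (in SD) a targeted policy needs on its target group to match
    the average gain of a policy that reaches every student.

    Assumes students outside the target group gain nothing.
    """
    if not 0 < share_targeted <= 1:
        raise ValueError("share_targeted must be in (0, 1]")
    return universal_effect_sd / share_targeted


# Hypothetical share of students reached by the targeted policy.
SHARE_TARGETED = 0.25

# Range of textbook effects reported in the post (0.05 to 0.10 SD).
low = required_targeted_effect(0.05, SHARE_TARGETED)
high = required_targeted_effect(0.10, SHARE_TARGETED)

print(f"Needed effect on targeted students: {low:.2f} to {high:.2f} SD")
```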

We are excited to have the paper out there after years (literally) of work just pulling the data together. I also presented the results in Sacramento and am optimistic that states may start to listen to the steadily growing drumbeat on the importance of collecting and analyzing data on textbook adoptions.



A letter to the U.S. Department of Education (final signatory list)

This is the final version of the letter, which I submitted today.


July 22, 2016


The Honorable John King

Secretary of the Education Department

400 Maryland Avenue, SW

Washington, D.C. 20202


Dear Mr. Secretary:

The Every Student Succeeds Act (ESSA) marks a great opportunity for states to advance accountability systems beyond those from the No Child Left Behind (NCLB) era. The Act (Section 1111(c)(4)(B)(i)(I)) requires states to use an indicator of academic achievement that “measures proficiency on the statewide assessments in reading/language arts and mathematics.” The proposed rulemaking (§ 200.14) would clarify this statutory provision to say that the academic achievement indicator must “equally measure grade-level proficiency on the reading/language arts and mathematics assessments.”

We write this letter to argue that the Department of Education should not mandate the use of proficiency rates as a metric of school performance under ESSA. That is, states should not be limited to measuring academic achievement using performance metrics that focus only on the proportion of students who are grade-level proficient—rather, they should be encouraged, or at a minimum allowed, to use performance metrics that account for student achievement at all levels, provided the state defines what performance level represents grade level proficiency on its reading/language arts and mathematics assessments.

Moving beyond proficiency rates as the sole or primary measure of school performance has many advantages. For example, a narrow focus on proficiency rates incentivizes schools to focus on those students near the proficiency cut score, while an approach that takes into account all levels of performance incentivizes a focus on all students. Furthermore, measuring performance using the full range of achievement provides additional and useful information for parents, practitioners, researchers, and policymakers for the purposes of decisionmaking and accountability, including more accurate information about the differences among schools.

Reporting performance in terms of the percentage above proficient is problematic in several important ways. Percent proficient:

  1. Incentivizes schools to focus only on students around the proficiency cutoff rather than all students in a school (Booher-Jennings, 2005; Neal & Schanzenbach, 2010). This can divert resources from students who are at lower or higher points in the achievement distribution, some of whom may need as much or more support than students just around the proficiency cut score (Schwartz, Hamilton, Stecher, & Steele, 2011). This has been shown to influence which students in a state benefit (i.e., experience gains in their academic achievement) from accountability regulations (Neal & Schanzenbach, 2010).
  2. Encourages teachers to focus on bringing students to a minimum level of proficiency rather than continuing to advance student learning to higher levels of performance beyond proficiency.
  3. Is not a reliable measure of school performance. For example, percent proficient is an inappropriate measure of progress over time because changes in proficiency rates are unstable and measured with error (Ho, 2008; Linn, 2003; Kane & Staiger, 2002). The percent proficient is also dependent upon the state-determined cut score for proficiency on annual assessments (Ho, 2008), which varies from state to state and over time. Percent proficient further depends on details of the testing program that shouldn’t matter, such as the composition of the items on the state test or the type of method used to set performance standards. These problems are compounded in small schools or in subgroups that are small in size.
  4. Is a very poor measure of performance gaps between subgroups, because percent proficient will be affected by how a proficiency cut score on the state assessments is chosen (Ho, 2008; Holland, 2002). Indeed, prior research suggests that using percent proficient can even reverse the sign of changes in achievement gaps over time relative to what a more accurate method would show (Linn, 2007).
  5. Penalizes schools that serve larger proportions of low-achieving students (Kober & Riddle, 2012) as schools are not given credit for improvements in performance other than the move to proficiency from not-proficient.
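
The cut-score dependence described in the fourth point can be illustrated with a small sketch. The scale scores below are invented for illustration only; the second group's scores are simply the first group's shifted down by 30 points, so the gap in average scale scores is 30 by construction. The gap in percent proficient, by contrast, moves around depending on where the cut score happens to fall.

```python
from statistics import mean

# Hypothetical scale scores (not real data). Group B is group A minus 30 points.
group_a = [300, 340, 345, 350, 355, 360, 400, 440]
group_b = [270, 310, 315, 320, 325, 330, 370, 410]


def percent_proficient(scores, cut):
    """Percent of students scoring at or above the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


def proficiency_gap(cut):
    """Gap in percent proficient between the two groups at a given cut score."""
    return percent_proficient(group_a, cut) - percent_proficient(group_b, cut)


# The gap in average scale scores does not depend on any cut score.
mean_gap = mean(group_a) - mean(group_b)

# The percent-proficient gap at three hypothetical cut scores.
gaps = {cut: proficiency_gap(cut) for cut in (300, 335, 380)}

print(f"Mean scale score gap: {mean_gap}")
for cut, gap in gaps.items():
    print(f"Cut score {cut}: percent-proficient gap = {gap} points")
```

With these made-up scores, the same 30-point difference in achievement shows up as a 12.5-point gap at a cut score of 300 or 380 and as a 62.5-point gap at a cut score of 335.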

We suggest two practices for measuring achievement that lessen or avoid these problems. Importantly, some of these practices were utilized by states in ESEA Flexibility Waivers and are improvements to NCLB practices (Polikoff, McEachin, Wrabel, & Duque, 2014).

Average Scale Scores

The best approach for measuring student achievement levels for accountability purposes under ESSA is to use average scale scores. Rather than presenting performance as the proportion of students who have met the minimum-proficiency cut score, states could present the average (mean) score of students within the school and the average performance of each subgroup of students. If the Department believes percent proficient is also important for reporting purposes, these values could be reported alongside the average scale scores.

The use of mean scores places the focus on improving the academic achievement of all students within a school and not just those whose performance is around the state proficiency cut score (Center for Education Policy, 2011). Such a practice also increases the amount of variation in school performance measures each year, providing for improved differentiation between schools that may have otherwise similar proficiency rates. In fact, Ho (2008) argues that if a single rating is going to be used for reporting on performance, it should be a measure of the average performance because such measures incorporate the value of every score (student) into the calculation and the average can be used for more advanced analyses. The measurement of gaps between key demographic groups of students, a key goal of the ESSA law, is dramatically improved with the use of average scores rather than the proportion of proficient students (Holland, 2002; Linn, 2007).

Proficiency Indexes

If average scale scores cannot be used, a weaker alternative that is still superior to percent proficient would be to allow states to use proficiency indexes. Schools under this policy would be allocated points based on multiple levels of performance. For example, a state could identify four levels of performance on annual assessments: Well Below Proficient, Below Proficient, Proficient, and Advanced Proficient. Schools receive no credit for students Well Below Proficient, partial credit for students who are Below Proficient, full credit for students reaching Proficiency, and additional credit for students reaching Advanced Proficiency. Here we present an example using School A and School B.

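Proficiency Index Example

| Proficiency Category  | School A: Points Per Student | School A: # of Students | School A: Index Points | School B: Points Per Student | School B: # of Students | School B: Index Points |
|-----------------------|------------------------------|-------------------------|------------------------|------------------------------|-------------------------|------------------------|
| Well Below Proficient | 0.0                          | 27                      | 0.0                    | 0.0                          | 18                      | 0.0                    |
| Below Proficient      | 0.5                          | 18                      | 9.0                    | 0.5                          | 27                      | 13.5                   |
| Proficient            | 1.0                          | 33                      | 33.0                   | 1.0                          | 26                      | 26.0                   |
| Advanced Proficient   | 1.5                          | 22                      | 33.0                   | 1.5                          | 29                      | 43.5                   |
| Total                 |                              | 100                     | 75.0                   |                              | 100                     | 83.0                   |

School A: NCLB Proficiency Rate: 55%; ESSA Proficiency Index: 75
School B: NCLB Proficiency Rate: 55%; ESSA Proficiency Index: 83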

Under NCLB proficiency rate regulations, both School A and School B would have received a 55% proficiency rate score. Using a Proficiency Index, the performance of these schools would no longer be identical. A state would be able to compare the two schools while simultaneously identifying annual meaningful differentiation in the performance of School A from that of School B. The hypothetical case presented here is not the only way a proficiency index can be used. Massachusetts is one example of a state that has used a proficiency index for the purposes of identifying low-performing schools and gaps between subgroups of students (see: ESEA Flexibility Request: Massachusetts, page 32). These indexes are understandable for practitioners, family members, and administrators while also providing additional information regarding the performance of students who are not grade-level proficient.
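
For readers who want to check the arithmetic, the School A and School B example can be reproduced with a short sketch. The point values and student counts come from the example above; the function and variable names are our own, and dividing by enrollment and scaling to 100 is one possible convention (here each school has exactly 100 students, so the index equals the total index points).

```python
# Points per student at each performance level, as in the example above.
POINTS = {
    "Well Below Proficient": 0.0,
    "Below Proficient": 0.5,
    "Proficient": 1.0,
    "Advanced Proficient": 1.5,
}

# Number of students at each level in the two hypothetical schools.
school_a = {
    "Well Below Proficient": 27,
    "Below Proficient": 18,
    "Proficient": 33,
    "Advanced Proficient": 22,
}
school_b = {
    "Well Below Proficient": 18,
    "Below Proficient": 27,
    "Proficient": 26,
    "Advanced Proficient": 29,
}


def proficiency_rate(counts):
    """Percent of students at Proficient or above (the NCLB-style measure)."""
    total = sum(counts.values())
    at_or_above = counts["Proficient"] + counts["Advanced Proficient"]
    return 100.0 * at_or_above / total


def proficiency_index(counts, points=POINTS):
    """Index points earned per 100 students, crediting every performance level."""
    total = sum(counts.values())
    earned = sum(points[level] * n for level, n in counts.items())
    return 100.0 * earned / total


for name, counts in (("School A", school_a), ("School B", school_b)):
    print(name, proficiency_rate(counts), proficiency_index(counts))
```

Both schools come out at a 55% proficiency rate, while the index separates them at 75 and 83.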

The benefit of using such an index, relative to using the proportion of proficient students in a school, is that it incentivizes a focus on all students, not just those around an assessment’s proficiency cut score (Linn, Baker, & Betebenner, 2002). Moreover, schools with large proportions of students well below the proficiency cut score are given credit for moving students to higher levels of performance even if still below the cut score (Linn, 2003). The use of a proficiency index or providing schools credit for students at different points in the achievement distribution improves the construct validity of the accountability measures over the NCLB proficiency rate measures (Polikoff et al., 2014). In other words, the inferences made about schools (e.g., low-performing or bottom 5%) using the proposed measures are more appropriate than those made using proficiency rates alone.

What We Recommend

Given the findings cited above, we believe the Department of Education should revise its regulations to one of two positions:

  • Explicitly endorsing or encouraging states to use one of the two above-mentioned approaches as an alternative to proficiency rates as the primary measure of school performance. Using average scale scores is the superior method.
  • Failing that, clarifying that the law is neutral about the use of proficiency rates versus one of the two above-mentioned alternatives to proficiency rates as the primary measure of school performance.

With the preponderance of evidence showing that schools and teachers respond to incentives embedded in accountability systems, we believe the first option is the best choice. This option leaves states the authority to determine school performance how they see fit but encourages them to incorporate what we have learned through research about the most accurate and appropriate way to measure school performance levels.

Our Recommendation is Consistent with ESSA

Section 1111(c)(4)(A) of ESEA, as amended by ESSA, requires each state to establish long-term goals:

“(i) for all students and separately for each subgroup of students in the State—

(I) for, at a minimum, improved—

(aa) academic achievement, as measured by proficiency on the annual assessments required under subsection (b)(2)(B)(v)(I);”

And Section 1111(c)(4)(B) of ESEA requires the State accountability system to have indicators that are used to differentiate all public schools in the State, including—(i) “academic achievement—(I) as measured by proficiency on the annual assessments required [under other provisions of ESSA].”

Our suggested approach is supportable under these provisions based on the following analysis. The above-quoted provisions in the law that mandate long-term goals and indicators of student achievement based on proficiency on annual assessments do not prescribe how a state specifically uses the concept of proficient performance on the state assessments. The statute does not prescribe that “proficiency” be interpreted to compel differentiation of schools based exclusively on “proficiency rates.” Proficiency is commonly taken to mean “knowledge” or “skill” (Merriam Webster defines it as “advancement in knowledge or skill” or “the quality or state of being proficient”, where “proficient” is defined as “well advanced in an art, occupation, or branch of knowledge”). Under either of these definitions, an aggregate performance measure such as the two options described above would clearly qualify as involving a measure of proficiency. Both of the above-mentioned options provide more information about the average proficiency level of a school than an aggregate proficiency rate. Moreover, they address far more effectively than proficiency rates the core purposes of ESSA, including incentivizing more effective efforts to educate all children and providing broad discretion to states in designing their accountability systems.

We would be happy to provide more information on these recommendations at your pleasure.


Morgan Polikoff, Ph.D., Associate Professor of Education, USC Rossier School of Education


Educational Researchers and Experts

Alice Huguet, Ph.D., Postdoctoral Fellow, School of Education and Social Policy, Northwestern University

Andrew Ho, Ph.D., Professor of Education, Harvard Graduate School of Education

Andrew Saultz, Ph.D., Assistant Professor, Miami University (Ohio)

Andrew Schaper, Ph.D., Senior Associate, Basis Policy Research

Anna Egalite, Ph.D., Assistant Professor of Education, North Carolina State University

Arie van der Ploeg, Ph.D., retired Principal Researcher, American Institutes for Research

Cara Jackson, Ph.D., Assistant Director of Research & Evaluation, Urban Teachers

Christopher A. Candelaria, Ph.D., Assistant Professor of Public Policy and Education, Vanderbilt University

Cory Koedel, Ph.D., Associate Professor of Economics and Public Policy, University of Missouri

Dan Goldhaber, Ph.D., Director, Center for Education Data & Research, University of Washington Bothell

Danielle Dennis, Ph.D., Associate Professor of Literacy Studies, University of South Florida

Daniel Koretz, Ph.D., Henry Lee Shattuck Professor of Education, Harvard Graduate School of Education

David Hersh, Ph.D. Candidate, Rutgers University Bloustein School of Planning and Public Policy

David M. Rochman, Research and Program Analyst, Moose Analytics

Edward J. Fuller, Ph.D., Associate Professor of Education Policy, The Pennsylvania State University

Eric A. Houck, Associate Professor of Educational Leadership and Policy, University of North Carolina at Chapel Hill

Eric Parsons, Ph.D., Assistant Research Professor, University of Missouri

Erin O’Hara, former Assistant Commissioner for Data & Research, Tennessee Department of Education

Ethan Hutt, Ph.D., Assistant Professor of Education, University of Maryland College Park

Eva Baker, Ed.D., Distinguished Research Professor, UCLA Graduate School of Education and Information Studies, Director, Center for Research on Evaluation, Standards, and Student Testing, Past President, American Educational Research Association

Greg Palardy, Ph.D., Associate Professor, University of California, Riverside

Heather J. Hough, Ph.D., Executive Director, CORE-PACE Research Partnership

Jason A. Grissom, Ph.D., Associate Professor of Public Policy and Education, Vanderbilt University

Jeffrey Nellhaus, Ed.M., Chief of Assessment, PARCC Inc., former Deputy Commissioner, Massachusetts Department of Elementary and Secondary Education

Jeffrey W. Snyder, Ph.D., Assistant Professor, Cleveland State University

Jennifer Vranek, Founding Partner, Education First

John A. Epstein, Ed.D., Education Associate Mathematics, Delaware Department of Education

John Q. Easton, Ph.D., Vice President, Programs, Spencer Foundation, former Director, Institute of Education Sciences

John Ritzler, Ph.D., Executive Director, Research & Evaluation Services, South Bend Community School Corporation

Jonathan Plucker, Ph.D., Julian C. Stanley Professor of Talent Development, Johns Hopkins University

Joshua Cowen, Ph.D., Associate Professor of Education Policy, Michigan State University

Katherine Glenn-Applegate, Ph.D., Assistant Professor of Education, Ohio Wesleyan University

Linda Darling-Hammond, Ed.D., President, Learning Policy Institute, Charles E. Ducommun Professor of Education Emeritus, Stanford University, Past President, American Educational Research Association

Lindsay Bell Weixler, Ph.D., Senior Research Fellow, Education Research Alliance for New Orleans

Madeline Mavrogordato, Ph.D., Assistant Professor, K-12 Educational Administration, Michigan State University

Martin R. West, Ph.D., Associate Professor, Harvard Graduate School of Education

Matt Chingos, Ph.D., Senior Fellow, Urban Institute

Matthew Di Carlo, Ph.D., Senior Fellow, Albert Shanker Institute

Matthew Duque, Ph.D., Data Strategist, Baltimore County Public Schools

Matthew A. Kraft, Ed.D., Assistant Professor of Education and Economics, Brown University

Michael H. Little, Royster Fellow and Doctoral Student, University of North Carolina at Chapel Hill

Michael Hansen, Ph.D., Senior Fellow and Director, Brown Center on Education Policy, Brookings Institution

Michael J. Petrilli, President, Thomas B. Fordham Institute

Nathan Trenholm, Director of Accountability and Research, Clark County (NV) School District

Tiên Lê, Doctoral Fellow, USC Rossier School of Education

Raegen T. Miller, Ed.D., Research Fellow, Georgetown University

Russell Brown, Ph.D., Chief Accountability Officer, Baltimore County Public Schools

Russell Clement, Ph.D., Research Specialist, Broward County Public Schools

Sarah Reckhow, Ph.D., Assistant Professor of Political Science, Michigan State University

Sean P. “Jack” Buckley, Ph.D., Senior Vice President, Research, The College Board, former Commissioner of the National Center for Education Statistics

Sherman Dorn, Ph.D., Professor, Mary Lou Fulton Teachers College, Arizona State University

Stephani L. Wrabel, Ph.D., USC Rossier School of Education

Thomas Toch, Georgetown University

Tom Loveless, Ph.D., Non-resident Senior Fellow, Brookings Institution


K-12 Educators

Alexander McNaughton, History Teacher, YES Prep Charter School, Houston, TX

Andrea Wood Reynolds, District Testing Coordinator, Northside ISD, TX

Angela Atkinson Duina, Ed.D., Title I School Improvement Coordinator, Portland Public Schools, ME

Ashley Baquero, J.D., English/Language Arts Teacher, Durham, NC

Brett Coffman, Ed.S., Assistant Principal, Liberty High School, MO

Callie Lowenstein, Bilingual Teacher, Washington Heights Expeditionary Learning School, NY

Candace Burckhardt, Special Education Coordinator, Indigo Education

Daniel Gohl, Chief Academic Officer, Broward County Public Schools, FL

Danielle Blue, M.Ed., Director of Preschool Programming, South Kingstown Parks and Recreation, RI

Jacquline D. Price, M.Ed., County School Superintendent, La Paz County, AZ

Jennifer Taubenheim, Elementary Special Education Teacher, Idaho Falls, ID

Jillian Haring, Staff Assistant, Broward County Public Schools, FL

Juan Gomez, Middle School Math Instructional Coach, Carmel High School, Carmel, CA

Mahnaz R. Charania, Ph.D., GA

Mary F. Johnson, MLS, Ed.D., Retired school librarian

MaryEllen Falvey, M.Ed., NBCT, Office of Academics, Broward County Public Schools, FL

Meredith Heikes, 6th grade STEM teacher, Quincy School District, WA

Mike Musialowski, M.S., Math/Science Teacher, Taos, NM

Misty Pier, Special Education Teacher, Eagle Mountain Saginaw ISD, TX

Nell L. Forgacs, Ed.M., Educator, MA

Oscar Garcia, Social Studies Teacher, El Paso Academy East, TX

Patricia K. Hadley, Elementary School Teacher, Retired, Twin Falls, ID

Samantha Arce, Elementary Teacher, Phoenix, AZ

Theodore A. Hadley, High School/Middle School Teacher, Retired, Twin Falls, ID

Tim Larrabee, M.Ed., MAT, Upper Elementary Teacher, American International School of Utah

Troy Frystak, 5/6 Teacher, Springwater Environmental Sciences School, OR


Other Interested Parties

Arnold F. Shober, Ph.D., Associate Professor of Government, Lawrence University

Celine Coggins, Ph.D., Founder and CEO, Teach Plus

David Weingartner, Co-Chair Minneapolis Public Schools 2020 Advisory Committee

Joanne Weiss, former chief of staff to U.S. Secretary of Education Arne Duncan

Justin Reich, Ed.D., Executive Director, Teaching Systems Lab, MIT

Karl Rectanus, CEO, Lea(R)n, Inc.

Kenneth R. DeNisco, Ph.D., Associate Professor, Physics & Astronomy, Harrisburg Area Community College

Kimberly L. Glass, Ph.D., Pediatric Neuropsychologist, The Stixrud Group

Mark Otter, COO, VIF International Education

Patrick Dunn, Ph.D., Biomedical Research Curator, Northrop Grumman TS

Robert Rothman, Education Writer, Washington, DC

Steven Gorman, Ph.D., Program Manager, Academy for Lifelong Learning, LSC-Montgomery

Torrance Robinson, CEO, trovvit


Booher-Jennings, J. (2005). Below the bubble: “Educational triage” and the Texas accountability system. American Educational Research Journal, 42(1), 231–268.

Center on Education Policy. (2011, May 3). An open letter from the Center on Education Policy to the SMARTER Balanced Assessment Consortium and the Partnership for Assessment of Readiness for College and Career. Retrieved from

Ho, A. D. (2008). The problem with “proficiency”: Limitations of statistics and policy under No Child Left Behind. Educational Researcher, 37(6), 351–360.

Holland, P. W. (2002). Two measures of change in the gaps between the CDFs of test-score distributions. Journal of Educational and Behavioral Statistics, 27(1), 3–17.

Kober, N., & Riddle, W. (2012). Accountability issues to watch under NCLB waivers. Washington, DC: Center on Education Policy.

Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32(7), 3–13.

Linn, R. L. (2007). Educational accountability systems. Paper presented at the CRESST Conference: The Future of Test-Based Educational Accountability.

Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31(6), 3–16.

Neal, D., & Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and test-based accountability. Review of Economics and Statistics, 92, 263–283.

Ng, H. L., & Koretz, D. (2015). Sensitivity of school-performance ratings to scaling decisions. Applied Measurement in Education, 28(4), 330–349.

Polikoff, M. S., McEachin, A., Wrabel, S. L., & Duque, M. (2014). The waive of the future? School accountability in the waiver era. Educational Researcher, 43(1), 45–54.

Schwartz, H. L., Hamilton, L. S., Stecher, B. M., & Steele, J. L. (2011). Expanded measures of school performance. Santa Monica, CA: The RAND Corporation.






More evidence that the test matters

Well, it’s been two months since my last post. In those two months, a lot has happened. I’ve continued digging into the textbook adoption data (this was covered on an EdWeek blog and I also wrote about it for Brookings). Fordham also released their study of the content and quality of next-generation assessments, on which I was a co-author (see my parting thoughts here). Finally, just last week I was granted tenure at USC. So I’ve been busy and haven’t written here as much as I should.

Today I’m writing about a new article of mine that’s just coming out in Educational Assessment (if you want a copy, shoot me an email). This is the last article I’ll write using the Measures of Effective Teaching data (I previously wrote here and here using these data). This paper asks a very simple question: looking across the states in the MET sample, is there evidence that the correlations of observational and student survey measures with teacher value-added vary systematically? In other words, are the tests used in these states differentially sensitive to these measures of instructional quality?

This is an important question for many reasons. Most obviously, we are using both value-added scores and instructional quality measures (observations, surveys) for an increasingly wide array of decisions, both high- and low-stakes. For any kind of decision we want to make, we want to be able to confidently say that the assessments used for value-added are sensitive to the kinds of instructional practices we think of as being “high quality.” Otherwise, for instance, it is hard to imagine how teachers could be expected to improve their value-added through professional development opportunities (i.e., if no observed instructional measures predict value-added, how can we expect teachers to improve their value-added?). The work is also important because, to the extent that we see a great deal of variation across states/tests in sensitivity to instruction, it may necessitate greater attention to the assessments themselves in both research and policy [1]. As I argue in the paper, the MET data are very well suited to this kind of analysis, because there were no stakes (and thus limited potential for gaming).

The methods for investigating the question are very straightforward–basically I just correlate or regress value-added estimates from the MET study on teacher observation scores and student survey scores separately by state. Where I find limited or no evidence of relationships, I dig in further by doing things like pulling out outliers, exploring nonlinear relationships, and determining relationships at the subscale or grade level.
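For readers who want to see the basic step concretely, here is a minimal Python sketch of correlating value-added with one instructional measure separately by state. It is an illustration only, not the code used in the paper: the field names (`state`, `value_added`, `observation`), the helper functions, and the toy records are all made up, and it covers only the simple correlation, not the regressions or the follow-up checks.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def correlations_by_state(records, measure):
    """Correlate value-added with one instructional measure, state by state."""
    by_state = defaultdict(list)
    for row in records:
        by_state[row["state"]].append((row["value_added"], row[measure]))
    return {
        state: pearson_r([va for va, _ in pairs], [m for _, m in pairs])
        for state, pairs in sorted(by_state.items())
    }


# Hypothetical teacher records for illustration only (not MET data).
toy_records = [
    {"state": "A", "value_added": -0.2, "observation": 2.0},
    {"state": "A", "value_added": 0.0, "observation": 2.5},
    {"state": "A", "value_added": 0.1, "observation": 2.6},
    {"state": "A", "value_added": 0.3, "observation": 3.1},
    {"state": "B", "value_added": -0.1, "observation": 2.0},
    {"state": "B", "value_added": 0.1, "observation": 2.0},
    {"state": "B", "value_added": -0.1, "observation": 3.0},
    {"state": "B", "value_added": 0.1, "observation": 3.0},
]

for state, r in correlations_by_state(toy_records, "observation").items():
    print(f"state {state}: r = {r:.2f}")
```

In the toy data, state A's test tracks the observation score closely while state B's does not track it at all, which is the kind of state-to-state contrast the analysis looks for.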

What I find, and how that should be interpreted, probably depends on where you sit. I do find at least some correlations of value-added with observations and student surveys in each state and subject. However, there is a good deal of state-to-state variation. For instance, in some states, student surveys correlate with value-added as high as .28 [2], while in other states those correlations are negative (though not significantly different from zero).

Analyzing results at the subscale level–where observational and survey scores are probably most likely to be useful–does not help. Perhaps because subscales are much less reliable than total scores, there are very few statistically significant correlations of subscales with VAM scores, and these too differ by state. If this pattern were to hold in new teacher evaluation systems being implemented in the states, it would raise perplexing questions about what kinds of instruction these value-added scores were sensitive to.

Perhaps the worst offender in my data is state 4 in English language arts (I cannot name states due to data restrictions). For this state, there are no total score correlations of student surveys or any of the observational measures with teacher value-added. There is one statistically significant correlation at a single grade level, and there is also one statistically significant correlation for a single subscale on one observational instrument. But otherwise, the state ELA tests in this state seem to be totally insensitive to instructional quality as measured by the Framework for Teaching, the CLASS, and the ELA-specific PLATO (not to mention the Tripod student survey). Certainly it’s possible these tests could be sensitive to some other measures not included in MET, but it’s not obvious to me what those would be (nor is it obvious that new state systems will be implemented as carefully as MET was).

I conclude with extended implications for research and practice. I think this kind of work raises a number of questions, such as:

  1. What is it about the content of these tests that makes some sensitive and others not?
  2. What kind of instruction do we want our tests to be sensitive to?
  3. How sensitive is “sensitive enough?” That is, how big a correlation do we want or need between value-added and instructional measures?
  4. If we want to provide useful feedback to teachers, we need reliable subscores on observational measures. How can we achieve that?

I enjoyed writing this article, and I believe it may well be the paper that took me the longest from beginning to submission. I hope you find it useful and that it raises additional questions about teacher evaluation moving forward. And I welcome your reactions (though I’m done with MET data, so if you want more analysis, I’m not your man)!


[1] The oversimplified but not-too-far-off summary of most value-added research is that it is almost completely agnostic to the test that’s used to calculate the VAM.

[2] I did not correct the correlations for measurement error, in contrast to the main MET reports.
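As a rough illustration of what such a correction does, the standard approach (Spearman's correction for attenuation) divides the observed correlation by the square root of the product of the two measures' reliabilities. The sketch below assumes that formula. The .28 is the largest survey correlation mentioned above, but the two reliabilities are hypothetical placeholders, not estimates from MET or from my paper.

```python
from math import sqrt


def disattenuate(observed_r, reliability_x, reliability_y):
    """Spearman's correction: estimate the correlation between true scores."""
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must be in (0, 1]")
    return observed_r / sqrt(reliability_x * reliability_y)


# 0.28 comes from the post; both reliabilities are made-up values
# chosen only to show how the correction works.
corrected = disattenuate(0.28, reliability_x=0.80, reliability_y=0.50)
print(round(corrected, 2))
```

The lower the reliabilities, the more the uncorrected correlation understates the underlying relationship, which is one reason the uncorrected figures here look smaller than corrected ones would.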

An awful lot of districts don’t know what textbooks are used in their schools

That’s one of many takeaways of my textbook research so far. I guess to many people this is no surprise, but it seems crazy to me. Knowledge of what is going on inside schools strikes me as the most basic function of the district office. And yet I would estimate around 10% of the districts that have responded to my FOIA requests have said they have no documents listing the textbooks in use, and probably another 30-50% clearly have to invent such a document to satisfy my request [1]. Instead, I get a lot of letters like this:

Thank you for using the [district name] FOIA Center.

The FOIA office has been advised by the appropriate departments that the records you seek are not kept in the normal course of business. That is, a full and complete list of all mathematics and science textbooks currently in use by grade and the year the textbook was first used. As written, this request is categorical and unduly burdensome in nature and would require extensive resources to both search for information, which would most likely require a manual school by school search, and analysis to determine the other data points you are seeking. For these reasons, [district] is denying this request pursuant to [state statute] and invites you to narrow your request to manageable proportions. If [district] does not receive a revised request from you within five (5) business days of this response, this request will be closed.

Apparently to many folks this kind of arrangement is just fine–school sites should be able to decide all this stuff themselves. I can buy the argument that schools should have autonomy over curriculum materials (though I doubt that’s very efficient or good for kids), but even if you believe that’s the case, shouldn’t the district at least track how their money is being spent?

This is one of the research questions that’s emerged over time as I’ve gone through this textbook project, and it’s something I’ll investigate just as soon as I finish this round of FOIAs. My hypothesis? I suspect Ilana Horn is right about the consequences of this kind of non-leadership by districts:

I hope we’re wrong, but I doubt it.


[1] Districts don’t actually have to do this under the letter of FOIA law. So I very much appreciate the efforts.

This study is based upon work supported by the National Science Foundation under Grant No. 1445654 and the Smith Richardson Foundation. Any opinions, findings, and conclusions or recommendations expressed in this study are those of the author(s) and do not necessarily reflect the views of the funders.

Recruiting teachers!


I’m looking to recruit a few teachers (9, specifically) to participate in a study to test survey measures of teachers’ instruction for use in a large national study of standards implementation.
Teachers who participate in the work will be asked to do three things:
  1. Complete a bi-weekly (every other week) log survey describing their instruction in either mathematics or ELA over the course of the spring semester in a target class. The first log will be in either the last week of January or the first week of February. All logs will be online. We expect the first log will take 45 minutes to an hour, but that teachers will get more efficient at completing the logs as they go through the study.
  2. Complete a year-end online survey asking similar questions to the bi-weekly log. This will be done at the end of the school year, likely in June.
  3. Turn in 2 weeks’ worth of (blank) assignments and assessments that are given to students in the target class at any point during the spring semester.
For participating in this activity, we will provide each teacher with a $325 e-gift card. Please note that all work should be done outside teachers’ regular work hours.
The only eligibility requirements are:
  1. Teach a mathematics or ELA class in grades K-12.
  2. Work in a public (including charter) or private school that does not require separate research approval.
If you are interested in participating, please email me at my last name at Please share widely!