The Don’t Do It Depository

Cross posted from here.

We have known for quite a while that schools engage in all manner of tricks to improve their performance under accountability systems. These behaviors range from the innocuous—teaching the content in state standards—to the likely harmful—outright cheating.

A new study last week provided more evidence of the unintended consequences of another gaming behavior—reassigning teachers based on perceived effectiveness. Researchers Jason A. Grissom, Demetra Kalogrides and Susanna Loeb analyzed data from a large urban district and found that administrators moved the most effective teachers to the tested grades (3-6) and the least effective to the untested grades (K-2).

On the surface, this might seem like a strategy that would boost accountability ratings without affecting students’ overall performance. After all, if you lose 10 points in kindergarten but gain 10 in third grade, isn’t the net change zero?

In fact, the authors found that moving the least effective teachers to the earlier grades harmed students’ overall achievement, because those early grades simply matter more to students’ long-term trajectories. The schools’ gaming behaviors were having real, negative consequences for children.

This strategy should go down in the annals of what doesn’t work, a category that we simply don’t pay enough attention to. Over the past 15 years, there has been a concerted effort in education research to find out “what works” and to share these policies and practices with schools.

The best example of this is the push for rigorous evidence in education research through the Institute of Education Sciences and the What Works Clearinghouse. This may well be a productive strategy, but the WWC is chock full of programs that don’t seem to “work,” at least according to its own evidence standards, and I don’t think anyone believes the WWC has had its desired impact. (The former director of IES himself has joked that it might more properly be called the What Doesn’t Work Clearinghouse.)

These two facts together led me to half-joke on Twitter that maybe states or the feds should change their approach toward evidence. Rather than (or in addition to) encouraging schools and districts to do good things, they should start discouraging them from doing things we know or believe to be harmful.

This could be called something like the “Don’t Do It Depository” or the “Bad Idea Warehouse” (marketing experts, help me out). Humor aside, I think there is some merit to this idea. Here, then, are a couple of the policies or practices that might be included in the first round of the Don’t Do It Depository.

The counterproductive practice of assigning top teachers to tested grades is certainly a good candidate. While we’re at it, we might also discourage schools from shuffling teachers across grades for other reasons, as recent research finds this common practice is quite harmful to student learning.

Another common school practice, particularly in response to accountability, is to explicitly prepare students for state tests. Of course, test preparation can range from teaching the content likely to be tested all the way to teaching explicit test-taking strategies (e.g., write longer essays because those get you more points). Obviously the latter is not going to improve students’ actual learning, but the former might. In any case, test preparation seems to be quite common, but there’s less evidence than you might think that it actually helps. For instance:

  • A study of the ACT (which is administered statewide) in Illinois found test strategies and item practice did not improve student performance, but coursework did.
  • An earlier study in Illinois found that students exposed to more authentic intellectual work saw greater gains on the standardized tests than those not exposed to this content.
  • In the Measures of Effective Teaching Project, students were surveyed about many dimensions of the instruction they received and these were correlated with their teachers’ value-added estimates. Survey items focusing on test preparation activities were much more weakly related to student achievement gains than items focusing on instructional quality.
  • Research doesn’t even indicate that direct test preparation strategies such as those for the ACT or SAT are particularly effective, with actual student gains far lower than advertised by the test preparation companies.

In short, there’s really not great evidence that test preparation works. In light of this evidence, perhaps states or the feds could offer guidance on what kind of and how much test preparation is appropriate and discourage the rest.

Other activities or beliefs that should be discouraged include “learning styles,” the belief that individuals have preferred ways of learning such as visual vs. auditory. The American Psychological Association has put out a brief explainer debunking the existence of learning styles. Similarly, students are not digital natives, nor can they multitask, nor should they guide their own learning.

There are many great lists of bad practices that already exist; states or the feds should simply repackage them to make them shorter, clearer, and more actionable. They should also work with experts in conceptual change, given that these briefs will be directly refuting many strongly held beliefs.

Do I think this strategy would convince every school leader to stop doing counterproductive things? Certainly I do not. But this strategy, if well executed, could probably effect meaningful change in some schools, and that would be a real win for children at very little cost.

Using Research to Drive Policy and Practice

Cross posted from here.

I’m excited to be joining the Advisory Board of Evidence Based Education, and I’m looking forward to contributing what I can to their important mission. In this post, I thought I’d briefly introduce myself and my research and talk about my philosophy for using research to affect policy and practice.

My research focuses on the design, implementation and effects of standards, assessment and accountability policies. Over my last seven years as an Assistant (now Associate) Professor at the University of Southern California Rossier School of Education, I have studied a number of issues in these areas, including:

  • The alignment of state assessments of student achievement with content standards;
  • The design of states’ school accountability systems;
  • The instructional responses of teachers to state standards and assessments; and
  • The alignment and impacts of elementary mathematics textbooks.

My current work continues in this vein, studying the implementation of new “college- and career-ready” standards and the adoption, use and effects of curriculum materials in the core academic subjects.

As is clear from the above links, I have of course published my research in the typical academic journals—this kind of publication is the coin of the realm for academics at research-focused institutions. And while I also find great intrinsic value in publishing in these venues, I know that I will not be fully satisfied if my work exists solely for the eyes of other academics.

When I joined an education policy PhD program in 2006, one of the key drivers of my decision was that I wanted to do work that was, at the very least, relevant to policy (actual impact was an even more ideal goal). Unfortunately, while my PhD programs at Vanderbilt and Penn prepared me well for the rigors of academia, they did not equip me with the tools to drive policy or practice through my research. Those skills have developed over time, through trial and error and through advice from colleagues. Here are a few lessons I have learned that may be of use to others thinking of working to ensure that their research is brought to bear on policy and practice.

First, it goes without saying that research will not be useful to policymakers or practitioners if it is not on topics that are of interest to them. This means researchers should, at a minimum, conduct research on current policies (so timeliness is paramount). Even better would be selecting research topics (or even conducting research) together with policymakers or practitioners. If the topics come from the eventual users, they are much more likely to use the results.

Second, even the best-designed research will not affect policy or practice if it is only published in peer-reviewed journals. Early in my academic career, I attended a networking and mentoring workshop with panels of leaders from DC. I had just come off publishing an article on an extremely new and relevant federal policy in a top education journal. The paper was short (5,000 words) and accessible, I thought, so surely it would be picked up and used by congressional staff or folks at the Department of Education. The peals of laughter from the panelists when I proposed that my work might matter in its current form certainly disabused me of the idea that the research-to-policy pipeline is an easy one.

Equipped with this knowledge, I began specifically writing and publishing in outlets that I thought would be more likely to reach the eyes of those in power. These include publishing articles in practitioner-oriented journals and magazines, briefs published for state and federal audiences, and even blog posts on personal and organization websites. Out of everything I’ve written, I think the piece that might have had the greatest impact is an open letter I wrote on my personal blog about the design of accountability systems under the new federal education law. This kind of writing is very different from the peer-reviewed kind, and specific training is needed—hopefully doctoral programs will begin to offer this kind of training (and universities will begin to reward this kind of engagement).

Third, networks are absolutely essential for research to be taken up. The best research, supported by the best nonacademic writing (blogs, briefs, etc.), will not matter if no one sees it. Getting your ideas in front of people requires the building of networks, and again this is something that must be done consciously. Networks can certainly be built through social media, and they can also be built by presenting research at policy and practice conferences, through media engagement, and through work with organizations like Evidence Based Education.

These are just a few of the ideas I have accumulated over time in my goal to bring my research to bear on current issues in policy and practice. I hope that my work with Evidence Based Education will allow me to contribute to their efforts in this area as well. Through our collaboration, I think we can continue to improve the production and use of quality evidence in education.

My remarks upon winning the AERA Early Career Award

This weekend in San Antonio I was honored to receive the AERA Early Career Award. I was truly and deeply grateful to have been selected for this award, especially given the many luminaries of education research who’ve previously received it. I hope that the next phase of my career continues to meaningfully affect education research, policy, and practice. Next year I will give a lecture where I will talk about my agenda so far and my vision for the next 10 years of my research.

Of course, I couldn’t have received this award without a great deal of support from family, friends, and colleagues. Here’s what I said in my 90-second remarks:

Thank you to the committee for this award, and to my colleagues Bill Tierney and Katharine Strunk for nominating me. I’m profoundly honored.

On June 8, 2006, I packed up my bags and left Chicago to start my PhD at Vanderbilt University. I’d applied to their MPP program, but someone on their admissions committee saw something promising in my application and they convinced me to do a PhD instead.

That moment in the admissions meeting turns out to have defined my life. Six days after I moved to Nashville I had dinner with a handsome southern gentleman who would later become my husband. At the same time, I started working on a couple of research projects led by my advisor Andy Porter and his wife and co-conspirator Laura Desimone, work for which I followed them from Vandy to Penn a year later. In many ways, Andy is like a father to me, and I owe much of my academic success to him.

Everything else, I owe to my mother, who raised my brother and me mostly alone through financial and personal struggles. She taught me that common sense and honesty are just as important as smarts and hard work, and she showed me how to lead a simple, uncluttered life.

Nothing I’ve accomplished since I started studying education policy has happened without my husband, Joel, by my side. He is truly my other half.

My goal as an academic is to produce research with consequence—to bring evidence to bear on the important education policy issues of our day. I’m fortunate to be at USC Rossier, a school that truly values impact and public scholarship and supports its junior faculty to do this kind of research. In these fraught times, we as a community of scholars committed to truth must always, as we say at USC, Fight On!

Thank you.

Let’s leave the worst parts of NCLB behind

This was originally posted at the Education Gadfly.

“Those who cannot remember the past are condemned to repeat it.” It turns out this adage applies not just to global politics, but also to state education policies, and groups on both the left and the right should take heed.

No Child Left Behind (NCLB) is among the most lamented education policies in recent memory, and few of NCLB’s provisions received as much scorn as its singular focus on grade-level proficiency as the sole measure of school performance. Researchers and practitioners alike faulted the fetishizing of proficiency for things like:

  • Encouraging schools to focus their attention on students close to the proficiency cut (the “bubble kids”) as opposed to all students, including high- and low-achievers.
  • Incentivizing states to lower their definitions of “proficiency” over time.
  • Resulting in unreliable ratings of school performance that were highly sensitive to the cut scores chosen.
  • Misrepresenting both school “effectiveness” (since proficiency is so highly correlated with student characteristics) and “achievement gaps” (since the magnitude of gaps again depends tremendously on where the proficiency cut is set).
  • Throwing away vast quantities of useful information by essentially turning every child into a 1 (proficient) or a 0 (not).

(For more details on these criticisms and links to relevant research, see my previous writing on this topic.)
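
To make the cut-score problem concrete, here is a minimal sketch with made-up numbers. The two schools, their scale scores, and the two candidate cuts (300 and 308) are all hypothetical; the point is only to show how the "gap" between schools can swing with the choice of cut even though no student's score has changed.

```python
# Hypothetical scale scores for two small schools (illustrative numbers only).
school_a = [290, 295, 305, 310, 340, 345]
school_b = [260, 270, 302, 304, 306, 380]


def percent_proficient(scores, cut):
    """Share of students scoring at or above the proficiency cut, in percent."""
    return 100.0 * sum(score >= cut for score in scores) / len(scores)


def mean_score(scores):
    """Average scale score, which uses every student's result."""
    return sum(scores) / len(scores)


# The same students, judged against two different (assumed) proficiency cuts.
for cut in (300, 308):
    gap = percent_proficient(school_a, cut) - percent_proficient(school_b, cut)
    print(f"cut={cut}: A minus B = {gap:+.1f} percentage points")

print(f"mean A = {mean_score(school_a):.1f}, mean B = {mean_score(school_b):.1f}")
```

With these invented scores, the two schools look identical at a cut of 300 and far apart at a cut of 308, while the mean scale scores stay the same in both cases because they do not depend on a cut at all.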

With some prodding from interested researchers and policy advocates, the Department of Education is allowing states to rectify this situation. Specifically, states now are permitted to use measures other than “percent proficient” for their measure of academic achievement under the Every Student Succeeds Act (ESSA). In previous posts, I recommended that the feds allow the use of performance indexes and average scale scores; performance indexes are now specifically allowed under the peer-review guidance the Department published a few weeks ago.

Despite this newfound flexibility, of the seventeen states with draft ESSA accountability plans, the Fordham Institute finds only six have moved away from percent proficient as their main measure of academic achievement. In fact, the Foundation for Excellence in Education is encouraging states to stay the course with percent proficient, arguing that it is an indicator that students will be on track for college or career success. While I agree with them that proficiency for an individual student is not a useless measure, it is an awful measure for evaluating whole schools.

Sticking with percent proficient is a terrible mistake that will doom states to many of the same issues they had under NCLB. I implore states that are still finalizing their ESSA accountability systems to learn from the past and choose better measures of school performance. Specifically, I make the following two recommendations:

  • No state should use “percent proficient” as a measure of academic achievement; all should use a performance index with a minimum of four levels for their status-based performance measures. The more levels in the index, the better it will be at accurately representing the average achievement of students in the school. States can continue reporting percent proficient on the side if compelled.
  • States should place as much emphasis as possible on measures of student growth to draw as much attention as possible to schools that are most in need of improvement. Growth measures at least attempt to estimate the actual impact of schools on students; status measures do not. From among the array of growth measures, I recommend true value-added models or student growth percentiles (though I prefer value-added models for reasons described here). These are much better choices than “growth-to-proficiency” models, which do not estimate the impact of schools and again mostly measure who is enrolled.
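
As a rough illustration of the first recommendation, the sketch below compares percent proficient with a four-level performance index. The level cuts, the points awarded per level, and the student scores are all assumptions chosen for illustration; actual state indexes define their own levels and weights.

```python
from bisect import bisect_right

# Hypothetical cut scores separating four performance levels, and the index
# points awarded at each level. Both are assumptions for illustration.
LEVEL_CUTS = [275, 300, 325]       # below basic | basic | proficient | advanced
LEVEL_POINTS = [0, 50, 100, 125]


def performance_index(scores, cuts=LEVEL_CUTS, points=LEVEL_POINTS):
    """Average the points each student earns for the level they reach."""
    earned = [points[bisect_right(cuts, score)] for score in scores]
    return sum(earned) / len(earned)


def percent_proficient(scores, cut=300):
    """Share of students at or above the (assumed) proficiency cut, in percent."""
    return 100.0 * sum(score >= cut for score in scores) / len(scores)


# Every student improves, but nobody crosses the proficiency cut of 300.
before = [260, 270, 305, 310]
after = [280, 290, 330, 335]

print(percent_proficient(before), percent_proficient(after))
print(performance_index(before), performance_index(after))
```

In this toy case percent proficient does not move, while the index rises because it credits movement across every level boundary. Adding more levels would let the index track average achievement more closely still.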

While both EdTrust and the Foundation for Excellence in Education recommend growth-to-proficiency measures, these are perhaps acceptable for individual students; as measures of school performance, they clearly do not approximate schools’ impacts.

Overall, the evidence on these issues is overwhelming. Educators and policymakers have complained about NCLB and “percent proficient” for as long as the policy has existed. With this evidence, and with the newfound flexibility under ESSA, there is no reason for any state to continue using percent proficient as a measure of school performance. Doing so in spite of our past experience all but ensures that many of NCLB’s worst problems will persist through the ESSA era.

Should California’s New Accountability Model Set the Bar for Other States?

This is a repost of something I published previously on the C-SAIL blog and at FutureEd.

California has released a pilot version of its long-awaited school and district performance dashboard under the federal Every Student Succeeds Act. The dashboard takes a dramatically different approach from prior accountability systems, signaling a sharp break with both the No Child Left Behind era and California’s past approach.

Not surprisingly, given the contentiousness of measuring school performance, it has drawn criticism (too many measures, a lack of clear goals and targets, the possibility for schools to receive high scores even with underperforming student groups) and praise (a fairer and more accurate summary of school performance, a reduced reliance on test scores).

I’m not exactly a neutral observer. Over the past year and a half, I played a role in the development of the dashboard as part of the state superintendent’s Accountability and Continuous Improvement Task Force, an advisory group that put forward many of the features in the new dashboard. In my view, both the dashboard’s supporters and its opponents are right.

The dashboard is clearly an intentional response to previous accountability systems’ perceived shortcomings in at least four ways:

  • California officials felt state accountability systems focused excessively on test scores under NCLB, to the neglect of other measures of school performance. In response, the new dashboard includes a wider array of non-test measures, such as chronic absenteeism and suspension rates.
  • There was a widespread, well-justified concern that prior accountability measures based primarily on achievement levels (proficiency rates) unfairly penalized schools serving more disadvantaged students and failed to reward schools for strong test score growth. (See a previous post for more on this.) In response, the new dashboard includes both achievement status and growth in its performance measures. And the state uses a more nuanced measure of status rather than merely the percent of students who are proficient.
  • California’s previous metric, the Academic Performance Index, boiled down a school’s performance to a single number on a 200-to-1000 scale. Officials creating the state’s new system believed this to be an unhelpful way to think about a school’s performance. In response, the new system offers dozens of scores but no summative rating.
  • There was near unanimity among the task force members (myself excluded), the State Board of Education, and the California Department of Education that NCLB-era accountability systems were excessively punitive, and that the focus should instead be on “continuous improvement,” rather than “test-and-punish.” As a result, the new California system is nearly silent on actual consequences for schools that don’t meet expectations.

For my money, the pilot dashboard has several valuable features. The most important of these is the focus on multiple measures of school performance. Test scores are important and should play a central role, but schools do much more than teach kids content, and we should start designing our measurement systems to be more in line with what we want schools to be doing. The pilot also rightly places increased emphasis on how much students learn over the course of a school year, regardless of where they start the year on the achievement spectrum. Finally, I appreciate that the state is laying out a theory of action for how California will improve its schools and aligning the various components of its policy systems with this theory.

Still, I have concerns about some of the choices made in the creation of the dashboard.

Most importantly, consequential accountability was left out of the task force conversation entirely. We were essentially silent on the important question of what to do when schools fail their students.

And while consequences for underperforming schools were a topic of discussion at the State Board of Education—and thus I am confident that the state will comply with federal and state law about identifying and intervening in the lowest-performing schools—I am skeptical that the state will truly hold persistently underperforming schools accountable in a meaningful way (e.g., through closure, staffing changes, charter conversion, or other consequences other than “more support”). The new dashboard does not even allow stakeholders to directly compare the performance of schools, diminishing any potential public accountability.

It was a poor decision to roll out a pilot system that makes it essentially impossible to compare schools. Parents want these tools specifically for their comparative purposes, so to not even allow this functionality is a mistake. Some organizations, such as EdSource, have stepped in to fill this gap, but the state should have led this from the start.

And while the state does have a tool that allows for some comparisons within school districts, it is clunky and cumbersome compared to information systems in other states and districts available today. The most appropriate comparison tool might be a sort of similar-schools index that compares each school to other schools with similar demographics (the state used to have this). I understand the state has plans to address this issue when it revises the system; making these comparison tools clear and usable is essential.

Also, while I understand the rationale for not offering a single summative score for a school, I think that some level of aggregation would improve on what’s available now. For example, overall scores for growth and performance level might be useful, in addition to an overall attendance/culture/climate score. The correct number of scores may not be one (though there is some research suggesting that is a more effective approach), but it is unlikely to be dozens.

Finally, the website (which, again, is a pilot) is simply missing the supporting documents needed for a parent to meaningfully engage with and make sense of the data. There is a short video and PDF guide, but these are not adequate. There is also a lack of quality translation (Google Translate is available, but the state should invest in high-quality translations given the diversity of California’s citizens). Presumably the documentation will be improved for the final version.

These lessons focus primarily on the transparency of the systems, but this is just one of several principles that states should attend to (which I have offered previously): Accountability systems should actually measure school effectiveness, not just test scores. They should be transparent, allowing stakeholders to make sense of the metrics. And they should be fair, not penalizing schools for factors outside their control. As states, including California, work to create new accountability systems under ESSA, they should use these principles to guide their decisions.

It is critically important to give states some leeway as they make decisions about accountability under ESSA, to allow them to develop their own theory of action and to innovate within the confines of what’s allowed by law. I am pleased that California has put forward a clear theory of action and is employing a wide array of measures to gauge school effectiveness. However, when the dashboard tells us that a school is persistently failing to improve outcomes for its children, I hope the state is serious about addressing that failure. Otherwise, I am skeptical that the dashboard will meaningfully change California’s dismal track record of educational outcomes for its children.

We need a little patience

In the last year I’ve been doing a lot more blogging, and it’s sometimes hard for me to keep track of everything I’ve written. So I’m going to start reposting things here, in order to keep track. This is a repost of something I wrote for Fordham and for C-SAIL last week. So if you read it there, no need to read again!

It’s 2017, which means we’re in year six of the Common Core experiment. The big question that everyone wants the answer to is “Is Common Core working?” Many states seem poised to move in a new direction, especially with a new administration in Washington, and research evidence could play an instrumental role in helping states make the decision of whether to keep the standards, revise them, or replace them altogether. (Of course, it might also be that policymakers’ views on the standards are impervious to evidence.)

To my knowledge, there are two existing studies that try to assess Common Core’s impact on student achievement, both by Tom Loveless. They compare state NAEP gains between Common Core adopting and non-adopting states or compare states based on an index of the quality of their implementation of the standards. Both studies find, in essence, no effects of the standards, and the media have covered these studies using that angle. The C-SAIL project, on which I am co-principal investigator, is also considering a related question (in our case, we are asking about the impact of college- and career-readiness standards in general, including, but not limited to, the Common Core standards).

There are many challenges with doing this kind of research. A few of the most serious are:

  1. The need to use sophisticated quasi-experimental methods, since experimental methods are not available.
  2. The limited array of outcome variables available, since NAEP (which is not perfectly aligned to the Common Core) is really the only assessment that has the national comparability required and many college and career outcomes are difficult to measure.
  3. The fact that the timing of policy implementation is not clear when states varied so much in the timing of related policies like assessment and textbook adoptions.

Thus, it is not obvious when the right time to evaluate the policy will be, or which outcomes to use.

Policymakers want to effect positive change through policy, and they often need to make decisions on a short cycle—after all, they often make promises in their elections, and it behooves them to show evidence that their chosen policies are working in advance of the next round of elections. The consequence is that there is a high demand for rapid evidence about policy effects, and the early evidence often contributes overwhelmingly to shaping the narrative about whether policies are working or not.

Unfortunately, there are more than a handful of examples where the early evidence on a policy turned out to be misleading, or where a policy seemed to have delayed effects. For example, the Gates Foundation’s small school reforms were widely panned as a flop in early reviews relying on student test scores, but a number of later rigorous studies showed (sometimes substantial) positive effects on outcomes such as graduation and college enrollment. It was too late, however—the initiative had already been scrapped by the time the positive evidence started rolling in.

No Child Left Behind acquired quite a negative reputation over its first half dozen years of implementation. Its accountability policies were seen as poorly targeted (they were), and it was labeled as encouraging an array of negative unintended consequences. These views quickly became well established among both researchers and policymakers. And yet, a series of recent studies have shown meaningful effects of the law on student achievement, which has done precisely zero to change public perception.

There are all manner of policies that may fit into this category to a greater or lesser extent. A state capacity building and technical assistance policy implemented in California was shelved after a few years, but evaluations found the policy improved student learning. Several school choice studies have found null or modest effects on test scores only to turn up impacts on longer-term outcomes like graduation. Even School Improvement Grants and other turnaround strategies may qualify in this category—though the recent impact evaluation was neutral, several studies have found positive effects and many have found impacts that grow as the years progress (suggesting that longer-term evaluations may yet show effects).

How does this all relate back to Common Core and other college- and career-readiness standards? There are implications for both researchers and policymakers.

For researchers, these patterns suggest that great care needs to be taken in interpreting and presenting the results of research conducted early in the implementation of Common Core and other policies. This is not to say that researchers should not investigate the early effects of policies, but rather that they should be appropriately cautious in describing what their work means. Early impact studies will virtually never provide the “final answer” as to the effectiveness of any given policy, and researchers should explicitly caution against the interpretation of their work as such.

For policymakers, there are at least two implications. First, when creating new policies, policymakers should think about both short- and long-term outcomes that are desired. Then, they should build into the law ample time before such outcomes can be observed (i.e., ensuring that decisions are not made before the law can have its intended effects). Even if this time is not explicitly built into the policy cycle, policymakers should at least be aware of these issues and adopt a stance of patience toward policy revisions. Second, to the extent that policies build in funds or plans for evaluation, these plans should include both short- and long-term evaluations.

Clearly, these suggestions run counter to prevailing preferences for immediate gratification in policymaking, but they are essential if we are to see sustained improvement in education. At a minimum, this approach might keep us from declaring failure too soon on policies that may well turn out to be successful. Since improvement through policy is almost always a process of incremental progress, failing to learn all the lessons of new policies may hamstring our efforts to develop better policies later. Finally, jumping around from policy to policy likely contributes to reform fatigue among educators, which may even undermine the success of future unrelated policies. In short, regardless of your particular policy preferences, there is good reason to move on from the “shiny object” approach to education policy and focus instead on giving old and seemingly dull objects a chance to demonstrate their worth before throwing them in the policy landfill.

New evidence that textbooks matter

It’s been six months since I’ve written here. My apologies. In the meantime I’ve written a few pieces elsewhere, such as:

  • Here and here on the problems of “percent proficient” as a measure of school performance. The feds seem to have listened to our open letter, as they are allowing states to use performance indices (and perhaps some transformation of scale scores, though there seems to be disagreement on this point) in school accountability.
  • Here and here on public opinion on education policy and an agenda for the incoming administration (admittedly, written when I thought the incoming administration would be somewhat different than the one that’s shaping up).
  • Here describing just how “common” Common Core states’ standards are.
  • Here discussing challenges with state testing and a path forward.

The main project on which I continue to work, however, is the textbook research. We are out with our first working paper (a version of which was just recently accepted for publication in AERA Open), and a corresponding brief through Brookings’ Evidence Speaks series (on which I am now a contributor).

You should check out the brief and the paper, but the short version of the findings is that we once again identify one textbook–Houghton Mifflin California Math–as producing larger achievement gains than the other most commonly adopted textbooks in California during the period 2008-2013. These gains are in the range .05 to .10 standard deviations, and they persist across multiple grades and years (ours is the longest study we are aware of on this topic). The gains may seem modest, but it is important to remember that they accrue to all students in these grades. Thus, for another policy that focuses only on low-achieving students to achieve the same total achievement effect, the impact would have to be much larger. And of course, as we’ve written elsewhere, the marginal cost of choosing this particular textbook over any other is close to zero (though we actually could not find price lists for the books under study, we know this to be true).
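
The comparison with targeted policies comes down to simple arithmetic, sketched below. The assumption that a targeted policy reaches 25 percent of students is hypothetical, as is the simplification that total effect equals effect per student times the share of students reached; only the .05 to .10 range comes from our findings.

```python
def required_targeted_effect(universal_effect_sd, share_targeted):
    """Effect size a targeted policy needs to match a universal one in total.

    Total effect is approximated as (effect per student) x (share of students
    reached), so the targeted effect must be universal_effect / share.
    """
    if not 0 < share_targeted <= 1:
        raise ValueError("share_targeted must be in (0, 1]")
    return universal_effect_sd / share_targeted


# Hypothetical: the targeted policy reaches only a quarter of students.
for effect in (0.05, 0.10):
    needed = required_targeted_effect(effect, share_targeted=0.25)
    print(f"universal {effect:.2f} SD -> targeted policy needs {needed:.2f} SD")
```

Under that assumption, a policy reaching a quarter of students would need effects of roughly .20 to .40 standard deviations to match the same total, which is large by the standards of most education interventions.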

We are excited to have the paper out there after years (literally) of work just pulling the data together. I also presented the results in Sacramento and am optimistic that states may start to listen to the steadily growing drumbeat on the importance of collecting and analyzing data on textbook adoptions.