More evidence that the test matters

Well, it’s been two months since my last post. In those two months, a lot has happened. I’ve continued digging into the textbook adoption data (this was covered on an EdWeek blog and I also wrote about it for Brookings). Fordham also released their study of the content and quality of next-generation assessments, on which I was a co-author (see my parting thoughts here). Finally, just last week I was granted tenure at USC. So I’ve been busy and haven’t written here as much as I should.

Today I’m writing about a new article of mine that’s just coming out in Educational Assessment (if you want a copy, shoot me an email). This is the last article I’ll write using the Measures of Effective Teaching data (I previously wrote here and here using these data). This paper asks a very simple question: looking across the states in the MET sample, is there evidence that the correlations of observational and student survey measures with teacher value-added vary systematically? In other words, are the tests used in these states differentially sensitive to these measures of instructional quality?

This is an important question for many reasons. Most obviously, we are using both value-added scores and instructional quality measures (observations, surveys) for an increasingly wide array of decisions, both high- and low-stakes. For any kind of decision we want to make, we want to be able to confidently say that the assessments used for value-added are sensitive to the kinds of instructional practices we think of as being “high quality.” Otherwise, for instance, it is hard to imagine how teachers could be expected to improve their value-added through professional development opportunities (i.e., if no observed instructional measures predict value-added, how can we expect teachers to improve their value-added?). The work is also important because, to the extent that we see a great deal of variation across states/tests in sensitivity to instruction, it may necessitate greater attention to the assessments themselves in both research and policy [1]. As I argue in the paper, the MET data are very well suited to this kind of analysis, because there were no stakes (and thus limited potential for gaming).

The methods for investigating the question are very straightforward: basically, I just correlate or regress value-added estimates from the MET study on teacher observation scores and student survey scores, separately by state. Where I find limited or no evidence of relationships, I dig in further by doing things like pulling out outliers, exploring nonlinear relationships, and estimating relationships at the subscale or grade level.
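
For readers who like to see things in code, here is a minimal sketch of what that per-state analysis might look like. The file name and column names ("state", "vam", "obs_score", "survey_score") are hypothetical, and this is not the paper's actual code; it just illustrates correlating and regressing value-added on the instructional measures within each state.

```python
# Illustrative sketch only: hypothetical teacher-level file and column names,
# not the MET data's real structure or the paper's actual analysis code.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

df = pd.read_csv("met_teacher_level.csv")  # hypothetical file name

for state, grp in df.groupby("state"):
    # Correlate value-added with each instructional quality measure
    r_obs, p_obs = stats.pearsonr(grp["vam"], grp["obs_score"])
    r_srv, p_srv = stats.pearsonr(grp["vam"], grp["survey_score"])
    print(f"State {state}: r(VAM, obs) = {r_obs:.2f} (p = {p_obs:.3f}), "
          f"r(VAM, survey) = {r_srv:.2f} (p = {p_srv:.3f})")

    # Or regress value-added on an instructional measure within the state
    fit = smf.ols("vam ~ obs_score", data=grp).fit()
    print(fit.params)
```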

What I find, and how that should be interpreted, probably depends on where you sit. I do find at least some correlations of value-added with observations and student surveys in each state and subject. However, there is a good deal of state-to-state variation. For instance, in some states, student surveys correlate with value-added as high as .28 [2], while in other states those correlations are negative (though not significantly different from zero).

Analyzing results at the subscale level (where observational and survey scores are most likely to be useful) does not help. Perhaps because subscales are much less reliable than total scores, there are very few statistically significant correlations of subscales with VAM scores, and these too differ by state. If this pattern were to hold in new teacher evaluation systems being implemented in the states, it would raise perplexing questions about what kinds of instruction these value-added scores were sensitive to.
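
As a rough illustration of why subscale scores are noisier, here is a toy Spearman-Brown calculation with made-up numbers; it shows how combining several parallel subscales into a total score boosts reliability, consistent with the "subscales are less reliable" explanation above.

```python
# Toy Spearman-Brown calculation (made-up numbers): combining k parallel
# subscales into a total score raises reliability, one reason subscale
# correlations with VAM scores would be weaker than total-score correlations.
rho_sub = 0.45   # hypothetical reliability of a single subscale
k = 4            # hypothetical number of parallel subscales in the total score

rho_total = k * rho_sub / (1 + (k - 1) * rho_sub)
print(f"reliability of the {k}-subscale total score ≈ {rho_total:.2f}")  # ≈ 0.77
```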

Perhaps the worst offender in my data is state 4 in English language arts (I cannot name states due to data restrictions). For this state, there are no total-score correlations of student surveys or any of the observational measures with teacher value-added. There is one statistically significant correlation at a single grade level, and there is also one statistically significant correlation for a single subscale on one observational instrument. But otherwise, this state's ELA tests seem to be totally insensitive to instructional quality as measured by the Framework for Teaching, the CLASS, and the ELA-specific PLATO (not to mention the Tripod student survey). Certainly it's possible these tests could be sensitive to some other measures not included in MET, but it's not obvious to me what those would be (nor is it obvious that new state systems will be implemented as carefully as MET was).

I conclude with extended implications for research and practice. I think this kind of work raises a number of questions, such as:

  1. What is it about the content of these tests that makes some sensitive and others not?
  2. What kind of instruction do we want our tests to be sensitive to?
  3. How sensitive is “sensitive enough”? That is, how big a correlation do we want or need between value-added and instructional measures?
  4. If we want to provide useful feedback to teachers, we need reliable subscores on observational measures. How can we achieve that?

I enjoyed writing this article, and I believe it may well be the paper that took me the longest from start to submission. I hope you find it useful and that it raises additional questions about teacher evaluation moving forward. And I welcome your reactions (though I'm done with MET data, so if you want more analysis, I'm not your man)!

[1] The oversimplified but not-too-far-off summary of most value-added research is that it is almost completely agnostic to the test that’s used to calculate the VAM.

[2] I did not correct the correlations for measurement error, in contrast to the main MET reports.
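
(By "correct" I mean the usual correction for attenuation, which divides an observed correlation by the square root of the product of the two measures' reliabilities. A toy example, with made-up reliability values:)

```python
# Toy correction for attenuation (made-up reliability values, illustration only)
observed_r = 0.28   # an observed survey/VAM correlation like the one above
rel_survey = 0.85   # hypothetical reliability of the student survey composite
rel_vam = 0.40      # hypothetical reliability of the value-added estimate

corrected_r = observed_r / (rel_survey * rel_vam) ** 0.5
print(f"disattenuated r ≈ {corrected_r:.2f}")  # ≈ 0.48 with these inputs
```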
