Last year I ran (unsuccessfully) for school committee in East Greenwich, largely motivated by what I saw as a rush to implement a highly questionable system of rating schools and teachers, a by-product of the "Race to the Top" funds RI was awarded by the Department of Education (DOE). My concern was that, at least with regard to individual teacher evaluations, the small sample sizes the ratings were based on would produce statistically unstable results, so that a teacher or administrator might receive a high rating one year and a low rating the next, or vice versa, more or less at random.
Last year three of our six schools (Meadowbrook, Cole, and EGHS) were ranked in the highest category (Commended, defined as "The strongest performance across all measures") and none were rated lower than Typical. This year, two Commended schools, EGHS and Meadowbrook, were ranked in the Warning category, while Cole slipped to Typical, and Eldredge improved to Leading. What is going on? I believe it is simply an artifact of the statistical instability of the classification process.
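How big can these chance swings be? Here is a sketch you can run yourself in R (the free statistical software I describe below). It is my own illustration, not RIDE's actual procedure, and it assumes a school of 50 tested students whose growth is exactly "typical": by construction, each student's growth percentile is then roughly a random draw from 1 to 99, and we can watch the school's median SGP wander over ten simulated years.

    # A minimal sketch, assuming a truly "typical" school of 50 students:
    # each student's growth percentile is an independent draw from 1-99.
    set.seed(42)                 # for reproducibility
    n_students <- 50             # tested students per year (an assumption)
    n_years <- 10
    median_sgp <- replicate(n_years,
                            median(sample(1:99, n_students, replace = TRUE)))
    round(median_sgp)            # one median SGP per simulated year
    range(median_sgp)            # the spread is typically well over ten points

Nothing about this imaginary school ever changes, yet its median SGP drifts up and down by ten points or more from year to year. With category cut-offs anywhere nearby, that is enough to move a school between ratings purely by chance.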
Here's a bit of background. The so-called "accountability" measures are based largely on "growth measures" for individual students, which are determined using the Rhode Island Growth Model, which in turn came from Colorado. States bidding for the DOE's Race to the Top (RTT) funds were required to come up with their own "accountability" measures on short notice. Colorado, one of nine states that participated in an early pilot program to develop such measures, had designed and implemented something called the "Student Growth Percentile" (SGP) model. While the people who put this together were well qualified, it was still an educated guess as to how to do it, a work in progress without a lot of field experience to demonstrate its effectiveness (which is why it was called a pilot program). Colorado did a particularly good job of documenting its system, and made its technical software freely available as a contributed package for the open-source statistical system R (you can download R, as well as the SGP package, for free and install them on your own computer if you like). Rather than duplicate Colorado's considerable scholarly and programming investment, other states bidding for RTT (Rhode Island and Massachusetts included) quickly adopted the Colorado system.
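If you would like to look under the hood yourself, the whole toolchain is a few commands away once R is installed:

    install.packages("SGP")    # download Colorado's SGP package from CRAN
    library(SGP)               # load it into your R session
    help(package = "SGP")      # browse its documentation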
Here is a link to a paper titled "Is Growth in Student Achievement Scale Dependent?" that was presented at the 2009 National Council on Measurement in Education conference. While the topic is only tangentially related to this discussion, the paper is notable because one of the authors, Damian Betebenner, is a principal architect of the Colorado growth model and the SGP package (a word of warning: this is a scientific research paper). The authors start with the observation that their work is motivated by two simple questions most parents ask:
1) How much has my child learned? and
2) Is the amount my child has learned good enough?
They follow this with a lucid description of how difficult it is to answer them well:
"While the questions may be intuitive, the psychometric and statistical gymnastics involved in coming up with a defensible answer are not. Indeed, the deceptively simple nature of the questions masks important conceptual and philosophical undercurrents. Learning of what? How should learning be measured? Can a single number capture this phenomenon? Who decides how much learning is good enough?".
There follows a good overview of two types of models, the so-called Value-Added Models (VAM) and growth models such as SGP. The fact that the paper discusses two different approaches reveals that, at the time of its writing (April 2009), the question of the best approach was not settled. Now fast forward to 2013, with VAM results in from Los Angeles and New York City. In May, Columbia University statistics professor Andrew Gelman wrote on his Monkey Cage blog:
"A few years ago, value-added assessment etc was considered the technocratic way to go, with opponents being a bunch of Luddite dead-enders. Now, though, the whole system is falling apart. We can learn a lot from tests, no doubt about that, but there’s a lot less sense that they should be used to directly evaluate teachers. We’ve moved to a more modern, quality-control perspective in which the goal is to learn and improve the system, not to reward or punish individual workers."
So it sounds like good news that RI chose SGP instead of VAM, but in the 2009 paper cited above, the authors compared school-level results for VAM and SGP and found that the two were highly correlated:
"In tables 4-6 the correlations of school-level estimates between the two models are .72, .84 and .91 for grades 4, 5 and 6 respectively."
Since the two models largely agree about which schools score high and which score low, SGP is not likely to fare much better than VAM in the long run. Finally, when the authors describe aggregating individual student SGPs into a school-level figure by taking the median, they add:
"We do not refer to school-level SGPs as value-added estimates for two reasons. First, no residual has been computed (though this could be done easily enough by subtracting the 50th percentile), and second, we wish to avoid the causal inference that high or low SGPs can be explained by high or low school quality(for details, see Betebenner, 2008)"
Gelman adds a comment at the end of his post regarding the shift away from using high-stakes tests in teacher evaluations:
"This shift may have not happened yet at the political level, but it’s my sense that this is the direction that things are going"
That may be changing. This week the U.S. House of Representatives passed its version of a No Child Left Behind renewal that removes the teacher evaluation mandate.