The school year is gaining steam and the testing madness is getting under way again. All across the country, districts already are assessing students, using both practice assessments to get them ready for the “real” tests this spring, and district assessments to see if they are on track for the practice assessments to get ready for the “real” tests this spring. I can’t help but think about the April 2014 Boston Globe article “Flunk the robo-graders” by Les Perelman and wish, once again, it would stop.
The issue? Robo-graders fail to score student essays proficiently; yet, the scores determine students’ proficiency levels and teachers’ evaluation scores and, in some states, teachers’ merit pay. If the robo-graders cannot score the tests properly, the test scores should not be used to evaluate anyone or anything. And, if Pearson and the other testing companies will not even allow access to, or “open-ended demonstrations of,” their robo-graders, states should not award them contracts. Period.
One month into my first teaching job 12 years ago, my district sent me to a conference led by a state trainer; she had been scoring PSSAs (the Pennsylvania assessments at the time), and Intermediate Units brought her in to lead conferences on scoring for new teachers, teachers in grades just beginning to be assessed, and so on. I guess the thinking was, if teachers learned what the scorers were looking for, they could teach their students to write proficient responses. And, if teachers knew what the scorers were looking for, they could score district and practice assessments more effectively. All of this meant that we were teaching to the test, of course, but nobody mentioned that.
Now, 12 years later, humans are being taken out of the grading equation, as Pearson and other testing companies roll out their robo-graders to remove one of the two human essay scorers. We already have taken good writing and grammar instruction out of schools and curriculum and replaced it with “formula writing” as we are pressured to teach kids to “beat the tests.” Now we are taking people who can read and communicate coherently out of scoring essays. In what world does any of this make sense?
In truth, the robo-graders are, most of the time, scoring student essays on little more than length and word usage. No English teacher worth his salt will tell students that it’s “quantity over quality” or to “just use a lot of big words,” and yet that is exactly what students are going to have to do in order to score well: “Robo-graders do not score by understanding meaning but almost solely by use of gross measures, especially length and the presence of pretentious language.” Perelman goes on to say, “Papers written under time pressure often have a significant correlation between length and score. Robo-graders are able to match human scores simply by over-valuing length compared to human readers. A much publicized study claimed that machines could match human readers. However, the machines accomplished this feat primarily by simply counting words.”
Need an example? Perelman gives a fantastic one in his opening: “‘According to professor of theory of knowledge Leon Trotsky, privacy is the most fundamental report of humankind. Radiation on advocates to an orator transmits gamma rays of parsimony to implode.'” Confused? So was I. That’s the point. This is gibberish. Yet, the robo-graders from Pearson would consider this “exceptionally good prose.”
But the problem isn’t just with Pearson. Three computer science students, two from MIT and one from Harvard, developed an app that generates gibberish essays; on “one of the major robo-graders, IntelliMetric,” the app “has consistently scored above the 90th percentile overall. In fact, IntelliMetric scored most of the incoherent essays they generated as ‘advanced’ in focus and meaning as well as in language use and style.” What are teachers, students, and districts to do when states are contracting with these companies and expecting students to score well?
The answer is the one I have been advocating for quite some time: Stop the tests. The list of reasons to discontinue the use of these high-stakes tests is growing (I can think of about a million), as researchers begin to determine the invalidity of the tests and the processes associated with them:
- The robo-graders are not capable of reliably scoring student essays.
- Dr. Walter Stroup determined, after analyzing every Texas student’s math score, that 72% of the test scores remained the same, regardless of the student’s grade or the subject being tested. He concluded that the tests do not actually measure what the kids learn in the classroom; rather, they test how well kids can take a test.
- Dr. Denny Way, senior vice president for measurement services at Pearson, publicly responded to Stroup’s finding by conceding that the tests are 50% “insensitive” to instruction; in other words, Pearson sells products knowing full well that they fail to measure half of what goes on in a classroom.
- In April 2014, the American Statistical Association condemned the use of student test scores to rate teacher performance because teachers account for only 1% to 14% of the variability in test scores.
No other industry continues to use products known to be ineffective and flawed. No other professionals are measured with such flawed testing materials and processes. No parent uses a product to measure any aspect of her child if she knows the results are unreliable. Would you use a bathroom scale or a thermometer if you knew it was broken?
So, why are we doing this to our students and our teachers? Why do states and districts continue to hand out money to Pearson and other testing companies when it is becoming all too clear that not one of their products is up to snuff?
The testing madness has to stop. Go to school board meetings. Email or call your state leaders. Email or call your national leaders. Visit UNITED OPT OUT. Vote for education this November. Stand up for our students and teachers.
Still not convinced? Read “More incoherent babble: Rating a generated essay.”