by Hannah Gurr
It was the first week of June, and I’d just finished what I consider to be the least enjoyable part of my job: marking a pile of end-of-course writing produced under exam conditions. While I really get a buzz from using classroom techniques of formative assessment (also known as ‘assessment for learning’), I find these hours of summative assessment (‘assessment of learning’) quite gruelling, although of course I accept that it’s essential. End-of-course assessment provides the purpose for the majority of what I do in and out of the classroom, and at CELFS we take pains to ensure that intended learning outcomes, lesson activity, success criteria and assessment are constructively aligned.
Perhaps it’s the lack of human contact, as the exam scripts are anonymised, precisely to prevent one’s feelings about the student from influencing decisions about whether to give a ‘good’ or ‘satisfactory’ grade on a particular criterion. My ‘teacherly’ instincts are frustrated, as I can’t use this student’s errors as a springboard for further development. Then there is the worry over standardisation: not only is there the concern that I am not in line with my colleagues, but with such a lot of marking, I wonder if I’m even applying my own judgements consistently!
Having just come to the end of the ‘Language Testing and Assessment’ unit in my MSc TESOL course, I have a more realistic view of what it means to assess students’ academic speaking and writing skills on our courses and I see what can be done to compensate for the fact that even expert human judgement is fallible.*
Firstly, I realise that it would be bizarre if the diverse group of teachers who make up a particular year’s pre-sessional tutors all gave the same piece of writing the same grade. I used to worry that if my evaluation were far from the ‘official’ grade, my job could be at risk. Now I understand that this variation is perfectly normal. Secondly, I realise that the purpose of the standardisation meeting is not to defend the grade I gave, but to re-calibrate my own scale, so that there is less variation between tutors. That’s why I feel that standardisation should come before teaching starts, and again before marking begins.
Although we are a diverse bunch, we are still experts in assessing writing. This means that there will be a broad overlap between the aspects we think are better and those we deem to be worse. However, I might be measuring this student against a Platonic ideal – a distinction-grade, native-proficiency performance – and finding it wanting, whereas my neighbour is thinking, ‘wow, if I had to write something in my second language, I’d be proud if it were as good as this!’ We need to leave our egos at the door, and accept that for the next few weeks this is what we are going to label ‘good’, ‘very good’ or ‘satisfactory’.
It’s true that double marking creates more work, but in the past weeks I’ve seen again how important it is. It’s great when my colleagues and I agree, or are just one point out, and it’s reassuring to re-examine the script when we find we are a whole band apart. I’ve been persuaded by my colleagues, but have also stuck to my guns on certain judgements, and convinced them in turn. Details I’ve missed are spotted by my colleagues, and vice versa. Excesses in both directions – soaring into the 80s band just because the handwriting is neat, or being overly hawkish because a student has included one of my bugbears – are tempered by marking with a peer.
Finally, after haggling over all those points, there is the spectre of input error. You may have spent five minutes debating whether the SAQ [short answer question] merited a 65 or a 62, only to misread or mistype so that the student gets his neighbour’s 55. Again, input error is greatly reduced if you work with a colleague.
I’d be very interested to hear your thoughts on marking final assessments, whether you share(d) my concerns and/or have any ideas about how to overcome the variation in judgement that is a normal part of being human.
* An example mentioned on QI and cited in Daniel Kahneman’s book Thinking, Fast and Slow (2011) is a study published in the Proceedings of the National Academy of Sciences, which found that judges were more likely to award parole in cases they heard immediately after taking a meal break.