
Reliability & efficiency

Oranges and Apples? Using Comparative Judgment for reliable briefing paper assessment in simulation games

Pierpaolo Settembri, Roos Van Gasse, Liesje Coertjens and Sven De Maeyer


Submitted as a chapter in an edited volume on EU simulations



Achieving a fair and rigorous assessment of participants in simulation games represents a major challenge. The difficulty applies not only to the actual negotiation, but also to the written assignments that typically accompany a simulation. For one thing, if different raters are involved, it is important to ensure that differences in severity do not affect the grades. Recently, Comparative Judgement (CJ) has been introduced as a method that allows for team-based grading. This chapter discusses in particular the potential of CJ for assessing briefing papers from 84 students. Four assessors completed 622 comparisons in the Digital Platform for the Assessment of Competences (D-PAC) tool. Results indicate a reliability level of .71 for the final rank order, which demanded a time investment of around ten and a half hours from the team of assessors. In addition, there was no evidence of bias towards the most important roles in the simulation game. The study also details how the obtained rank orders were translated into grades, ranging from 11 to 17 out of 20. These elements showcase CJ's advantage in reaching adequate reliability levels for briefing papers in an efficient manner.
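The chapter abstract does not specify how the rank order was converted into grades. As one hedged illustration only, the ability estimates underlying a comparative-judgement rank order could be rescaled linearly onto the observed grade range; the function below is a hypothetical mapping, not the chapter's actual procedure:

```python
import numpy as np

def abilities_to_grades(abilities, lo=11, hi=17):
    """Linearly rescale ability estimates onto a grade range.

    Illustrative assumption only: the chapter does not describe
    the actual rank-order-to-grade translation.
    """
    a = np.asarray(abilities, dtype=float)
    scaled = (a - a.min()) / (a.max() - a.min())   # normalise to [0, 1]
    return np.round(lo + scaled * (hi - lo)).astype(int)

# Five hypothetical ability estimates, lowest to highest
grades = abilities_to_grades([-1.8, -0.4, 0.0, 0.9, 2.1])
print(grades)  # [11 13 14 15 17]
```

Any monotone mapping would preserve the rank order; a linear one is simply the most transparent choice for illustration.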


Scale Separation Reliability in Comparative Judgement: What does it mean?

San Verhavert, Sven De Maeyer, Vincent Donche and Liesje Coertjens


Submitted to Applied Psychological Measurement




Comparative Judgement (CJ) is an alternative method for assessing competences, based on Thurstone's Law of Comparative Judgement. Assessors are asked to compare pairs of student work (representations) and judge which one is better with respect to a certain competence. These judgements are analysed using the Bradley-Terry-Luce model, resulting in a rank order of the representations. In this context, the Scale Separation Reliability (SSR), which originates from Rasch modelling, is typically used as the reliability measure. However, to our knowledge, it has never been systematically investigated whether the meaning of the SSR can be transferred from Rasch to CJ. As the meaning of reliability is an important question for both assessment theory and practice, the current study looks into this. A meta-analysis was performed on 26 CJ assessments. For every assessment, split halves were constructed based on assessors. The rank orders of the whole assessment and of the halves were correlated and compared with SSR values using Bland-Altman plots. The correlation between the two halves of an assessment was compared with the SSR of the whole assessment, showing that the SSR is a good measure of split-half reliability. Comparing the SSR of one of the halves with the correlation between the two respective halves showed that the SSR can be interpreted as an inter-rater correlation. Regarding the SSR as expressing a correlation with the truth, the SSR of either half is a good estimate of the correlation between the whole assessment and the respective half. It should be noted that results are mixed for all types of reliability.
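As a simplified sketch of the modelling step described above, the code below fits the Bradley-Terry-Luce model with the classic minorization-maximization (MM) update and computes the SSR as "true" variance over observed variance. This is a textbook implementation on toy data, not the analysis code used in the study:

```python
import numpy as np

def fit_btl(wins, n_iter=500):
    """Fit Bradley-Terry-Luce abilities with the classic MM algorithm.

    wins[i, j] = number of times representation i was judged better than j.
    Returns log-abilities (theta) and their approximate standard errors.
    """
    n = wins.shape[0]
    games = wins + wins.T                 # total comparisons per pair
    p = np.ones(n)                        # abilities on the ratio scale
    for _ in range(n_iter):
        for i in range(n):
            p[i] = wins[i].sum() / np.sum(games[i] / (p[i] + p))
        p /= np.exp(np.log(p).mean())     # fix the scale: geometric mean 1
    theta = np.log(p)
    # Fisher information of item i: sum_j n_ij * p_ij * (1 - p_ij)
    prob = p[:, None] / (p[:, None] + p[None, :])
    info = (games * prob * (1 - prob)).sum(axis=1)
    se = 1.0 / np.sqrt(info)
    return theta, se

def ssr(theta, se):
    """Scale Separation Reliability: 'true' variance over observed variance."""
    obs_var = theta.var()
    return (obs_var - np.mean(se ** 2)) / obs_var

# Toy data: representation 0 wins most comparisons, representation 2 loses most.
wins = np.array([[0, 3, 4],
                 [1, 0, 3],
                 [0, 1, 0]], dtype=float)
theta, se = fit_btl(wins)
print(np.argsort(-theta))   # rank order from best to worst: [0 1 2]
```

With so few comparisons the standard errors are large relative to the ability spread, so the SSR stays low; real assessments add comparisons until the SSR reaches an acceptable level.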


Assessing texts with criteria lists or through pairwise comparison: weighing reliability against time investment.

Liesje Coertjens, Marije Lesterhuis and Sven De Maeyer


Submitted to Pedagogische Studiën (Special Issue)




Assessing text quality reliably without spending an excessive amount of time on it is crucial for writing researchers and practitioners alike. In this study we examine two assessment methods: criteria lists, which take an analytic and absolute approach, and pairwise comparison, a method with a holistic and comparative design. For both methods we map how reliability changes as the group of assessors invests more time in the assessment. The results show that assessors differ in assessment speed. They also show that, when reliability is understood as a measure of the stability of the rank order, the pairwise comparison method requires less time investment than working with criteria lists. Follow-up research should preferably also take into account the time needed to develop a criteria list or to set up an assessment using pairwise comparison.


Identifying adaptive algorithms for increasing Comparative Judgement efficiency: a review study

San Verhavert, Vincent Donche, Sven De Maeyer and Liesje Coertjens


In preparation



Since the introduction of Comparative Judgement (CJ) into the domain of educational assessment in the 1990s, it has been observed that many comparisons are required to reach an acceptable level of reliability. This makes the method tedious and repetitive for assessors. Therefore, the need has recently been put forward for adaptive selection algorithms that increase CJ's efficiency without affecting its reliability.

We will present a first step in addressing this need. In a systematic review, we will identify adaptive algorithms from a broad range of research domains that can potentially increase the efficiency of CJ. As such, this review study might not only inform CJ research; it might also, for example, spur CAT research to develop new algorithms.

In a first part, a taxonomy of adaptiveness will be constructed. An exploratory review in the domain of Computer Adaptive Testing (CAT) reveals seven levels of distinction, of which the statistical paradigm (frequentist versus Bayesian), whether or not the algorithm is based on an information measure, and weighting versus balancing are the major ones. In a second part, we will attempt to review the efficiency of these algorithms in their respective domains, keeping the CJ context in mind.
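To make the kind of algorithm under review concrete, the sketch below shows one simple information-based adaptive pair-selection rule for a Bradley-Terry-style CJ assessment: choose the pair with the highest expected Fisher information (pairs of similar estimated ability), divided by an ad-hoc balancing penalty for pairs already compared. Both the rule and the penalty are illustrative assumptions, not algorithms identified by the review:

```python
import itertools
import numpy as np

def select_next_pair(theta, pair_counts):
    """Pick the next pair to compare under a Bradley-Terry model.

    The Fisher information of a single comparison is p * (1 - p) with
    p = sigmoid(theta_i - theta_j), so pairs with similar ability
    estimates are preferred. Dividing by (1 + count) is an ad-hoc
    balancing term to discourage repeating the same pair.
    pair_counts is keyed by (i, j) with i < j.
    """
    best, best_score = None, -np.inf
    for i, j in itertools.combinations(range(len(theta)), 2):
        p = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
        score = p * (1 - p) / (1 + pair_counts.get((i, j), 0))
        if score > best_score:
            best, best_score = (i, j), score
    return best

theta = [0.1, 2.0, 0.0, -1.5]
print(select_next_pair(theta, {}))            # (0, 2): closest abilities
print(select_next_pair(theta, {(0, 2): 10}))  # a less-seen pair instead
```

In practice such a rule would be interleaved with re-estimating the abilities after each batch of judgements; the taxonomy in the review distinguishes, among other things, exactly these information-based and balancing components.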