SOS scoring ‘briefing notes’? College of Europe Bruges tried D-PAC!

In recent months, several D-PAC try-outs have taken place. In these try-outs, assessments are set up in diverse organisations. For the organisations, the aim is to experiment with D-PAC. For us as a team, the try-outs are valuable for gaining information on different aspects of D-PAC: the user-friendliness of the tool, how it can be embedded in real-life situations, and how information from D-PAC is used.
A few weeks ago, a try-out was run at the College of Europe Bruges. A team of four lecturers used D-PAC to assess students' competences in writing 'briefing notes'. The try-out was especially interesting for the D-PAC team given the small number of assessors and the fact that the rank order would be used to determine students' final mark on a course. The reliability of the rank order was sufficient at 0.71 (see Table 1).

Table 1: General statistics

Number of representations: 84
Number of comparisons: 620
Number of assessors: 4
Time per comparison: 373 seconds
Reliability: 0.71
Misfitting judges: 0
Misfitting representations: 2

During the try-out, one assessor fell behind with the comparisons that had to be made. At that point, we noticed that the reliability of the rank order was already sufficient after 510 comparisons. The reliability did not increase when the 100 comparisons of this specific assessor were added. Curious why this was, we investigated how the reliability progressed over time. Figure 1 shows our findings, suggesting that for this specific try-out, 10 comparisons per representation were needed to reach a reliability of 0.70. Moreover, 0.70 turned out to be a threshold that was difficult to cross.
Figure 1: Progress of reliability over time

Additionally, the team was interested in how the lecturers used the rank order to determine the final marks. The head lecturer explained that they discussed the first and the last representation in the rank order (see Figure 2) and decided what score was appropriate for each (8/20 for the last one and 18/20 for the first one). They then scored the rest of the representations following the rank order, in intervals of 0.5 points.
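The scoring procedure the lecturers described can be sketched in code. This is one possible reading, not their exact method: with 84 representations and only 21 half-point values between 8/20 and 18/20, several representations must have shared a mark, so the sketch below (function name and interpolation scheme are our assumptions) interpolates linearly between the anchor marks and rounds to the nearest half point.

```python
def marks_from_rank_order(ranked_ids, top_mark=18.0, bottom_mark=8.0):
    """Hypothetical sketch: assign marks by walking down the rank order,
    interpolating linearly between the mark agreed for the first
    representation and the mark agreed for the last one, then rounding
    to half-point granularity."""
    n = len(ranked_ids)
    step = (top_mark - bottom_mark) / (n - 1)
    marks = {}
    for position, rep_id in enumerate(ranked_ids):
        raw = top_mark - position * step
        marks[rep_id] = round(raw * 2) / 2  # nearest 0.5 point
    return marks
```

Under this reading, the anchors stay exactly at 18/20 and 8/20 and every other mark follows the rank order in half-point steps.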

Figure 2: Rank order College of Europe Bruges

When asked about their experiences with D-PAC, the teachers of the College of Europe Bruges were very positive. D-PAC was perceived as clear and easy to use. According to the teachers, the method of comparative judgement was straightforward and appropriate for their assessment task. However, they felt the need to provide more information than the tool allowed, and suggested adding a pass/fail (or 'very good'/'very bad') button and an "I cannot choose!" button.

The teachers perceived the time investment of assessing via D-PAC as more or less the same as in previous assessments using other methods. However, the time spent in D-PAC was considered better value, given that it produced a reliable rank order.

Altogether, when asked whether they would use D-PAC again for similar assessments in the future, the teachers all agreed: "Yes!". To conclude: the try-out partnering with the College of Europe Bruges turned out to be fruitful, both in terms of research findings and in terms of rolling out D-PAC in practice!

D-PAC presentation and workshop at ORD 2016 Rotterdam

The D-PAC team is proud to announce the following presentation and workshop:

Opleiden – presentation & assignment

High-quality assessment and feedback: does a comparative approach offer a solution?

by Marije Lesterhuis, Universiteit Antwerpen

Assessing student papers is no small task. Recently, pairwise comparison (comparative judgement) has been introduced as an alternative form of assessment. In this method, multiple assessors compare multiple pairs of papers and indicate each time which paper is the better one. Although the method has repeatedly proven its value for summative testing, little is yet known about the quality of the feedback generated with it. In this session we will discuss how pairwise comparison works, and we invite the participants to reflect in small groups on the quality of the feedback.

For the full programme, click here.


D-PAC successfully handles video material at large scale

A first pairwise comparison experiment with video material in D-PAC has been successfully completed. The goal of the experiment was twofold: (1) to test the scalability of the tool using videos; and (2) to test the inter-rater reliability.

A group of 134 students in Education Sciences judged 9 clips on the quality of the simulated semi-structured research interview demonstrated in each. The pairwise comparisons were all scheduled synchronously: in total, 134 assessors interacted simultaneously with the D-PAC system, which streamed video clips to them. No technological issues arose during the experiment, leading to a very positive conclusion on the scalability of the D-PAC tool.

To test the inter-rater reliability, the assessors were split into three random groups of 46, 44 and 44 assessors. All groups assessed the videos comparatively; the only difference between the groups concerned the feedback provided after each comparison. Group 1 was not instructed to give any argumentation or feedback during the process. Group 2 was asked to give a short overall argumentation for their choice after each comparison. Group 3 was asked to write down some positive and negative features of each interview after each comparison. The groups made 520, 354 and 351 comparisons, respectively.

Based on the pairwise comparison data, we calculated the Scale Separation Reliability separately for each of the three groups of assessors. The results, given in Table 1, show that the reliabilities are high (.91–.93).

Table 1. Scale separation reliability and average number of comparisons per video

          Scale Separation Reliability   Average number of comparisons per video
Group 1   .93                            104
Group 2   .93                            79
Group 3   .91                            78
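The Scale Separation Reliability reported in Table 1 can be computed from the estimated abilities and their standard errors. The sketch below uses a common formulation from the comparative-judgement literature (observed variance minus mean squared error, as a share of observed variance); the exact estimator D-PAC uses may differ, and the function name is our own.

```python
import statistics

def scale_separation_reliability(abilities, standard_errors):
    """SSR sketch: the proportion of observed variance in the estimated
    abilities that is not attributable to measurement error."""
    observed_var = statistics.pvariance(abilities)
    error_var = statistics.fmean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var
```

On synthetic inputs, five abilities spread from -2 to 2 with standard errors of 0.5 give an SSR of 0.875, illustrating how tight standard errors relative to the ability spread push the reliability toward 1.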


To answer the question of inter-rater reliability, we calculated the correlations between the estimated abilities (based on the Bradley-Terry-Luce model) from each of the three assessments (see Table 2). The Spearman rank correlation between the two assessments in which assessors provided an argumentation (Group 2) or feedback (Group 3) is the highest (.87). The Spearman rank correlations between the scores resulting from the assessment without any argumentation (Group 1) and the two other conditions are somewhat smaller (.82 and .84). Overall, these correlations are high.


Table 2. Spearman Rank Correlations between scores coming from the 3 groups of assessors

          Group 1   Group 2   Group 3
Group 1   1
Group 2   .82       1
Group 3   .84       .87       1
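With only 9 clips per group, the Spearman rank correlations in Table 2 can be computed by hand from the two ability orderings. A minimal sketch, assuming no tied scores (which the classic d-squared shortcut requires):

```python
def ranks(values):
    """0-based rank position of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman rank correlation via the classic shortcut
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Two identical orderings give rho = 1, fully reversed orderings give rho = -1; the .82-.87 values above indicate the three groups ranked the clips very similarly.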


Given that each of the 36 possible pairs was assessed by multiple assessors within and between the three groups, we were able to calculate the agreement between assessors for each possible pair. In Figure 1, the agreement is plotted per pair, split up for the three groups of assessors. As shown, the average agreement in each group is around 77%. For some pairs the agreement is only 50%; for others it is 100%. These differences can, of course, be partially attributed to the fact that some pairs are more difficult to judge than others. Comparing the results of the three groups showed no significant differences.

Figure 1: Agreement per pair, split up for the three groups of assessors
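The per-pair agreement described above can be computed directly from the raw judgements. A minimal sketch (the tuple format and function name are our assumptions, not D-PAC's data model): for each pair, agreement is the share of judges who picked that pair's majority winner, so a 50/50 split yields 0.5 and unanimity yields 1.0.

```python
from collections import Counter, defaultdict

def agreement_per_pair(judgements):
    """judgements: iterable of (clip_a, clip_b, winner) tuples.
    Returns, per unordered pair, the fraction of judges who chose
    the pair's majority winner."""
    winners = defaultdict(Counter)
    for a, b, winner in judgements:
        winners[frozenset((a, b))][winner] += 1
    return {tuple(sorted(pair)): max(c.values()) / sum(c.values())
            for pair, c in winners.items()}
```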

To conclude, this pairwise comparison experiment demonstrates, first of all, the robustness of the tool in handling large numbers of assessors judging video clips simultaneously. The resulting scales and pairwise comparison data also show that the inter-rater reliability is rather high.

D-PAC field trials

We at D-PAC are very happy to see that more and more organisations are interested in testing the open-source D-PAC platform. As you can see in the table below, we have run a variety of assessments in diverse applications in the education and HR sectors.

In each of these assessments, the organisation we work with gets the opportunity to find out, in a hands-on way, what assessment through Comparative Judgement can mean for them. The deal is that we provide the hosted software and advice on how to set it up, and run the system for free, while the organisation that hosts the assessment provides us with the research data we need to advance knowledge on Comparative Judgement and the tools that support it.

# | Competence | Domain | Assessees | Assessors | Reliability
1 | Argumentative writing | Education | 135 high-school students | 68 teachers and teachers in training | 0.81 (average)
2 | Writing formal letters | Education | 12 high-school students | 11 teachers in training | 0.68
3 | Mathematical problem solving | Education | 58 high-school students | 10 mathematics teachers + 4 mathematics teachers in training | 0.80 (average)
4 | Capability of visual representation in the arts domain | Education | 11 high-school students (147 representations) | 13 teachers | 0.85
5 | Interpreting statistical output using peer evaluation | Education | 44 Master students | 33 Master students | 0.80
6 | CV screening | Human resources | 42 candidates | 7 HR professionals | 0.88
7 | Self-reflections | Education | 22 Master students | 9 teachers | 0.75
8 | Project proposals | | 6 projects | 5 judges | 0.71

Besides the domains of education and human resources, we are also very curious to find out what other applications could exist for Comparative Judgement. For example, would it be useful for A/B testing in the marketing or IT-development domain, and would it be interesting as an alternative voting mechanism in, for example, TV shows?

To conclude, there could be a range of applications for the D-PAC tool that we have not yet thought of, so we are looking for ideas on how to expand its impact.
Feel free to contact us if you are interested in testing D-PAC in your own context; together we can find out how we can help each other.


The D-PAC team

Paper published

Performance assessments are often used to evaluate competences. But what is the best way to score such performance assessments? Using criteria lists? Holistically?

The article "Competenties kwaliteitsvol beoordelen: brengt een comparatieve aanpak soelaas?" ("Assessing competences with quality: does a comparative approach offer a solution?"), published in the journal 'Tijdschrift voor Hoger Onderwijs', proposes comparative judgement as a method for assessing performance assessments in a higher-education context.