Which video is better? D-PAC allows educators to assess videos in an easy and credible way

Different types of media, such as video, audio, or images, are increasingly used to assess students’ competences. However, because such media allow for large variation in performance between students, grading them is difficult. The online tool D-PAC aims to support educators in the assessment of video and images.


In D-PAC, students can easily upload their work in any media type (text, audio, image, video), after which the work is presented in randomly selected pairs to the assessors. The only task for assessors is to choose which one of the two is best, using their own expertise. Assessors find it easy to make such comparative judgements because they are not forced to score each work on a (long) list of criteria. Each work is presented multiple times to multiple assessors, resulting in a scale on which students’ work is ranked according to its quality.


‘Working with D-PAC was really easy and fast.’

Ivan Waumans, KDG University College

Recently, D-PAC has been used in a Bachelor’s programme in Multimedia and Communication Technology for the assessment of students’ animation skills. Students received an audio fragment of a radio play by ‘Het Geluidshuis’ and had to accompany it with animation. A group of 9 assessors evaluated the quality of the animations. The assessors differed in background and expertise: 3 people from Het Geluidshuis, 2 expert animators, 2 alumni students, and 2 teachers.


For Ivan Waumans, coordinator of the course, working with D-PAC was really easy and fast. ‘About 2 hours after I sent the login information to the assessors I got an email from one of them saying: Done!’ Assessors valued that they could do the evaluations from their homes or offices. Some assessors did all the comparisons in one session, whereas others spread their comparisons over a few days. None of them had any trouble using or understanding D-PAC. The only difficulty the assessors experienced was when they had to choose between 2 videos of equal quality. Ivan had to reassure them that it was OK to just pick one of them, because the tool generates the same ability score for videos of equal quality. Ability scores represent the likelihood that a particular video will win against the others. Based upon these scores the tool provides a ranking in which videos are ordered from low to high quality.
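The scoring logic described above can be sketched with a minimal Bradley-Terry fit in plain Python. The function names, the MM update scheme, and the toy data below are illustrative assumptions, not D-PAC’s actual implementation:

```python
import math

def fit_bradley_terry(wins, items, iters=200):
    """Estimate ability scores from pairwise wins using the standard
    MM (minorize-maximize) updates for the Bradley-Terry model.
    wins[(a, b)] = number of times a was preferred over b."""
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in items if j != i
            )
            new_p[i] = total_wins / denom if denom > 0 else p[i]
        p = new_p
    # Report abilities on a log scale, centred on zero.
    mean_log = sum(math.log(v) for v in p.values()) / len(p)
    return {i: math.log(v) - mean_log for i, v in p.items()}

def win_probability(ability_a, ability_b):
    """P(A is preferred over B) under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(ability_b - ability_a))

# Toy data: A usually beats B, B usually beats C.
wins = {("A", "B"): 8, ("B", "A"): 2,
        ("B", "C"): 7, ("C", "B"): 3,
        ("A", "C"): 9, ("C", "A"): 1}
abilities = fit_bradley_terry(wins, ["A", "B", "C"])
ranking = sorted(abilities, key=abilities.get, reverse=True)  # best first
```

Note that two videos of equal quality end up with equal abilities, so either may be picked in a comparison: the model gives each a 50% win probability, which is why the assessors’ tied choices did no harm.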


Assessors evaluated the quality of animations using pairwise comparisons in D-PAC


‘After explaining comparative judgement, students accepted their grade’

Ivan and his team assigned grades to the animations based upon the order and the ability scores. As there were gaps between ability scores, the final grades were not equally distributed over the ranking order. For instance, the top 2 videos got 18/20 and 16/20. Teachers were happy with this more objective grading system. ‘When I look at certain videos and their grade, I notice that I would have given a higher or lower grade depending on my personal taste or my relation with the students’, Ivan explained. In his experience, including external people in the evaluation eliminated this bias. Only 2 students were a bit disappointed about the grade they received. But after the procedure of comparative judgement was explained, they accepted their grade. The fact that 9 people contributed to ranking the videos, instead of only one teacher, convinced them the grade was fair.
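The gap-preserving step from ability scores to grades could be sketched as a linear rescaling. This is an illustrative assumption: the source only says grades were based on the order and the ability scores, not how the mapping was done, and the grade bounds below are hypothetical.

```python
def abilities_to_grades(abilities, low=8.0, high=18.0):
    """Rescale ability scores linearly onto a grade range, so that gaps
    between abilities carry over into gaps between grades.
    The bounds are hypothetical; D-PAC produces the ranking and scores,
    the teachers chose the actual grades."""
    lo, hi = min(abilities), max(abilities)
    return [round(low + (a - lo) / (hi - lo) * (high - low), 1)
            for a in abilities]

# A visible gap between the top abilities yields a visible grade gap.
grades = abilities_to_grades([-1.0, -0.4, 0.2, 1.5])
```

With these toy abilities the large gap between the top two videos (0.2 vs 1.5) produces a correspondingly large gap in grades, mirroring the unequal distribution described above.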


More information

D-PAC allows educators to assess students’ performance in video or images in a more reliable and credible way, without increasing the workload of teachers.

Want to find out more? Send us an e-mail.


Media & Learning Newsletter

This blog has been published in the newsletter of Media & Learning.


D-PAC successfully handles video material at large scale

A first pairwise comparison experiment with video material in D-PAC has been successfully completed. The goal of this experiment was twofold: (1) to test the scalability of the tool with videos, and (2) to test the inter-rater reliability.

A group of 134 students in Education Sciences judged 9 clips on the quality of the simulated semi-structured scientific interview demonstrated in each clip. The pairwise comparisons were all scheduled synchronously: in total, 134 assessors were simultaneously interacting with the D-PAC system, which was sending out video clips to them. No technological issues arose during the experiment, leading to a very positive conclusion about the scalability of the D-PAC tool.

To test the inter-rater reliability, the group of assessors was split into three random groups of 46, 44 and 44 assessors. All groups assessed the videos in a comparative manner; the only difference between the groups was the kind of feedback they provided after each completed comparison. Group 1 was not instructed to give any argumentation or feedback during the process. Group 2 was asked to give a short overall argumentation for their choice after each comparison. Group 3 was asked to write down some positive and negative features of each interview after each comparison. The groups made 520, 354 and 351 comparisons, respectively.

Based on the pairwise comparison data, we calculated the Scale Separation Reliability for each of the three groups of assessors separately. The results are given in Table 1 and show that the reliabilities are high (.91–.93).

Table 1. Scale Separation Reliability and average number of comparisons per video

         Scale Separation Reliability   Average number of comparisons per video
Group 1  .93                            104
Group 2  .93                            79
Group 3  .91                            78
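A common Rasch-style formulation of the Scale Separation Reliability is the share of observed variance in the ability estimates that is not measurement error. The sketch below uses that formulation as an assumption; the source does not spell out the exact computation:

```python
def scale_separation_reliability(abilities, standard_errors):
    """SSR = (observed variance - mean error variance) / observed variance.
    abilities: estimated ability per video.
    standard_errors: the standard error of each ability estimate."""
    n = len(abilities)
    mean = sum(abilities) / n
    observed_var = sum((a - mean) ** 2 for a in abilities) / n
    error_var = sum(se ** 2 for se in standard_errors) / n
    return (observed_var - error_var) / observed_var

# Well-separated abilities with small standard errors give a high SSR.
ssr = scale_separation_reliability([-2, -1, 0, 1, 2], [0.3] * 5)
```

More comparisons per video shrink the standard errors, which is consistent with Group 1 (104 comparisons per video) reaching at least the reliability of the smaller groups.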


To answer the question of inter-rater reliability, we calculated the correlations between the estimated abilities (based on the Bradley-Terry-Luce model) from each of the three assessments (see Table 2). The Spearman rank correlation between the two assessments in which assessors had to provide an argumentation (Group 2) or feedback (Group 3) is the highest (.87). The Spearman rank correlations between the scores resulting from the assessment without any argumentation (Group 1) and the two other conditions are somewhat smaller (.82 and .84). Overall, these correlations are high.


Table 2. Spearman Rank Correlations between scores coming from the 3 groups of assessors

         Group 1   Group 2   Group 3
Group 1  1
Group 2  .82       1
Group 3  .84       .87       1
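Rank correlations like those in Table 2 can be computed from any two ability vectors with a small self-contained Spearman implementation (plain Python, with tied values receiving their average rank; the toy scores below are invented for illustration):

```python
def spearman_rho(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            # Group tied values and give them their average rank.
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average rank of tie group
            for k in range(i, j + 1):
                r[order[k]] = avg_rank
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sd_x = sum((a - mx) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)

# Two groups ranking 5 videos in almost the same order.
rho = spearman_rho([0.1, 0.9, 1.4, 2.2, 3.0], [0.2, 1.1, 1.0, 2.5, 2.9])
```

Because it only depends on the ordering, Spearman’s rho is a natural choice here: it asks whether the three groups rank the videos the same way, not whether they produce identical ability values.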


Given that each of the 36 possible pairs was assessed by multiple assessors within and between the three groups, we were able to calculate the agreement between assessors for each possible pair. In Figure 1 the agreement is plotted per pair, split up for the three groups of assessors. On average, the agreement in each group is around 77%. For some pairs the agreement is only 50%; for other pairs it is 100%. These differences can, of course, be partly attributed to the fact that some pairs are more difficult to judge than others. Comparing the results of the three groups showed no significant differences.

[Figure 1. Agreement between assessors per pair, for each of the three groups.]
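The per-pair agreement can be computed as the share of assessors who picked the majority option in that pair; this definition is an assumption, as the source does not state exactly how the figure was produced:

```python
def pair_agreement(choices):
    """Agreement for one pair of videos. `choices` holds one entry per
    judgement: 1 if video A was chosen, 0 if video B was chosen.
    Returns a value between 0.5 (an even split) and 1.0 (unanimity)."""
    wins_a = sum(choices)
    return max(wins_a, len(choices) - wins_a) / len(choices)

# An easy pair (unanimous) versus a hard pair (split down the middle).
easy = pair_agreement([1, 1, 1, 1])
hard = pair_agreement([1, 0, 1, 0])
```

Under this definition the floor is 50%, which matches the hardest pairs in the figure: when two clips are of near-equal quality, assessors split evenly between them.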

To conclude, this pairwise comparison experiment first of all demonstrates the robustness of the tool in handling large numbers of assessors assessing video clips simultaneously. In addition, the resulting scales and pairwise comparison data show that the inter-rater reliability is high as well.