Using D-PAC for CV-screening

Comparative judgement (CJ) is nowadays predominantly used in the educational domain. The D-PAC team aims to explore CJ's strengths beyond this realm, for example in recruitment and selection. We therefore conducted a try-out to investigate whether D-PAC could successfully be applied to CV screening. To this end we partnered with Hudson (http://be.hudson.com – a human resources consultancy company), which had received a job opening from a client. Forty-two CVs were received, and 7 assessors used D-PAC to compare them. Assessors also provided pairwise feedback to justify each choice. The main questions concerned reliability and validity: (1) how reliable is a D-PAC assessment of CV screening with expert assessors (if the assessment were performed again, how strongly would the ranking resemble the current one)? And (2) do all assessors look at the same, relevant criteria of the CVs in relation to the job ad (validity)?

Results show that the assessment reached a high reliability (SSR = .88 – see Figure 1). This high reliability was already achieved after 14 rounds, and the cut-off for acceptable reliability (SSR = .70) was reached after only 9 rounds. The total assessment took 11.5 hours, including pairwise feedback. Since high reliability was already attained early on (9 rounds), this time investment can be reduced drastically, to about 5 hours. Even that is an overestimation, since in practice assessors would not provide feedback on the CVs. To give an indication: it takes about 73 seconds to read two CVs and decide which one fits the job better; when assessors also have to justify their choice, this increases to 90 seconds per pair. In sum, attaining a reliability of .70 without providing any feedback comes down to a time investment of about five and a half minutes per CV.
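The arithmetic behind these estimates can be checked with a quick back-of-the-envelope calculation. This is only a sketch, assuming that one round means every CV appears in exactly one pair (21 comparisons per round for 42 CVs):

```python
n_cvs = 42
pairs_per_round = n_cvs // 2            # 21 comparisons if every CV appears once per round
sec_with_feedback = 90                  # seconds per pair: choice plus written justification
sec_without_feedback = 73               # seconds per pair: choice only

full = 23 * pairs_per_round * sec_with_feedback / 3600
print(f"23 rounds, with feedback: {full:.1f} hours")            # ~12 hours (reported: 11.5)

short = 9 * pairs_per_round * sec_with_feedback / 3600
print(f"9 rounds, with feedback:  {short:.1f} hours")           # ~4.7 hours (reported: ~5)

per_cv = 9 * pairs_per_round * sec_without_feedback / n_cvs / 60
print(f"9 rounds, no feedback:    {per_cv:.1f} minutes per CV") # ~5.5 minutes
```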

Figure 1: Reliability (SSR) of the CV-screening assessment. In total, 23 rounds were performed. Blue lines indicate different reliability levels: .80 was reached after 14 rounds, .70 after 9 rounds.

Additionally, assessors' arguments were analysed to inspect the validity of the assessment. The main themes discussed were 'work and job experience', 'education', 'overqualification' and 'job hopping'. Two themes recurred in all 7 assessors' arguments: work and job experience, and education. One theme was mentioned by only one assessor: 'age'. The top arguments per assessor are presented in Figure 2. Most striking is that relevant experience and the amount of experience were mentioned most frequently by every assessor. In addition, job hopping was mentioned often by assessor 2.

Figure 2: Top arguments given by each of the 7 assessors.

Next, we investigated which CVs ended up in the lowest or highest positions in the ranking and what type of comments they mainly received. When assessors commented on a candidate's amount of experience (or the lack of it), that CV had a higher chance of being ranked lower. Conversely, when assessors discussed candidates' education, general experience, overqualification, bilingualism, job hopping or the explanation given of their experience, the CV was more likely to end up in the higher part of the ranking (see Table 1).

Argument                     Low ranking    High ranking
Amount of experience              40              26
Education                         18              35
General experience                 1              22
Overqualified                      0               6
Bilingualism                       2               8
Job hopping                        2               9
Explanation of experience          0               6

Table 1: Arguments whose frequency differs between CVs in the lower part and CVs in the higher part of the ranking
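As an illustration, a tally like the one in Table 1 could be produced with a few lines of Python, assuming every coded argument has been stored together with the final rank of the CV it refers to. The records and variable names below are hypothetical, not the actual D-PAC output:

```python
from collections import Counter

n_cvs = 42

# Hypothetical input: one record per coded argument, with its theme and the
# final rank (1 = best) of the CV it was made about.
coded_arguments = [
    {"theme": "Amount of experience", "rank": 38},
    {"theme": "Education", "rank": 5},
    # ... one entry per coded argument
]

low, high = Counter(), Counter()
for arg in coded_arguments:
    # Ranks 1-21 count as the higher part of the ranking, 22-42 as the lower part.
    target = high if arg["rank"] <= n_cvs // 2 else low
    target[arg["theme"]] += 1

for theme in sorted(set(low) | set(high)):
    print(f"{theme:25s}  low: {low[theme]:3d}   high: {high[theme]:3d}")
```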

To summarise, this try-out reveals many opportunities. First, it indicates that D-PAC is usable in the recruitment and selection domain, achieving high reliability in a short amount of time. In addition, the time investment can be reduced further in future, similar assessments, increasing efficiency. Second, regarding validity, the analysis of the provided arguments indicates that recruiters share a focus on the experience relevant to this job. At the same time, each recruiter imposes different emphases during the assessment, which is captured when using multiple assessors. This further underpins the logic of including multiple assessors in a CV-screening process.

D-PAC successfully handles video material on a large scale

A first pairwise comparison experiment with video material in D-PAC has been successfully completed. The goal of this experiment was twofold: (1) to test the scalability of the tool when using videos; and (2) to test the inter-rater reliability.

A group of 134 students in Education Sciences judged 9 clips on the quality of the simulated scientific semi-structured interview demonstrated in them. The pairwise comparisons were all scheduled synchronously: in total, 134 assessors were simultaneously interacting with the D-PAC system, which was serving video clips to all of them. No technological issues arose during the experiment, leading to a very positive conclusion about the scalability of the D-PAC tool.

In order to test the inter-rater reliability, the assessors were randomly split into three groups of 46, 44 and 44 assessors. All groups assessed the videos comparatively; the only difference between the groups was the feedback to be provided after each completed comparison. Group 1 was not specifically instructed to give any argumentation or feedback during the process. Group 2 was asked to give a short overall argumentation for its choice after each comparison. Group 3 was asked to write down some positive and negative features of each interview after each comparison. The groups made 520, 354 and 351 comparisons, respectively.

Based on the pairwise comparison data, we calculated the Scale Separation Reliability (SSR) for each of the three groups of assessors separately. The results are given in Table 1 and show that the reliabilities are high (.91 – .93).

Table 1. Scale separation reliability and average number of comparisons per video

          Scale Separation Reliability    Average number of comparisons per video
Group 1              .93                                   104
Group 2              .93                                    79
Group 3              .91                                    78
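For reference, the sketch below shows one common way to compute the SSR from the estimated abilities and their standard errors, as the proportion of observed variance that is not measurement error. The variable names are hypothetical, and D-PAC's exact implementation may differ:

```python
import numpy as np

def scale_separation_reliability(theta, se):
    """Proportion of the observed variance in the ability estimates that is
    not measurement error: SSR = (var(theta) - mean(se**2)) / var(theta)."""
    theta, se = np.asarray(theta, float), np.asarray(se, float)
    observed_var = theta.var(ddof=1)
    return (observed_var - np.mean(se ** 2)) / observed_var

# Usage, given BTL ability estimates and standard errors for the 9 clips:
# ssr_group1 = scale_separation_reliability(theta_g1, se_g1)   # e.g. ~.93
```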

 

To answer the question of inter-rater reliability, we calculated the correlations between the estimated abilities (based on the Bradley-Terry-Luce model) from each of the three assessments (see Table 2). The Spearman rank correlation between the assessment in which assessors provided an argumentation (Group 2) and the one in which assessors provided feedback (Group 3) is the highest (.87). The Spearman rank correlations between the scores from the assessment without any argumentation (Group 1) and the two other conditions are somewhat smaller (.82 and .84). Overall, these correlations are high.

 

Table 2. Spearman rank correlations between the scores from the three groups of assessors

          Group 1    Group 2    Group 3
Group 1      1
Group 2     .82         1
Group 3     .84        .87          1
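As an illustration of this analysis, the sketch below fits a basic Bradley-Terry model per group using the classical MM (Zermelo) updates and correlates the resulting abilities with a Spearman rank correlation. The win matrices (`wins_g1`, `wins_g2`, `wins_g3`) are hypothetical, and D-PAC's own estimation routine may differ:

```python
import numpy as np
from scipy.stats import spearmanr

def fit_bradley_terry(wins, n_iter=500):
    """Minimal Bradley-Terry fit using the classical MM (Zermelo) updates.
    wins[i, j] = number of times clip i was preferred over clip j.
    Assumes a zero diagonal and that every clip wins at least once."""
    n = wins.shape[0]
    comparisons = wins + wins.T                 # total judgements per pair
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            p[i] = wins[i].sum() / np.sum(comparisons[i] / (p[i] + p))
        p /= p.sum()                            # fix the arbitrary scale
    return np.log(p)                            # abilities on the log scale

# Usage, given a 9 x 9 win matrix per group of assessors:
# theta_g1, theta_g2, theta_g3 = (fit_bradley_terry(w) for w in (wins_g1, wins_g2, wins_g3))
# rho, _ = spearmanr(theta_g1, theta_g2)        # e.g. .82 for groups 1 and 2
```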

 

Given that each of the 36 possible pairs was assessed by multiple assessors, both within and between the three groups, we were able to calculate the agreement between assessors for each possible pair. In Figure 1 the agreement is plotted per pair, split up for the three groups of assessors. As shown, the average agreement in each group is around 77%. For some pairs the agreement is only 50%, for others it is 100%. These differences can, of course, be partially attributed to the fact that some pairs are more difficult to judge than others. Comparing the results of the three groups showed no significant differences.

Figure 1: Agreement between assessors per pair, split up for the three groups of assessors.
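A minimal sketch of how this per-pair agreement can be computed, assuming the judgements for each pair are stored as 0/1 outcomes (the data structure is hypothetical):

```python
import numpy as np

def agreement_per_pair(judgements):
    """judgements[(a, b)] is a list of 0/1 outcomes (1 = clip a preferred) from
    every assessor who judged the pair (a, b). Agreement for a pair is the share
    of assessors siding with the majority, so it ranges from .50 to 1.00."""
    return {pair: max(np.mean(outcomes), 1 - np.mean(outcomes))
            for pair, outcomes in judgements.items()}

# Usage, with judgements collected over the 36 possible pairs of 9 clips:
# agreement = agreement_per_pair(judgements)
# print(np.mean(list(agreement.values())))      # average agreement, ~0.77
```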

To conclude, this pairwise comparison experiment first of all demonstrates the robustness of the tool when large numbers of assessors judge video clips simultaneously. The resulting scales and pairwise comparison data furthermore show that the inter-rater reliability is rather high as well.

Paper published

Performance assessments are often used to evaluate competences. But what is the best way to score such performance assessments? With criteria lists? Holistically?

The article "Competenties kwaliteitsvol beoordelen: brengt een comparatieve aanpak soelaas?" ("Assessing competences with quality: does a comparative approach offer a solution?"), published in 'Tijdschrift voor Hoger Onderwijs', proposes the comparative judgement method for scoring performance assessments in a higher education context.