Automated speech recognition bias in personnel selection: The case of automatically scored job interviews

J Appl Psychol. 2024 Oct 31. doi: 10.1037/apl0001247. Online ahead of print.

Abstract

Organizations, researchers, and software increasingly use automatic speech recognition (ASR) to transcribe speech to text. However, ASR can be less accurate for (i.e., biased against) certain demographic subgroups. This is concerning, given that the machine-learning (ML) models used to automatically score video interviews use ASR transcriptions of interviewee responses as inputs. To address these concerns, we investigate the extent of ASR bias and its effects in automatically scored interviews. Specifically, we compare the accuracy of ASR transcription for English as a second language (ESL) versus non-ESL interviewees, people of color (and Black interviewees separately) versus White interviewees, and male versus female interviewees. Then, we test whether ASR bias causes bias in ML model scores, both in terms of differential convergent correlations (i.e., subgroup differences in correlations between observed and ML scores) and differential means (i.e., shifts in subgroup differences from observed to ML scores). To do so, we apply one human and four ASR transcription methods to two samples of mock video interviews (Ns = 1,014 and 414), and then we train and test models using these different transcripts to score multiple constructs. We observed significant bias in the commercial ASR services across nearly all comparisons, with the magnitude of bias differing across the ASR services. However, the transcription bias did not translate into meaningful measurement bias for the ML interview scores, whether in terms of differential convergent correlations or means. We discuss what these results mean for the nature of bias, fairness, and validity of ML models for scoring verbal open-ended responses.
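To make the two bias tests concrete, the sketch below (not the authors' analysis code) shows how they could be computed on hypothetical data with assumed column names (observed_score, ml_score, group): differential convergent correlations compare the observed-ML correlation across subgroups, and differential means compare the standardized subgroup difference (d) for observed versus ML scores.

```python
# Minimal sketch, assuming hypothetical data; illustrates the two bias tests
# named in the abstract, not the authors' actual procedure.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: human-rated ("observed") scores, ML-predicted scores,
# and a binary demographic group indicator (e.g., ESL vs. non-ESL).
df = pd.DataFrame({
    "group": rng.choice(["focal", "reference"], size=n),
    "observed_score": rng.normal(3.5, 0.7, size=n),
})
df["ml_score"] = 0.6 * df["observed_score"] + rng.normal(0, 0.5, size=n)

# 1) Differential convergent correlations: does the correlation between
#    observed and ML scores differ across demographic subgroups?
conv_r = df.groupby("group").apply(
    lambda g: g["observed_score"].corr(g["ml_score"])
)
print("Convergent r by subgroup:\n", conv_r)

# 2) Differential means: does the standardized subgroup difference (Cohen's d)
#    shift when moving from observed scores to ML scores?
def cohens_d(scores, groups, focal="focal", reference="reference"):
    a = scores[groups == focal]
    b = scores[groups == reference]
    pooled_sd = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
        / (len(a) + len(b) - 2)
    )
    return (a.mean() - b.mean()) / pooled_sd

d_observed = cohens_d(df["observed_score"], df["group"])
d_ml = cohens_d(df["ml_score"], df["group"])
print(f"d (observed) = {d_observed:.3f}, d (ML) = {d_ml:.3f}")
```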