Sleep studies face new challenges in terms of data, objectives and metrics, which requires reappraising the adequacy of existing analysis methods, including scoring methods. Visual and automatic sleep scoring of healthy individuals were compared in terms of reliability (i.e., accuracy and stability) to identify a scoring method capable of giving access to the actual variability in the data without adding exogenous variability. A first dataset (DS1, four recordings) scored by six experts plus an autoscoring algorithm was used to characterize inter-scoring variability. A second dataset (DS2, 88 recordings), scored a few weeks later, was used to explore intra-expert variability. Percentage agreements and Conger's kappa were derived from epoch-by-epoch comparisons of pairwise and consensus scorings. On DS1, the proportion of epochs on which scorers agreed decreased as the number of experts increased, from 86% (pairwise) to 69% (all experts). Adding autoscoring to the visual scorings changed the kappa value from 0.81 to 0.79. Agreement between the expert consensus and autoscoring was 93%. On DS2, the hypothesis of intra-expert variability was supported by a systematic decrease, from DS1 to DS2, in the kappa between autoscoring (used as reference) and each individual expert (from 0.75 to 0.70). Although visual scoring induces both inter- and intra-expert variability, autoscoring methods can cope with intra-scorer variability, making them a sensible option for reducing exogenous variability and giving access to the endogenous variability in the data.
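To make the reported agreement metrics concrete, the sketch below illustrates how epoch-by-epoch percentage agreement and Conger's kappa can be computed from multi-rater sleep scorings. It is a minimal illustration and not the authors' pipeline: the stage labels, rater data and function names are assumptions, and Conger's kappa is implemented here following its standard definition, in which chance agreement is derived from each rater's own stage marginals and averaged over rater pairs.

```python
# Minimal sketch (not the authors' code): epoch-by-epoch agreement metrics
# for multi-rater sleep scorings. Stage labels and toy data are hypothetical.
from itertools import combinations
import numpy as np

STAGES = ["W", "N1", "N2", "N3", "R"]  # assumed AASM stage labels

def percent_agreement(scorings: np.ndarray) -> float:
    """Mean pairwise epoch-by-epoch agreement; scorings has shape (n_raters, n_epochs)."""
    pairs = list(combinations(range(scorings.shape[0]), 2))
    return float(np.mean([np.mean(scorings[g] == scorings[h]) for g, h in pairs]))

def conger_kappa(scorings: np.ndarray, categories=STAGES) -> float:
    """Conger's (1980) multi-rater kappa: chance agreement uses each rater's own marginals."""
    n_raters, _ = scorings.shape
    # Per-rater proportion of epochs assigned to each stage, shape (n_raters, n_categories)
    marginals = np.array([[np.mean(scorings[r] == c) for c in categories]
                          for r in range(n_raters)])
    pairs = list(combinations(range(n_raters), 2))
    p_obs = np.mean([np.mean(scorings[g] == scorings[h]) for g, h in pairs])
    p_exp = np.mean([np.dot(marginals[g], marginals[h]) for g, h in pairs])
    return float((p_obs - p_exp) / (1.0 - p_exp))

# Toy usage: three raters scoring ten 30-s epochs (fabricated labels, illustration only)
scorings = np.array([
    ["W", "N1", "N2", "N2", "N3", "N3", "N2", "R", "R", "W"],
    ["W", "N1", "N2", "N2", "N3", "N2", "N2", "R", "R", "W"],
    ["W", "N2", "N2", "N2", "N3", "N3", "N2", "R", "N1", "W"],
])
print(f"agreement = {percent_agreement(scorings):.2f}, kappa = {conger_kappa(scorings):.2f}")
```

Unlike Fleiss' kappa, which pools marginals across raters, Conger's formulation keeps each scorer's stage distribution separate, which is why it is suited to comparing named experts (and an autoscoring algorithm treated as an additional rater) rather than interchangeable raters.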
Keywords: automatic scoring; large datasets; scoring variability; visual scoring.
© 2020 European Sleep Research Society.