Within medical informatics there is widespread interest in computer-based decision support and the evaluation of its impact. It is widely recognized that the measurement of dependent variables, or outcomes, is the most challenging aspect of this work. This paper describes an outcome metric for studies of diagnostic decision support and reports its reliability and validity. The results of this study will guide the analytic methods used in our ongoing multi-site study of the effects of decision support on diagnostic reasoning. Our measurement approach conceptualizes the quality of a diagnostic hypothesis set as having two components that are summed to generate a composite index: a Plausibility Component, derived from ratings of each hypothesis in the set, whether correct or incorrect; and a Location Component, derived from the location of the correct diagnosis if it appears in the set. The reliability of this metric is determined by the extent of interrater agreement on the plausibility of diagnostic hypotheses. Validity is determined by the extent to which the index generates scores that make sense on inspection (face validity), and by the extent to which the component scores are non-redundant and discriminate between the performance of novices and experts (construct validity). Using data from the pilot and main phases of our ongoing study (n = 124 subjects working 1116 cases), the reliability of our diagnostic quality metric ranged from 0.85 to 0.88. On inspection, the metric generated no clearly counterintuitive scores. Using data from the pilot phase of our study (n = 12 subjects working 108 cases), the component scores were moderately correlated (r = 0.68). The composite index, computed by weighting both components equally, was found to discriminate the hypotheses of medical students from those of attending physicians by 0.97 standard deviation units. Based on these findings, we have adopted this metric for use in our further research exploring the impact of decision support systems on diagnostic reasoning and will make it available to the informatics research community.
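
For concreteness, the sketch below illustrates one way such a two-component composite index could be computed. It is a minimal illustration only: the plausibility rating scale, the normalization to [0, 1], and the reciprocal-rank form of the Location Component are assumptions made for this example, not the scoring rules defined in the paper.

    from typing import Optional, Sequence

    def plausibility_component(ratings: Sequence[float], max_rating: float = 7.0) -> float:
        # Mean judge-assigned plausibility rating across all hypotheses in the set
        # (correct or incorrect), normalized to [0, 1]. The 7-point scale is an
        # assumption for illustration.
        if not ratings:
            return 0.0
        return sum(ratings) / (len(ratings) * max_rating)

    def location_component(correct_rank: Optional[int]) -> float:
        # Score based on where the correct diagnosis appears in the set
        # (rank 1 = listed first); 0 if the correct diagnosis is absent.
        # The reciprocal-rank form is an assumption for illustration.
        if correct_rank is None:
            return 0.0
        return 1.0 / correct_rank

    def composite_index(ratings: Sequence[float], correct_rank: Optional[int]) -> float:
        # Equally weighted sum of the two components, as described in the abstract.
        return 0.5 * plausibility_component(ratings) + 0.5 * location_component(correct_rank)

    # Example: a five-hypothesis set rated on a 7-point scale, with the correct
    # diagnosis ranked second. Yields 0.5 * (18/35) + 0.5 * (1/2), roughly 0.51.
    print(composite_index(ratings=[6, 5, 3, 2, 2], correct_rank=2))

Because both components are bounded in [0, 1] under these assumptions, the composite score is also bounded and comparable across cases; the scale used in the study itself may differ.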