Representation of EHR data for predictive modeling: a comparison between UMLS and other terminologies

Laila Rasmy; Firat Tiryaki; Yujia Zhou; Yang Xiang; Cui Tao; Hua Xu; Degui Zhi

doi:10.1093/jamia/ocaa180

J Am Med Inform Assoc. 2020 Oct 1;27(10):1593-1599. doi: 10.1093/jamia/ocaa180.

Authors

Laila Rasmy¹, Firat Tiryaki¹, Yujia Zhou¹, Yang Xiang¹, Cui Tao¹, Hua Xu¹, Degui Zhi¹

Affiliation

¹ School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas, USA.

Abstract

Objective: Predictive disease modeling using electronic health record data is a growing field. Although clinical data in their raw form can be used directly for predictive modeling, it is a common practice to map data to standard terminologies to facilitate data aggregation and reuse. There is, however, a lack of systematic investigation of how different representations could affect the performance of predictive models, especially in the context of machine learning and deep learning.

Materials and methods: We projected the input diagnosis data in the Cerner HealthFacts database to the Unified Medical Language System (UMLS) and 5 other terminologies, including CCS, CCSR, ICD-9, ICD-10, and PheWAS, and evaluated the prediction performance of these terminologies on 2 different tasks: the risk prediction of heart failure in diabetes patients and the risk prediction of pancreatic cancer. Two popular models were evaluated: logistic regression and a recurrent neural network.
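To make the projection step concrete, the following is a minimal sketch of mapping source diagnosis codes to a target terminology and building per-patient multi-hot features. It is an illustration under stated assumptions, not the authors' pipeline: the mapping table `TOY_MAP`, its category labels, and the helper names `project_codes`, `build_vocab`, and `multi_hot` are all hypothetical, and a real study would load a published crosswalk file instead.

```python
# Hypothetical toy crosswalk from source diagnosis codes to a coarser target
# terminology. The category labels are invented for illustration only.
TOY_MAP = {
    "250.00": "DIABETES",
    "250.02": "DIABETES",
    "428.0": "HEART_FAILURE",
    "401.9": "HYPERTENSION",
}


def project_codes(codes, mapping, unknown="UNMAPPED"):
    """Map each source code to the target terminology; unmapped codes share one token."""
    return [mapping.get(code, unknown) for code in codes]


def build_vocab(patients):
    """Collect the sorted set of distinct codes seen across all patients."""
    return sorted({code for codes in patients for code in codes})


def multi_hot(patients, vocab):
    """Encode each patient's code list as a 0/1 vector over the vocabulary."""
    index = {code: i for i, code in enumerate(vocab)}
    rows = []
    for codes in patients:
        row = [0] * len(vocab)
        for code in set(codes):
            if code in index:
                row[index[code]] = 1
        rows.append(row)
    return rows


# Three toy patients, each a list of raw diagnosis codes.
patients_raw = [["250.00", "401.9"], ["250.02", "428.0"], ["999.9"]]
patients_projected = [project_codes(p, TOY_MAP) for p in patients_raw]

raw_vocab = build_vocab(patients_raw)
projected_vocab = build_vocab(patients_projected)
features = multi_hot(patients_projected, projected_vocab)
```

In this toy case the projection shrinks the vocabulary from 5 raw codes to 4 target categories, which is the granularity trade-off the comparison examines: a coarser terminology aggregates related codes but discards distinctions that a finer-grained one keeps.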

Results: For logistic regression, using UMLS delivered the best area under the receiver operating characteristic curve (AUROC) results in both the heart failure in diabetes patients (81.15%) and pancreatic cancer (80.53%) tasks. For the recurrent neural network, UMLS worked best for pancreatic cancer prediction (AUROC 82.24%) and was second (AUROC 85.55%) only to PheWAS (AUROC 85.87%) for prediction of heart failure in diabetes patients.
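For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The function name `auroc` and the toy labels and scores are assumptions for illustration and are unrelated to the study's data.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC: fraction of positive/negative pairs
    in which the positive case is scored higher, counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 2 negatives and 2 positives with predicted risk scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

The quadratic pairwise loop is chosen for clarity; on cohorts of realistic size a rank-based implementation from a standard library would be the practical choice.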

Discussion/conclusion: In our experiments, terminologies with larger vocabularies and finer-grained representations were associated with better prediction performance. In particular, UMLS was consistently one of the best-performing terminologies. We believe that our work may help to inform better designs of predictive models, although further investigation is warranted.

Keywords: UMLS; electronic health records; predictive modeling; terminology representation.

Publication types

Comparative Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Aged
Databases, Factual
Electronic Health Records*
Female
Humans
Male
Middle Aged
ROC Curve
Unified Medical Language System*
Vocabulary, Controlled*
