Taming EHR data: using semantic similarity to reduce dimensionality

Stud Health Technol Inform. 2013:192:52-6.

Abstract

Medical care data is a valuable resource that can be used for many purposes, including managing and planning for future health needs as well as clinical research. However, the heterogeneity and complexity of medical data can be an obstacle to applying data mining techniques, so much of the potential value of this data goes untapped. In this paper we develop a methodology that reduces the dimensionality of primary care data in order to make it more amenable to visualisation, mining and clustering. The methodology combines ontology-based semantic similarity with principal component analysis (PCA) to map the data into an appropriate and informative low-dimensional space. Throughout the study, we had access to anonymised patient data from primary care in Salford, UK. The results of applying this methodology show that diagnosis codes in primary care data can be used to map patients into an informative low-dimensional space, which in turn supports further data exploration and medical hypothesis formulation.
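
The general idea of combining ontology-based similarity with PCA can be illustrated with a minimal sketch. The abstract does not state which ontology, similarity measure, or patient-level aggregation was used, so everything below is an assumption made for illustration: the toy hierarchy `PARENT` (not real diagnosis codes), a Wu-Palmer-style code similarity, a best-match-average patient similarity, and PCA applied to the rows of the resulting patient similarity matrix.

```python
import numpy as np

# Hypothetical toy hierarchy: child code -> parent code (not real clinical codes).
PARENT = {
    "root": None,
    "circulatory": "root",
    "respiratory": "root",
    "hypertension": "circulatory",
    "angina": "circulatory",
    "asthma": "respiratory",
    "copd": "respiratory",
}

def ancestors(code):
    """Return the path from a code up to the root, including the code itself."""
    path = []
    while code is not None:
        path.append(code)
        code = PARENT[code]
    return path

def code_similarity(a, b):
    """Wu-Palmer-style similarity (assumed measure): 2*depth(lca) / (depth(a) + depth(b))."""
    path_a, path_b = ancestors(a), ancestors(b)
    # path_a runs from the code upwards, so the first shared node is the
    # lowest common ancestor.
    lca = next(c for c in path_a if c in path_b)
    depth = lambda c: len(ancestors(c))  # the root has depth 1
    return 2.0 * depth(lca) / (depth(a) + depth(b))

def patient_similarity(p, q):
    """Best-match average over two patients' code lists (assumed aggregation)."""
    best_p = [max(code_similarity(a, b) for b in q) for a in p]
    best_q = [max(code_similarity(b, a) for a in p) for b in q]
    return (sum(best_p) + sum(best_q)) / (len(p) + len(q))

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Four illustrative patients, each a list of diagnosis codes.
patients = [
    ["hypertension", "angina"],
    ["hypertension"],
    ["asthma", "copd"],
    ["copd"],
]

# Patient-by-patient similarity matrix, then a 2-D embedding of its rows.
S = np.array([[patient_similarity(p, q) for q in patients] for p in patients])
coords = pca_project(S, n_components=2)
print(np.round(coords, 3))
```

In this toy setting the two circulatory patients land close together in the projected space and far from the two respiratory patients, which is the kind of structure a low-dimensional map is meant to expose for visualisation or clustering. Other similarity measures or embedding methods could be substituted without changing the overall shape of the pipeline.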

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Data Compression / methods*
  • Data Mining / methods*
  • Electronic Health Records*
  • Natural Language Processing*
  • Primary Health Care / methods*
  • Semantics*
  • Terminology as Topic*
  • United Kingdom