Using a pipeline to improve de-identification performance

AMIA Annu Symp Proc. 2009 Nov 14;2009:447-51.

Abstract

Effective de-identification methods are needed to support reuse of electronic health record data for research and other purposes. We investigated using two different text-processing systems in tandem as a strategy for de-identification of clinical notes. We ran 100 outpatient notes through deid.pl, from MIT's PhysioToolkit, followed by MedLEE, and we manually compared the output with the original notes to determine the amount of protected health information (PHI) retained. Pipelining resulted in an overall error rate of 2%, with 2 personal names retained in the output: one was an initial and the other was a common English word that is also used in medicine. All retained PHI was transformed into standardized medical concepts, making re-identification less likely. Pipelining with deid.pl improved the performance of MedLEE in excluding PHI from its output and may be a useful strategy for de-identifying clinical data while providing computer-readable output.
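
The tandem arrangement described above can be illustrated with a minimal Python sketch. It is not the study's implementation: `scrub_stage` and `concept_stage` are hypothetical stand-ins for deid.pl and MedLEE, the title-based regular expression and the `VOCAB` concept codes are invented for illustration, and `count_retained` only mimics the manual comparison by checking for known PHI strings in the output.

```python
import re
from typing import Callable, Iterable

# A stage takes note text and returns transformed text.
Stage = Callable[[str], str]

def scrub_stage(note: str) -> str:
    """Hypothetical first pass (stand-in for a scrubber such as deid.pl):
    mask a capitalized word that follows a personal title."""
    return re.sub(r"\b(Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+", r"\1. [NAME]", note)

# Invented toy vocabulary; the codes are placeholders, not real concept identifiers.
VOCAB = {"cough": "C_COUGH", "fever": "C_FEVER"}

def concept_stage(note: str) -> str:
    """Hypothetical second pass (stand-in for a concept extractor such as MedLEE):
    keep only terms found in the vocabulary and emit them as codes."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return " ".join(VOCAB[t] for t in tokens if t in VOCAB)

def run_pipeline(note: str, stages: Iterable[Stage]) -> str:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        note = stage(note)
    return note

def count_retained(output: str, phi_terms: Iterable[str]) -> int:
    """Count known PHI strings that still appear in the output (case-insensitive)."""
    lowered = output.lower()
    return sum(1 for term in phi_terms if term.lower() in lowered)

if __name__ == "__main__":
    note = "Dr. Smith saw the patient for cough and fever."
    output = run_pipeline(note, [scrub_stage, concept_stage])
    print(output)
    print(count_retained(output, ["Smith"]))
```

In this sketch the second stage drops anything outside its vocabulary, so a name missed by the first stage would survive only if it coincides with a vocabulary term, which loosely mirrors the kind of retained name reported in the abstract.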

Publication types

  • Comparative Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Confidentiality*
  • Electronic Health Records*
  • Humans
  • Methods
  • Natural Language Processing*