Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record Deidentification

AMIA Jt Summits Transl Sci Proc. 2021 May 17:2021:102-111. eCollection 2021.

Abstract

Electronic Health Records (EHRs) have become the primary form of medical record-keeping across the United States. Federal law restricts the sharing of any EHR data that contains protected health information (PHI). De-identification, the process of identifying and removing all PHI, is crucial for making EHR data publicly available for scientific research. This project explores several deep learning-based named entity recognition (NER) methods to determine which perform best on the de-identification task. We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital. We found that 1) Bi-LSTM-CRF is the best-performing encoder/decoder combination, 2) character embeddings tend to improve precision at the cost of recall, and 3) transformers alone underperform as context encoders. Future work focused on structuring medical text may improve the extraction of semantic and syntactic information for the purposes of EHR de-identification.
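
To make the de-identification step concrete, the following is a minimal sketch of how token-level NER output could be used to remove PHI from a sentence. It assumes the tagger emits BIO-style labels; the function name `redact`, the label names `NAME` and `DATE`, and the sample sentence are hypothetical illustrations, not the models, label set, or data used in this study.

```python
from typing import List, Optional


def redact(tokens: List[str], tags: List[str]) -> str:
    """Replace each BIO-tagged PHI span with a single [TYPE] placeholder.

    Hypothetical helper: the tag scheme and label names are assumptions.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    out: List[str] = []
    open_type: Optional[str] = None  # entity type of the span currently open

    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-PHI token: keep it and close any open span.
            out.append(token)
            open_type = None
            continue

        prefix, _, ent_type = tag.partition("-")
        # Emit a placeholder on B-, or on an I- that does not continue
        # the currently open span (a malformed but recoverable sequence).
        if prefix == "B" or ent_type != open_type:
            out.append(f"[{ent_type}]")
        open_type = ent_type

    return " ".join(out)


# Made-up example sentence with tags as an NER model might predict them.
tokens = ["Mr.", "John", "Smith", "was", "admitted", "on", "May", "17", "."]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "O", "B-DATE", "I-DATE", "O"]
redacted = redact(tokens, tags)
print(redacted)  # Mr. [NAME] was admitted on [DATE] .
```

In this framing, a missed PHI token (a recall error) leaves identifying text in the output, while a spurious tag (a precision error) only removes non-sensitive text, which is why the precision/recall trade-off noted above matters for this task.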

MeSH terms

  • Benchmarking*
  • Data Anonymization*
  • Electronic Health Records
  • Humans
  • United States