Classifying cancer pathology reports with hierarchical self-attention networks

Shang Gao; John X Qiu; Mohammed Alawad; Jacob D Hinkle; Noah Schaefferkoetter; Hong-Jun Yoon; Blair Christian; Paul A Fearn; Lynne Penberthy; Xiao-Cheng Wu; Linda Coyle; Georgia Tourassi; Arvind Ramanathan

doi:10.1016/j.artmed.2019.101726

Artif Intell Med. 2019 Nov:101:101726. doi: 10.1016/j.artmed.2019.101726. Epub 2019 Oct 15.

Affiliations

¹ Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, TN, USA. Electronic address: gaos@ornl.gov.
² Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
³ Surveillance Informatics Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD, USA.
⁴ Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, LA, USA.
⁵ Information Management Services Inc, Calverton, MD, USA.
⁶ Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, TN, USA. Electronic address: tourassig@ornl.gov.
⁷ Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, TN, USA. Electronic address: ramanathana@ornl.gov.

PMID: 31813492

Abstract

We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports, and show how its architecture leads to a new state of the art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks: site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data: Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state of the art). We show that HiSANs are superior to these other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to hierarchical attention networks, HiSANs are not only an order of magnitude faster to train but also achieve a relative improvement of about 1% in accuracy and 5% in macro F-score.
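
The abstract reports accuracy, macro F-score, and relative improvements without defining them. The following is a minimal sketch of the standard definitions, assuming macro F-score means the unweighted mean of per-class F1 and "relative" means (new - old) / old. The function names and the toy labels are hypothetical and are not taken from the paper or the SEER corpus.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list.

    Each class counts equally, so rare classes affect this score far more
    than they affect accuracy.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Convention assumed here: a class with no hits scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_improvement(new, old):
    """Relative change of `new` over `old`, e.g. 0.01 means 1% better."""
    return (new - old) / old


# Toy, made-up labels: one common class ("a") and two rare ones.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "c"]
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

In this toy case a single missed rare class ("b") leaves accuracy fairly high while pulling macro F1 down sharply, which is one plausible reason a classifier's macro F-score gain can exceed its accuracy gain on imbalanced label sets.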

Keywords: Cancer pathology reports; Clinical reports; Deep learning; Natural language processing; Text classification.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Deep Learning
Humans
Natural Language Processing
Neoplasms / classification
Neoplasms / pathology*
Neural Networks, Computer