Co-occurrence graphs for word sense disambiguation in the biomedical domain

Artif Intell Med. 2018 May:87:9-19. doi: 10.1016/j.artmed.2018.03.002. Epub 2018 Mar 21.

Abstract

Word sense disambiguation is a key step for many natural language processing tasks (e.g. summarization, text classification, relation extraction) and presents a challenge to any system that aims to process documents from the biomedical domain. In this paper, we present a new graph-based unsupervised technique to address this problem. The knowledge base used in this work is a graph built with co-occurrence information from medical concepts found in scientific abstracts, and hence adapted to the specific domain. Unlike other unsupervised approaches based on static graphs such as UMLS, the knowledge base in this work takes the context of the ambiguous terms into account. Abstracts downloaded from PubMed are used for building the graph, and disambiguation is performed using the personalized PageRank algorithm. Evaluation is carried out over two test datasets widely explored in the literature. Different parameters of the system are also evaluated to test robustness and scalability. Results show that the system is able to outperform state-of-the-art knowledge-based systems, improving accuracy by more than 10% in some cases, while requiring only minimal external resources.
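The pipeline described in the abstract (a concept co-occurrence graph, then personalized PageRank seeded with the context of the ambiguous term) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the concept labels, and the toy documents are hypothetical, and weighting edges by raw co-occurrence counts is an assumption, since the abstract does not say how edges are weighted or filtered. A real system would likely use UMLS concept identifiers extracted from PubMed abstracts.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Build an undirected weighted graph from lists of concepts.

    Assumption: edge weight = number of documents in which two concepts
    co-occur. The abstract does not specify the weighting scheme.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration in which the random walk restarts only at the seeds."""
    nodes = list(graph)
    seeds = [s for s in seeds if s in graph]
    if not seeds:
        raise ValueError("no seed concept is present in the graph")
    # Teleport (restart) distribution: uniform over the context concepts.
    teleport = {n: 0.0 for n in nodes}
    for s in seeds:
        teleport[s] += 1.0 / len(seeds)
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            # Spread the rank of n over its neighbors, proportionally to weight.
            for m, w in graph[n].items():
                nxt[m] += damping * rank[n] * w / total
        rank = nxt
    return rank


def disambiguate(graph, context, candidates):
    """Pick the candidate sense with the highest rank given the context."""
    rank = personalized_pagerank(graph, context)
    return max(candidates, key=lambda c: rank.get(c, 0.0))


# Hypothetical toy corpus: each document is the list of concepts found in it.
DOCS = [
    ["common_cold", "virus", "fever"],
    ["common_cold", "cough", "virus"],
    ["cold_temperature", "hypothermia", "frostbite"],
    ["cold_temperature", "frostbite", "exposure"],
    ["fever", "cough", "infection"],
]
GRAPH = build_cooccurrence_graph(DOCS)

# The ambiguous term "cold" appears in a context mentioning virus and cough.
SENSE = disambiguate(GRAPH, ["virus", "cough"], ["common_cold", "cold_temperature"])
print(SENSE)
```

In this toy graph the context concepts are connected only to the illness-related component, so the temperature sense receives no rank and the illness sense is selected. Because the restart distribution depends on the context, the same graph can yield different rankings for different occurrences of the same ambiguous term.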

Keywords: Graph-based systems; Information extraction; Natural language processing; Unified medical language system; Unsupervised machine learning; Word sense disambiguation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Datasets as Topic
  • Knowledge Bases*
  • Natural Language Processing*
  • PubMed
  • Semantics*
  • Unified Medical Language System