FastHPOCR: pragmatic, fast, and accurate concept recognition using the human phenotype ontology

Tudor Groza; Dylan Gration; Gareth Baynam; Peter N Robinson

doi:10.1093/bioinformatics/btae406

Bioinformatics. 2024 Jul 1;40(7):btae406. doi: 10.1093/bioinformatics/btae406.

Authors

Tudor Groza¹ ² ³ ⁴, Dylan Gration⁵, Gareth Baynam¹ ² ⁵ ⁶, Peter N Robinson⁷ ⁸

Affiliations

¹ Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia.
² Telethon Kids Institute, Nedlands, WA 6009, Australia.
³ School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Bentley, WA 6102, Australia.
⁴ SingHealth Duke-NUS Institute of Precision Medicine, Singapore 169609, Singapore.
⁵ Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, Subiaco, WA 6008, Australia.
⁶ Faculty of Health and Medical Sciences, University of Western Australia, Crawley, WA 6009, Australia.
⁷ Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany.
⁸ The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States.

Abstract

Motivation: Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsically resource-intensive nature makes them unsuitable for realistic management of the phenotype CR lifecycle. Consequently, we propose to go back to basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data.

Results: We developed a dictionary-based approach that uses a large, pre-built collection of clusters of morphologically equivalent tokens to address lexical variability, and that makes the CR step more effective by restricting entity boundary detection strictly to candidates consisting of tokens that belong to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10 000 publication abstracts in 5 s.
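
The two ideas in the Results paragraph can be illustrated with a minimal Python sketch. It is not the FastHPOCR implementation or its API: the three-concept ontology, the hand-written token clusters, the order-insensitive matching and all function names (`normalise`, `build_index`, `annotate`) are assumptions made for illustration only. The real method relies on a large pre-built cluster collection and the full HPO.

```python
import re

# Toy ontology (assumption): a tiny subset of HPO ids and labels.
TOY_LABELS = {
    "HP:0001250": ["Seizure"],
    "HP:0000252": ["Microcephaly"],
    "HP:0001263": ["Global developmental delay"],
}

# Toy clusters (assumption): each morphological variant maps to one
# representative token. A real collection would be far larger.
TOY_CLUSTERS = {
    "seizures": "seizure",
    "delayed": "delay",
    "delays": "delay",
    "development": "developmental",
    "globally": "global",
}


def normalise(token):
    """Lowercase a token and replace it with its cluster representative."""
    token = token.lower()
    return TOY_CLUSTERS.get(token, token)


def build_index(labels):
    """Map each label's set of normalised tokens to its concept id."""
    index = {}
    for hpo_id, names in labels.items():
        for name in names:
            key = frozenset(normalise(t) for t in re.findall(r"[A-Za-z]+", name))
            index[key] = hpo_id
    return index


def annotate(text, index):
    """Return (concept id, matched text) pairs found in `text`."""
    vocab = set().union(*index)  # every normalised token used by some label
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"[A-Za-z]+", text)]

    # Boundary detection: only runs of consecutive ontology tokens are
    # candidates. Punctuation is ignored here, a simplification.
    runs, current = [], []
    for tok in tokens:
        if normalise(tok[0]) in vocab:
            current.append(tok)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)

    hits = []
    for run in runs:
        i = 0
        while i < len(run):
            # Try the longest candidate first, then shrink from the right.
            for j in range(len(run), i, -1):
                key = frozenset(normalise(t[0]) for t in run[i:j])
                if key in index:
                    hits.append((index[key], text[run[i][1]:run[j - 1][2]]))
                    i = j
                    break
            else:
                i += 1  # no label starts at this token
    return hits


if __name__ == "__main__":
    idx = build_index(TOY_LABELS)
    sentence = "The patient had seizures, microcephaly and delayed global development."
    for concept_id, span in annotate(sentence, idx):
        print(concept_id, span)
```

Because the dictionary is rebuilt directly from the labels, refreshing it after an ontology release only requires calling `build_index` again, which is the lifecycle property the Motivation paragraph emphasises.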

Availability and implementation: FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Biological Ontologies*
Humans
Natural Language Processing
Phenotype*
Software
