RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature

Katerina Nastou; Farrokh Mehryary; Tomoko Ohta; Jouni Luoma; Sampo Pyysalo; Lars Juhl Jensen

doi:10.1093/database/baae095

Database (Oxford). 2024 Sep 12:2024:baae095. doi: 10.1093/database/baae095.

Authors

Katerina Nastou¹, Farrokh Mehryary², Tomoko Ohta³, Jouni Luoma², Sampo Pyysalo², Lars Juhl Jensen¹

Affiliations

¹ Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3, Copenhagen 2200, Denmark.
² TurkuNLP Group, Department of Computing, University of Turku, Vesilinnantie 5, Turku 20014, Finland.
³ Textimi, 1-37-13 Kitazawa, Tokyo, Setagaya-ku 155-0031, Japan.

Abstract

In biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to support the extraction of multiple types of relations, particularly those involving proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome offers 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. The corpus is specifically designed to cover >40 relation types, a broader spectrum than those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. It both broadens the scope of detected relations and allows noteworthy accuracy in RE: a transformer-based model trained on this corpus achieved a promising F1-score (66.6%) for a task of this complexity, which supports the effectiveness of our approach in identifying and categorizing a wide array of biological relations. This result highlights RegulaTome's potential to contribute to the development of more sophisticated, efficient, and accurate RE systems for biomedical tasks. Finally, running the trained RE system on all PubMed abstracts and PMC Open Access full-text documents yielded >18 million extracted relations.
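
To make the reported F1-score concrete, the following is a minimal sketch of how a typed, directed relation could be represented and scored against gold annotations. It is an illustration only, not the evaluation procedure used for RegulaTome: the `Relation` fields, the label names, the entity names, and the exact-match criterion are all assumptions made here.

```python
from typing import NamedTuple, Set


class Relation(NamedTuple):
    """A typed, directed relation (hypothetical schema, not the corpus format)."""
    doc_id: str
    source: str    # entity the relation points from
    target: str    # entity the relation points to
    rel_type: str  # illustrative label, e.g. "Positive_regulation"


def f1_score(gold: Set[Relation], predicted: Set[Relation]) -> float:
    """F1 over exact matches: document, direction, and type must all agree."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data with made-up entities; the second prediction has its direction reversed.
gold = {
    Relation("doc1", "A", "B", "Positive_regulation"),
    Relation("doc1", "B", "C", "Positive_regulation"),
    Relation("doc2", "D", "E", "Negative_regulation"),
}
predicted = {
    Relation("doc1", "A", "B", "Positive_regulation"),
    Relation("doc1", "C", "B", "Positive_regulation"),  # wrong direction
    Relation("doc2", "D", "E", "Negative_regulation"),
}

score = f1_score(gold, predicted)
print(f"F1 = {score:.3f}")
```

Under this assumed exact-match criterion, a prediction with the right entities and type but the wrong direction counts as both a false positive and a false negative, which is one reason directed, typed RE is harder than detecting untyped entity pairs.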

MeSH terms

Biomedical Research
Data Mining* / methods
Databases, Factual
Humans
Publications
