A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

PLoS Comput Biol. 2018 Feb 15;14(2):e1005962. doi: 10.1371/journal.pcbi.1005962. eCollection 2018 Feb.

Abstract

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 200 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
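
The evaluation idea in the abstract (tag entities, score candidate associations, compare against a gold standard) can be illustrated with a minimal sketch. This is not the authors' pipeline: the dictionary, the documents, the gold pairs, and the use of a raw co-mention count as the score are all toy assumptions made for illustration, and the function names `comention_counts` and `roc_auc` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def comention_counts(documents, dictionary):
    """Count, per unordered entity pair, how many documents mention both."""
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        # Naive dictionary lookup standing in for a real NER system.
        found = sorted({entity for name, entity in dictionary.items()
                        if name in tokens})
        counts.update(combinations(found, 2))
    return counts


def roc_auc(scored_pairs, gold_positives):
    """AUC as the probability that a gold pair outscores a non-gold pair.

    Ties contribute 0.5, which matches the area under the ROC curve.
    """
    pos = [s for p, s in scored_pairs.items() if p in gold_positives]
    neg = [s for p, s in scored_pairs.items() if p not in gold_positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs (assumptions, not data from the study).
dictionary = {"tp53": "TP53", "p53": "TP53", "mdm2": "MDM2",
              "brca1": "BRCA1", "actb": "ACTB"}
documents = [
    "mdm2 binds p53 and promotes its degradation",
    "tp53 and mdm2 form a feedback loop",
    "brca1 and tp53 were both measured with actb as control",
    "actb was used as loading control for mdm2",
]
# Gold pairs are stored as alphabetically sorted tuples, like the counts keys.
gold = {("MDM2", "TP53"), ("BRCA1", "TP53")}

counts = comention_counts(documents, dictionary)
auc = roc_auc(counts, gold)
print(f"AUC on toy data: {auc:.2f}")
```

Running the same scoring once on abstracts and once on full texts, then comparing the two AUC values against one gold standard, is the kind of comparison the abstract describes; a real system would replace the whitespace tokenizer and the raw count with a proper tagger and a more robust score.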

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Abstracting and Indexing*
  • Area Under Curve
  • Computational Biology / methods
  • Data Mining / methods*
  • False Positive Reactions
  • Genes
  • Information Storage and Retrieval*
  • MEDLINE*
  • Periodicals as Topic
  • Proteins / genetics
  • ROC Curve
  • Software
  • Terminology as Topic

Substances

  • Proteins

Grants and funding

This work was funded by a grant from the Danish e-Infrastructure Cooperation (ActionableBiomarkersDK, https://www.deic.dk/) (SB), and by the Novo Nordisk Foundation (grant agreement NNF14CC0001, http://novonordiskfonden.dk/) (SB, LJJ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.