Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation

Bioinformatics. 2003:19 Suppl 1:i91-4. doi: 10.1093/bioinformatics/btg1011.

Abstract

Motivation: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.

Results: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the symmetric Kullback-Leibler divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should help curators understand the classification results better. It is also valuable for fine-tuning the linguistic pre-processing of documents, which in turn can improve overall classifier performance.
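
To illustrate how a symmetric Kullback-Leibler divergence can rank terms by discriminating power, the following is a minimal sketch and not the authors' PLC implementation. It assumes two sets of tokenised documents (relevant and other), estimates a smoothed unigram distribution for each, and uses the identity KL(p||q) + KL(q||p) = sum over terms of (p_t - q_t) log(p_t / q_t) to read off a per-term contribution. The function names, the add-one smoothing parameter `alpha`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_distribution(docs, vocab, alpha=1.0):
    """Smoothed unigram distribution over vocab (alpha is an assumed smoothing constant)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def symmetric_kl_contributions(p, q):
    """Per-term share of KL(p||q) + KL(q||p); each share is non-negative."""
    return {t: (p[t] - q[t]) * math.log(p[t] / q[t]) for t in p}


def top_discriminating_terms(relevant_docs, other_docs, k=5):
    """Return the k terms with the largest symmetric-KL contribution."""
    vocab = sorted({tok for doc in relevant_docs + other_docs for tok in doc})
    p = term_distribution(relevant_docs, vocab)
    q = term_distribution(other_docs, vocab)
    contrib = symmetric_kl_contributions(p, q)
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy, hypothetical token lists standing in for pre-processed abstracts.
relevant = [["mutation", "disease", "protein"], ["disease", "variant", "protein"]]
other = [["structure", "protein", "binding"], ["protein", "expression", "binding"]]

for term, score in top_discriminating_terms(relevant, other, k=3):
    print(f"{term}\t{score:.4f}")
```

In this toy data a term such as "protein", which is equally frequent in both sets, contributes nothing to the divergence, while terms concentrated in one set rank highest. A list of this kind is what a curator could inspect to see which words drive a classification.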

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Abstracting and Indexing / methods*
  • Algorithms
  • Artificial Intelligence
  • Databases, Protein*
  • Documentation / methods
  • Models, Statistical*
  • Natural Language Processing*
  • Pattern Recognition, Automated
  • Periodicals as Topic / classification*
  • Proteins / chemistry*
  • Proteins / genetics
  • PubMed*

Substances

  • Proteins