Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation

Pavel B Dobrokhotov; Cyril Goutte; Anne-Lise Veuthey; Eric Gaussier

doi:10.1093/bioinformatics/btg1011

Bioinformatics. 2003;19 Suppl 1:i91-4. doi: 10.1093/bioinformatics/btg1011.

Authors

Pavel B Dobrokhotov¹, Cyril Goutte, Anne-Lise Veuthey, Eric Gaussier

Affiliation

¹ Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet - CH-1211 Geneva 4, Switzerland. Pavel.Dobrokhotov@isb-sib.ch

PMID: 12855443

Abstract

Motivation: Searching for relevant publications for manual database annotation is a tedious task. In this paper, we apply a combination of Natural Language Processing (NLP) and probabilistic classification to re-rank documents returned by PubMed according to their relevance to Swiss-Prot annotation, and to identify significant terms in the documents.

Results: With a Probabilistic Latent Categoriser (PLC) we obtained 69% recall and 59% precision for relevant documents in a representative query. As the PLC technique provides the relative contribution of each term to the final document score, we used the Kullback-Leibler symmetric divergence to determine the most discriminating words for Swiss-Prot medical annotation. This information should allow curators to understand classification results better. It also has great value for fine-tuning the linguistic pre-processing of documents, which in turn can improve the overall classifier performance.
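The term-ranking idea described in the Results can be illustrated with a minimal sketch. The abstract does not give the exact formulation used with the PLC, so the code below assumes the standard symmetric Kullback-Leibler divergence between two term distributions, KL(p||q) + KL(q||p) = sum over terms of (p - q) * log(p / q), and ranks terms by their individual summands. The function names, the smoothing constant `epsilon`, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math


def symmetric_kl_contributions(p_relevant, p_other, epsilon=1e-9):
    """Per-term summands of the symmetric KL divergence between two
    term distributions given as {term: probability} dictionaries.

    Assumption: a small epsilon is added to avoid log(0) for terms
    missing from one distribution; the paper's smoothing is not stated.
    """
    contributions = {}
    for term in set(p_relevant) | set(p_other):
        p = p_relevant.get(term, 0.0) + epsilon
        q = p_other.get(term, 0.0) + epsilon
        # KL(p||q) + KL(q||p) simplifies term by term to (p - q) * log(p / q),
        # which is never negative.
        contributions[term] = (p - q) * math.log(p / q)
    return contributions


def top_terms(contributions, k=3):
    """Return the k terms with the largest divergence contributions."""
    return sorted(contributions, key=contributions.get, reverse=True)[:k]


# Toy, made-up term probabilities for "relevant" and "other" documents.
p_relevant = {"mutation": 0.30, "disease": 0.25, "patient": 0.20,
              "protein": 0.15, "structure": 0.10}
p_other = {"mutation": 0.05, "disease": 0.05, "patient": 0.05,
           "protein": 0.25, "structure": 0.60}

contributions = symmetric_kl_contributions(p_relevant, p_other)
divergence = sum(contributions.values())  # total symmetric KL divergence
ranked = top_terms(contributions)
print(ranked)
```

Because each summand is non-negative, a term scores highly whether it is over-represented in relevant documents or in the others; the sign of p - q indicates which class the term favours, which is the kind of information a curator could use to interpret a classification.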

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't
Validation Study

MeSH terms

Abstracting and Indexing / methods*
Algorithms
Artificial Intelligence
Databases, Protein*
Documentation / methods
Models, Statistical*
Natural Language Processing*
Pattern Recognition, Automated
Periodicals as Topic / classification*
Proteins / chemistry*
Proteins / genetics
PubMed*

Substances

Proteins