PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing

Gleb Goussarov; Ilse Cleenwerck; Mohamed Mysara; Natalie Leys; Pieter Monsieurs; Guillaume Tahon; Aurélien Carlier; Peter Vandamme; Rob Van Houdt

doi:10.1093/bioinformatics/btz964

Bioinformatics. 2020 Apr 15;36(8):2337-2344. doi: 10.1093/bioinformatics/btz964.

Affiliations

¹ Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium.
² Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Department of Biochemistry and Microbiology, Faculty of Sciences, Ghent University, Ghent, Belgium.
³ LIPM, Université de Toulouse, INRAE, CNRS, Castanet-Tolosan, France.

Abstract

Motivation: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short-oligonucleotide-based methods offer a faster alternative, but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.

Results: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short-oligonucleotide-based method TETRA and average nucleotide identity for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable to large-scale analyses.

Availability and implementation: The method introduced here was implemented, together with other existing methods, in GenDisCal, a dependency-free program written in C and available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). A Java-based graphical user interface that acts as a wrapper for the software is also available.

Supplementary information: Supplementary data are available at Bioinformatics online.
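
Illustrative sketch: The general idea of comparing genomes by short-oligonucleotide frequencies, as opposed to alignment, can be shown with a minimal Python example. It is not the PaSiT distance, the TETRA statistic, or any part of GenDisCal; the function names, the use of plain relative frequencies, and the Euclidean distance are all assumptions chosen for illustration. Each sequence is reduced to a fixed-length vector of k-mer frequencies (k = 4 for tetranucleotides, k = 6 for hexanucleotides), counted on both strands, and two vectors are then compared directly.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Hypothetical helper names; none of these come from GenDisCal.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmer_frequencies(seq, k=4):
    """Relative k-mer frequencies over both strands, in a fixed k-mer order.

    k-mers containing anything other than A, C, G or T are skipped.
    The result always has 4**k entries (256 for k=4, 4096 for k=6).
    """
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[m] / total for m in all_kmers]


def frequency_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency vectors.

    Assumption: Euclidean distance is used only as a simple stand-in;
    the published method defines its own distance.
    """
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

Because each genome is summarized once as a short vector, every pairwise comparison afterwards costs a fixed number of arithmetic operations regardless of genome length, which is what makes this family of methods attractive for large-scale analyses.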

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bacteria / genetics
Genome, Bacterial
Genomics*
Oligonucleotides
Software*

Substances

Oligonucleotides
