Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads

Bioinformatics. 2012 Nov 1;28(21):2732-7. doi: 10.1093/bioinformatics/bts482. Epub 2012 Sep 1.

Abstract

Motivation: The restriction-site associated DNA sequencing (RAD-seq) method takes full advantage of next-generation sequencing technology. By clustering paired-end short reads into groups that each carry a unique tag, the RAD-seq assembly problem is divided into subproblems. Clustering and assembling millions of RAD-seq reads quickly and accurately in the presence of sequencing errors, varying levels of heterozygosity and repetitive sequences remains challenging.

Results: Rainbow was developed to provide an ultra-fast and memory-efficient solution for clustering and assembling short reads produced by RAD-seq. First, Rainbow clusters reads using a spaced seed method. Next, it applies a strategy similar to heterozygote calling to divide potential groups into haplotypes in a top-down manner. Then, along a guided tree, it iteratively merges sibling leaves in a bottom-up manner if they are similar enough, where similarity is defined by comparing the second reads of a RAD segment. This approach aims to collapse heterozygotes while discriminating repetitive sequences. Finally, Rainbow uses a greedy algorithm to locally assemble the merged reads into contigs. Rainbow outputs not only the optimal but also suboptimal assembly results. Based on simulated data and a real guppy RAD-seq dataset, we show that Rainbow is more competent than the other tools in dealing with RAD-seq data.
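
To make the first step more concrete, the following is a minimal Python sketch of spaced-seed clustering in general. It is not Rainbow's implementation (which is written in C and whose seed design is not described in this abstract). The mask `SEED_MASK`, the helper names `spaced_key` and `cluster_reads`, and the rule that two reads join a cluster when they share one seed key at the same offset are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical seed mask: '1' = position must match, '0' = mismatch tolerated.
SEED_MASK = "1101101101"


def spaced_key(read, offset, mask=SEED_MASK):
    """Return the bases of `read` under the '1' positions of `mask` at `offset`."""
    return "".join(read[offset + i] for i, bit in enumerate(mask) if bit == "1")


def cluster_reads(reads, mask=SEED_MASK):
    """Group read indices that share at least one spaced-seed key (union-find)."""
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (offset, key) -> index of the first read that produced it
    for idx, read in enumerate(reads):
        # Non-overlapping seed windows, anchored at the start of the read.
        for offset in range(0, len(read) - len(mask) + 1, len(mask)):
            key = (offset, spaced_key(read, offset, mask))
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


reads = [
    "ACGTACGTACGTACGTACGT",
    "ACATACGTACGTTCGTACGT",  # differs from read 0 only at wildcard positions
    "TTTTTTTTTTGGGGGGGGGG",  # unrelated sequence
    "CCGTACGTACCTACGTACGT",  # differs from read 0 at a required position in each window
]
print(cluster_reads(reads))
```

In this sketch, mismatches that fall under a '0' in the mask do not change the seed key, so reads carrying a sequencing error or a SNP at such a position can still land in the same cluster. Anchoring the windows at the read start reflects the fact that RAD-seq first reads begin at the restriction site.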

Availability: Rainbow, with source code in C, is freely available at http://sourceforge.net/projects/bio-rainbow/files/

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Abstracting and Indexing / methods
  • Algorithms*
  • Base Sequence
  • Cluster Analysis
  • Evolution, Molecular
  • Genetics, Population
  • Haplotypes
  • Humans
  • Models, Genetic
  • Polymorphism, Single Nucleotide / genetics
  • Sequence Analysis, DNA / instrumentation*
  • Sequence Analysis, DNA / methods
  • Software*