elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling

PLoS One. 2015 Jul 16;10(7):e0132868. doi: 10.1371/journal.pone.0132868. eCollection 2015.

Abstract

elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, and reordering contigs, while producing identical results. What sets elPrep apart is its software architecture, which executes a preparation pipeline in a single pass through the data, no matter how many preparation steps the pipeline contains. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly reduce execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1 hour 40 minutes, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23 GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundreds of hours of computing time, and thus substantially reduce analysis time and cost.
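
The single-pass idea described in the abstract can be illustrated with a minimal Python sketch. This is not elPrep's implementation: the record layout (plain dictionaries with "flag", "contig", and "pos" keys) and the function names drop_unmapped, add_read_group, and prepare are assumptions made for illustration only. The sketch shows several per-record preparation steps being applied while the data is traversed once in memory, instead of each step reading and writing its own intermediate file.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical, simplified alignment record; real SAM/BAM records carry more fields.
Record = dict
# A step returns the (possibly modified) record, or None to discard it.
Step = Callable[[Record], Optional[Record]]


def drop_unmapped(rec: Record) -> Optional[Record]:
    """Discard reads whose SAM flag has the 'unmapped' bit (0x4) set."""
    return None if rec["flag"] & 0x4 else rec


def add_read_group(group: str) -> Step:
    """Build a step that tags every record with the given read group."""
    def step(rec: Record) -> Optional[Record]:
        return {**rec, "rg": group}
    return step


def prepare(records: Iterable[Record], steps: List[Step]) -> List[Record]:
    """Apply all per-record steps in one traversal, then sort in memory."""
    kept = []
    for rec in records:              # single pass over the input
        for step in steps:           # all steps run back to back on this record
            rec = step(rec)
            if rec is None:          # record was filtered out by this step
                break
        else:
            kept.append(rec)
    # Coordinate sort on the in-memory result, with no intermediate file.
    kept.sort(key=lambda r: (r["contig"], r["pos"]))
    return kept


if __name__ == "__main__":
    reads = [
        {"name": "r1", "flag": 0, "contig": "chr2", "pos": 5},
        {"name": "r2", "flag": 4, "contig": "chr1", "pos": 1},
        {"name": "r3", "flag": 0, "contig": "chr1", "pos": 9},
    ]
    for rec in prepare(reads, [drop_unmapped, add_read_group("sample1")]):
        print(rec["name"], rec["contig"], rec["pos"], rec["rg"])
```

In a conventional tool chain, each of these steps would typically read an input file and write an output file, so a five-step pipeline traverses the data five times. The sketch keeps everything in memory and adds steps without adding passes, which is the property the abstract attributes to elPrep's architecture. It omits multithreading and duplicate marking.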

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Benchmarking
  • Contig Mapping
  • Exome*
  • Genome, Human*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Polymorphism, Single Nucleotide
  • Sequence Alignment / economics*
  • Sequence Alignment / methods
  • Sequence Alignment / statistics & numerical data
  • Software*

Grants and funding

This work is funded by Intel, Janssen Pharmaceutica, and by the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT): IWT O&O Project 130406. Charlotte Herzeel is an employee of IMEC vzw, Belgium; Pascal Costanza is an employee of Intel Corporation NV/SA, Belgium; Dries Decap and Jan Fostier are employees of iMinds vzw, Ghent, Belgium; Joke Reumers is an employee of Janssen Pharmaceutica NV/SA, Belgium. All authors are also affiliated with ExaScience Life Lab, which is a consortium of companies and universities. These companies provided support in the form of salaries for these authors but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific role of each author is articulated in the “author contributions” section.