Optimized Sequence Library Design for Efficient In Vitro Interaction Mapping

Yaron Orenstein; Robert Puccinelli; Ryan Kim; Polly Fordyce; Bonnie Berger

doi:10.1016/j.cels.2017.07.006

Cell Syst. 2017 Sep 27;5(3):230-236.e5. doi: 10.1016/j.cels.2017.07.006.

Authors

Yaron Orenstein¹, Robert Puccinelli², Ryan Kim³, Polly Fordyce⁴, Bonnie Berger⁵

Affiliations

¹ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
² Department of Genetics, Stanford University, Stanford, CA 94305, USA.
³ Research Science Institute, Center for Excellence in Education, McLean, VA 22102, USA.
⁴ Department of Genetics, Stanford University, Stanford, CA 94305, USA; Department of Bioengineering, Stanford University, Stanford, CA 94305, USA; ChEM-H Institute, Stanford University, Stanford, CA 94305, USA; Chan Zuckerberg Biohub, San Francisco, CA 94158, USA.
⁵ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Electronic address: bab@mit.edu.

Abstract

Sequence libraries that cover all k-mers enable universal, unbiased measurements of binding to both oligonucleotides and peptides. While the number of k-mers grows exponentially in k, space on all experimental platforms is limited. Here, we shrink k-mer library sizes by using joker characters, which represent all characters in the alphabet simultaneously. We present the JokerCAKE (joker covering all k-mers) algorithm for generating a short sequence such that each k-mer appears at least p times with at most one joker character per k-mer. By running our algorithm on a range of parameters and alphabets, we show that JokerCAKE produces near-optimal sequences. Moreover, through comparison with data from hundreds of DNA-protein binding experiments and with new experimental results for both standard and JokerCAKE libraries, we establish that accurate binding scores can be inferred for high-affinity k-mers using JokerCAKE libraries. JokerCAKE libraries allow researchers to search a significantly larger sequence space using the same number of experimental measurements and at the same cost.
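The covering condition stated in the abstract (each k-mer appears at least p times, counting only windows with at most one joker) can be illustrated with a small checker. This is a minimal sketch for verifying a given sequence, not the JokerCAKE algorithm itself, which the abstract does not describe in detail. The joker symbol `N`, the function names `kmer_coverage` and `covers_all`, and the default DNA alphabet are assumptions made for illustration.

```python
from itertools import product

JOKER = "N"  # assumed joker symbol; the abstract does not name one


def kmer_coverage(seq, k, alphabet="ACGT", joker=JOKER):
    """Count how often each k-mer over `alphabet` is covered by `seq`.

    A window without a joker covers one k-mer. A window with exactly one
    joker covers len(alphabet) k-mers, one per substitution. Windows with
    two or more jokers are ignored, matching the "at most one joker
    character per k-mer" condition.
    """
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        n_jokers = window.count(joker)
        if n_jokers == 0:
            if window in counts:  # skip windows with unknown symbols
                counts[window] += 1
        elif n_jokers == 1:
            j = window.index(joker)
            for c in alphabet:
                expanded = window[:j] + c + window[j + 1:]
                if expanded in counts:
                    counts[expanded] += 1
    return counts


def covers_all(seq, k, p=1, alphabet="ACGT", joker=JOKER):
    """Return True if every k-mer is covered at least p times."""
    coverage = kmer_coverage(seq, k, alphabet, joker)
    return all(v >= p for v in coverage.values())


# Joker-free sequence covering all 16 DNA 2-mers (17 characters).
print(covers_all("AACAGATCCGCTGGTTA", k=2))
# Binary alphabet: 4 characters with jokers cover all four 2-mers,
# whereas a joker-free linear sequence needs 5 (for example "00110").
print(covers_all("0N1N", k=2, alphabet="01"))
```

On the binary toy example, the joker sequence is one character shorter than the joker-free one. That is the kind of saving the abstract refers to, although the sizes achieved by JokerCAKE on real alphabets are reported in the paper, not reproduced here.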

Keywords: de Bruijn graph; microarray design; sequence libraries.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Computational Biology / methods*
DNA-Binding Proteins
Gene Library
Oligonucleotides / chemical synthesis
Oligonucleotides / genetics
Protein Interaction Mapping / methods*
Sequence Analysis, DNA / methods*
Software

Substances

DNA-Binding Proteins
Oligonucleotides

Grants and funding

R01 GM081871/GM/NIGMS NIH HHS/United States