CEMIG: prediction of the cis-regulatory motif using the de Bruijn graph from ATAC-seq

Brief Bioinform. 2023 Nov 22;25(1):bbad505. doi: 10.1093/bib/bbad505.

Abstract

Sequence motif discovery algorithms enhance the identification of novel deoxyribonucleic acid sequences with pivotal biological significance, especially transcription factor (TF)-binding motifs. The advent of assay for transposase-accessible chromatin using sequencing (ATAC-seq) has broadened the toolkit for motif characterization. Nonetheless, prevailing computational approaches have focused on delineating TF-binding footprints, with motif discovery receiving less attention. Herein, we present Cis rEgulatory Motif Influence using de Bruijn Graph (CEMIG), an algorithm leveraging de Bruijn and Hamming distance graph paradigms to predict and map motif sites. Assessment on 129 ATAC-seq datasets from the Cistrome Data Browser demonstrates CEMIG's exceptional performance, surpassing three established methodologies on four evaluative metrics. CEMIG accurately identifies both cell-type-specific and common TF motifs within GM12878 and K562 cell lines, demonstrating its comparative genomic capabilities in the identification of evolutionary conservation and cell-type specificity. In-depth transcriptional and functional genomic studies have validated the functional relevance of CEMIG-identified motifs across various cell types. CEMIG is available at https://github.com/OSU-BMBL/CEMIG, developed in C++ to ensure cross-platform compatibility with Linux, macOS and Windows operating systems.
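The two graph structures named in the abstract can be illustrated with a minimal sketch. This is not CEMIG's implementation: the function names, the toy sequences, the k-mer length of 4 and the mismatch threshold of 1 are all assumptions chosen for illustration. The sketch only shows what a de Bruijn graph over k-mers and a Hamming distance graph over the same k-mers look like on a pair of short sequences.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def de_bruijn_graph(seqs, k):
    """Nodes are (k-1)-mers; each k-mer occurrence adds weight 1 to the
    edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(Counter)
    for seq in seqs:
        for kmer in kmers(seq, k):
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def hamming_graph(kmer_set, max_dist=1):
    """Nodes are k-mers; two k-mers are linked when they differ in at
    most max_dist positions (illustrative threshold, an assumption)."""
    adj = {km: set() for km in kmer_set}
    for a, b in combinations(sorted(kmer_set), 2):
        if hamming(a, b) <= max_dist:
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Hypothetical toy "peak" sequences sharing the core GTGAC.
toy_peaks = ["ACGTGACC", "TTGTGACA"]
k = 4

counts = Counter(km for s in toy_peaks for km in kmers(s, k))
dbg = de_bruijn_graph(toy_peaks, k)
hdg = hamming_graph(counts.keys(), max_dist=1)
```

In this toy input, the edge from GTG to TGA in `dbg` carries weight 2 because the k-mer GTGA occurs in both sequences, while `hdg` links near-identical k-mers such as GACC and GACA. Heavily weighted paths and tightly linked k-mer groups are the kind of signal a graph-based motif finder can build on; how CEMIG actually scores and clusters them is described in the paper, not here.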

Keywords: algorithms; chromatin accessibility; cluster analysis; graph theory; motif finding.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Benchmarking
  • Biological Evolution
  • Cell Line
  • Chromatin Immunoprecipitation Sequencing*