A novel computational framework for genome-scale alternative transcription units prediction

Brief Bioinform. 2021 Nov 5;22(6):bbab162. doi: 10.1093/bib/bbab162.

Abstract

Alternative transcription units (ATUs) are dynamically encoded under different conditions and display overlapping patterns (sharing one or more genes) under a specific condition in bacterial genomes. Genome-scale identification of ATUs is essential for studying the emergence of human diseases caused by bacterial organisms. However, it is unrealistic to identify all ATUs using experimental techniques because of the complexity and dynamic nature of ATUs. Here, we present a first-of-its-kind computational framework, named SeqATU, for genome-scale ATU prediction based on next-generation RNA-Seq data. The framework utilizes a convex quadratic programming model to seek an optimum expression combination of all of the to-be-identified ATUs. Compared with benchmark ATUs derived from third-generation RNA-Seq data, the ATUs predicted in Escherichia coli reached a precision of 0.77/0.74 and a recall of 0.75/0.76 on the two RNA-Seq datasets. In addition, the proportion of 5'- or 3'-end genes of the predicted ATUs having documented transcription factor binding sites and transcription termination sites was three times greater than that of genes not located at a 5'- or 3'-end. We further evaluated the predicted ATUs by Gene Ontology and Kyoto Encyclopedia of Genes and Genomes functional enrichment analyses. The results suggested that gene pairs frequently encoded in the same ATUs are more functionally related than those that can belong to two distinct ATUs. Overall, these results demonstrated the high reliability of the predicted ATUs. We expect that the new insights derived by SeqATU will not only improve the understanding of the transcription mechanism of bacteria but also guide the reconstruction of a genome-scale transcriptional regulatory network.
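
The idea of seeking an expression combination of overlapping ATUs with a convex quadratic program can be illustrated with a toy example. The sketch below is not the SeqATU model itself, whose objective and constraints are not given in this abstract. It assumes a simplified setting in which each gene's observed signal is the sum of the expression levels of the candidate ATUs containing that gene, and it solves the resulting non-negative least-squares problem (a convex QP) by projected gradient descent. The function name fit_atu_expression, the membership matrix, and the signal values are all hypothetical.

```python
def fit_atu_expression(membership, gene_signal, iterations=2000):
    """Minimise 0.5 * ||A x - b||^2 subject to x >= 0 (a convex QP).

    membership[g][k] is 1 if hypothetical candidate ATU k contains gene g,
    else 0 (matrix A); gene_signal[g] is the observed per-gene signal (b).
    Returns one non-negative expression level per candidate ATU.
    """
    n_genes = len(membership)
    n_atus = len(membership[0])

    # The squared Frobenius norm of A is an upper bound on the largest
    # eigenvalue of A^T A, so 1 / that value is a safe step size.
    step = 1.0 / sum(v * v for row in membership for v in row)

    x = [0.0] * n_atus
    for _ in range(iterations):
        # Residual A x - b, one entry per gene.
        residual = [
            sum(membership[g][k] * x[k] for k in range(n_atus)) - gene_signal[g]
            for g in range(n_genes)
        ]
        # Gradient A^T (A x - b), one entry per candidate ATU.
        gradient = [
            sum(membership[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_atus)
        ]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, x[k] - step * gradient[k]) for k in range(n_atus)]
    return x


# Toy operon with genes g1, g2, g3 and three overlapping candidate ATUs:
# {g1, g2, g3}, {g2, g3} and {g3}. All values are made up for illustration.
membership = [
    [1, 0, 0],  # g1 is only in the first candidate
    [1, 1, 0],  # g2 is in the first and second candidates
    [1, 1, 1],  # g3 is in all three candidates
]
levels = fit_atu_expression(membership, [10.0, 25.0, 25.0])
```

In this toy case the fitted levels come out close to 10, 15 and 0, meaning that the signal is explained by the first two candidates alone and the third receives essentially no expression. The non-negativity constraint is what keeps the fit biologically interpretable when the observed signals are noisy.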

Keywords: RNA-Seq; alternative transcription units; bacterial transcription regulation; convex quadratic programming; non-uniform read distribution.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Bacteria / genetics
  • Computational Biology / methods*
  • Databases, Genetic
  • Escherichia coli / genetics
  • Genome, Bacterial
  • Genome-Wide Association Study / methods*
  • Genomics / methods
  • Humans
  • RNA Isoforms*
  • RNA, Messenger / genetics
  • RNA-Seq
  • Single-Cell Analysis / methods
  • Terminator Regions, Genetic
  • Transcription Initiation Site
  • Transcription, Genetic*

Substances

  • RNA Isoforms
  • RNA, Messenger