Benchmarking recent computational tools for DNA-binding protein identification

Brief Bioinform. 2024 Nov 22;26(1):bbae634. doi: 10.1093/bib/bbae634.

Abstract

Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various other cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools, as well as traditional tools such as ScanProsite, BLAST, and HMMER, for identifying DBPs. We highlight the data leakage issue in conventional datasets, which leads to inflated performance estimates. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions to these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.
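As an illustration of the consensus idea described above, the following minimal sketch combines three binary predictions per protein. The abstract does not state the combination rule, so the two-of-three majority vote used here is an assumption, and the function name `consensus_vote`, the dictionary inputs, and the protein identifiers are hypothetical and not taken from the released software.

```python
from typing import Dict


def consensus_vote(
    tool_a: Dict[str, bool],
    tool_b: Dict[str, bool],
    blast: Dict[str, bool],
) -> Dict[str, bool]:
    """Label a protein as a DBP when at least two of three predictors say so.

    Assumption: a simple majority vote. The actual rule used by the
    authors' consensus software may differ.
    """
    # Only score proteins for which all three predictors produced a call.
    shared_ids = tool_a.keys() & tool_b.keys() & blast.keys()
    return {
        pid: (int(tool_a[pid]) + int(tool_b[pid]) + int(blast[pid])) >= 2
        for pid in sorted(shared_ids)
    }


# Hypothetical predictions keyed by made-up protein identifiers.
tool_a_calls = {"P1": True, "P2": False, "P3": True}
tool_b_calls = {"P1": True, "P2": False, "P3": False}
blast_calls = {"P1": False, "P2": True, "P3": False}

consensus = consensus_vote(tool_a_calls, tool_b_calls, blast_calls)
```

With three voters, a two-vote threshold means that no single predictor, including the BLAST-based one, can decide a label on its own.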

Keywords: BLAST; CD-HIT; DNA-binding protein; deep learning; machine learning; motif.

MeSH terms

  • Algorithms
  • Benchmarking*
  • Computational Biology* / methods
  • DNA-Binding Proteins* / genetics
  • DNA-Binding Proteins* / metabolism
  • Humans
  • Software*

Substances

  • DNA-Binding Proteins