BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models

Yanhua Qiao; Xiaolei Zhu; Haipeng Gong

doi:10.1093/bioinformatics/btab712

Bioinformatics. 2022 Jan 12;38(3):648-654. doi: 10.1093/bioinformatics/btab712.

Authors

Yanhua Qiao¹, Xiaolei Zhu², Haipeng Gong¹

Affiliations

¹ School of Life Sciences, Tsinghua University, Beijing 100084, China.
² School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China.

PMID: 34643684
DOI: 10.1093/bioinformatics/btab712

Abstract

Motivation: As one of the most important post-translational modifications (PTMs), protein lysine crotonylation (Kcr) is involved in important physiological activities, such as cell differentiation and metabolism, and has attracted wide attention. However, experimental methods for Kcr identification are expensive and time-consuming. Computational methods, by contrast, can predict Kcr sites in silico with high efficiency and low cost.

Results: In this study, we proposed a novel predictor, BERT-Kcr, for protein Kcr site prediction, developed using a transfer learning method with pre-trained bidirectional encoder representations from transformers (BERT) models. These models were originally used for natural language processing (NLP) tasks, such as sentence classification. Here, we converted each amino acid into a word and used the resulting words as the input to the pre-trained BERT model. The features encoded by BERT were extracted and then fed to a BiLSTM network to build our final model. Compared with models built with other machine learning and deep learning classifiers, BERT-Kcr achieved the best performance, with an AUROC of 0.983 in 10-fold cross-validation. Further evaluation on the independent test set indicates that BERT-Kcr outperforms the state-of-the-art model Deep-Kcr, with an improvement of about 5% in AUROC. The results of our experiment indicate that the direct use of sequence information and advanced pre-trained NLP models could be an effective way to identify PTM sites in proteins.
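The step of turning each amino acid into a word can be illustrated with a minimal sketch. It is not the authors' code: the function names, the flank size of 15 residues, and the "X" padding symbol are assumptions chosen for illustration, since the abstract does not state the window length or padding scheme. The sketch cuts a lysine-centered window from a protein sequence and writes it as a sequence of one-residue words framed by the [CLS] and [SEP] tokens that BERT models conventionally expect.

```python
from typing import List, Tuple

PAD = "X"  # assumed padding symbol for windows that extend past the termini


def lysine_windows(sequence: str, flank: int = 15) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in the sequence.

    Each window holds `flank` residues on either side of the lysine,
    so its length is 2 * flank + 1. The flank size is an assumption.
    """
    seq = sequence.upper()
    padded = PAD * flank + seq + PAD * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == "K":
            # Residue i of the original sequence sits at i + flank in `padded`,
            # so the slice starting at i is centered on it.
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


def to_bert_words(window: str) -> List[str]:
    """Treat each residue as one word, framed by [CLS] and [SEP]."""
    return ["[CLS]"] + list(window) + ["[SEP]"]


def to_sentence(window: str) -> str:
    """Space-separate residues so a word-level tokenizer sees one word each."""
    return " ".join(window)
```

For a toy sequence "MKTAYK" with flank=2, lysine_windows yields one window per lysine, each with K at its center. The word list or the space-separated sentence would then be passed to a pre-trained BERT tokenizer and model, and the per-token features to a BiLSTM classifier; those stages depend on external libraries and are left out of this sketch.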

Availability and implementation: The BERT-Kcr model is publicly available at http://zhulab.org.cn/BERT-Kcr_models/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Language
Lysine* / metabolism
Machine Learning*
Natural Language Processing
Protein Processing, Post-Translational

Substances

Lysine

Grants and funding

21403002/National Natural Science Foundation of China