Improved Machine Learning-Based Model for the Classification of Off-Targets in the CRISPR/Cpf1 System

ACS Omega. 2023 Nov 17;8(48):45578-45588. doi: 10.1021/acsomega.3c05691. eCollection 2023 Dec 5.

Abstract

Targeted nucleases are widely used to alter specific locations in the genome with precision. These endonucleases enable efficient genome editing through a designed guide RNA (gRNA) containing a 20-nucleotide target sequence. A gRNA binds preferentially to its target location, but the on- and off-target activities of gRNAs vary widely. Off-target activity arising from mismatch tolerance in the CRISPR-Cas system is a major factor limiting its clinical application. Ensuring on-target efficiency and minimizing off-targets for a target sequence are the major objectives of this study. A pipeline was designed to predict potential off-target sites in the human genome for a target sequence, and a multilayer perceptron (MLP) was used to predict the cleavage efficiency of those sites. The MLP-based classifier was trained on sequence features and base-dependent binding energy features for AsCpf1 and LbCpf1 to predict target efficiencies. Positional preferences of nucleotides, the distribution of mismatches, and classification-dependent feature importance were also compared between high-activity and low-activity off-targets. The positional preference analysis revealed that, in high-activity off-targets, thymine is strongly disfavored at positions adjacent to the protospacer adjacent motif (PAM), whereas guanine is favored. The mismatch distribution analysis revealed that mismatches were most prominent in the trunk region (positions 16, 17, and 18 from the PAM sequence) and in the promiscuous region, and that transition-type mismatches were preferred at positions 16, 17, and 18. The distribution of mismatches was a distinguishing feature between high-activity and low-activity off-targets. Thermodynamics-associated features, such as a low to moderate melting temperature of the nonseed region and base-dependent PAM binding energy, were identified by the MLP as the best predictors of high-activity off-targets. GC content, certain dinucleotide frequencies, the number of bulges, and mismatches in the seed and trunk regions were other features that distinguished high-activity from low-activity off-targets for both LbCpf1 and AsCpf1.
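
As an illustration of the sequence-derived features named in the abstract (GC content, dinucleotide frequencies, and the position and type of mismatches), the following is a minimal Python sketch. It is not the authors' pipeline. The function names, the example sequences, and in particular the region boundaries in `REGIONS` (seed 1-6, trunk 7-18, promiscuous 19-23, counted from the PAM) are assumptions made for this example and should be replaced with the definitions used in the paper. The sketch assumes equal-length, bulge-free DNA sequences over A, C, G, and T.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# ASSUMPTION: illustrative region boundaries (1-based, counted from the PAM).
# The abstract names the regions but does not state these exact limits.
REGIONS = {"seed": (1, 6), "trunk": (7, 18), "promiscuous": (19, 23)}


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide in the sequence."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


def mismatch_type(a, b):
    """Purine<->purine or pyrimidine<->pyrimidine is a transition."""
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def region_of(position, regions=REGIONS):
    """Name of the region containing a 1-based position, or 'outside'."""
    for name, (start, end) in regions.items():
        if start <= position <= end:
            return name
    return "outside"


def mismatch_profile(target, off_target):
    """List (position, region, type) for every mismatched base.

    Position 1 is taken to be the PAM-proximal base; bulges are not handled.
    """
    if len(target) != len(off_target):
        raise ValueError("sequences must have equal length (no bulges)")
    profile = []
    pairs = zip(target.upper(), off_target.upper())
    for pos, (t, o) in enumerate(pairs, start=1):
        if t != o:
            profile.append((pos, region_of(pos), mismatch_type(t, o)))
    return profile


# Hypothetical 23-nt target and an off-target with two substitutions.
target = "ACGTACGTACGTACGTACGTACG"
off_target = "AAGTACGTACGTACGTGCGTACG"
print(mismatch_profile(target, off_target))
print(round(gc_content(target), 3))
```

Feature vectors built this way could be passed to any classifier, including an MLP. The thermodynamic features highlighted in the abstract (melting temperature and base-dependent binding energy) need an energy model that is not reproduced here.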