Drug-induced liver injury (DILI) is a leading cause of acute liver failure in the US and of less severe liver injury worldwide. It is also one of the major reasons for drug withdrawal from the market. DILI has therefore become one of the most important safety concerns in drug development, and it should be predicted at a very early stage of the drug discovery process. In this study, a comprehensive data set containing 1317 diverse compounds was collected from publications. High-accuracy classification models were then built using five machine learning methods based on MACCS and FP4 fingerprints, after the fingerprints had been evaluated with the substructure pattern recognition method. The best model was built using the support vector machine (SVM) method together with the FP4 fingerprint at an information gain (IG) threshold of 0.0005. Its overall predictive accuracies were 79.7 % and 64.5 % for the training and test sets, respectively, and it achieved an overall accuracy of 75.0 % on the external validation dataset, which consisted of 88 compounds collected from a benchmark DILI database, the Liver Toxicity Knowledge Base. This model could be used for the prediction of drug-induced liver toxicity. Moreover, some key substructure patterns correlated with drug-induced liver toxicity were identified as structural alerts.
Keywords: Drug-induced liver injury; machine learning; structural alerts; substructure pattern recognition.
© 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
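As an illustration of the IG threshold mentioned in the abstract, the following minimal sketch computes the information gain of each binary fingerprint bit with respect to a binary toxicity label and keeps the bits that reach a threshold. It is not the authors' implementation: the function names, the toy data, and the choice to keep bits whose IG is greater than or equal to the threshold are assumptions made for illustration. Only the threshold value of 0.0005 comes from the abstract.

```python
import math
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(bit_column: Sequence[int], labels: Sequence[int]) -> float:
    """Entropy of the labels minus their entropy conditioned on one bit."""
    n = len(labels)
    present = [y for b, y in zip(bit_column, labels) if b == 1]
    absent = [y for b, y in zip(bit_column, labels) if b == 0]
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_bits(fingerprints: Sequence[Sequence[int]],
                labels: Sequence[int],
                threshold: float = 0.0005) -> List[int]:
    """Return indices of bits whose information gain is >= threshold.

    Assumption: bits at or above the threshold are kept; the abstract
    does not state whether the comparison is strict.
    """
    n_bits = len(fingerprints[0])
    kept = []
    for j in range(n_bits):
        column = [fp[j] for fp in fingerprints]
        if information_gain(column, labels) >= threshold:
            kept.append(j)
    return kept


# Hypothetical toy data: 4 compounds, 3 fingerprint bits, labels 1 = hepatotoxic.
toy_fingerprints = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
]
toy_labels = [1, 1, 0, 0]

# Bit 0 separates the classes perfectly; bits 1 and 2 carry no information.
print(select_bits(toy_fingerprints, toy_labels))  # prints [0]
```

In a workflow like the one summarized above, the retained bit indices would define the reduced fingerprint passed to a classifier such as an SVM; that step is omitted here because the abstract does not give its details.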