Background: Hepatitis B virus (HBV) infection is one of the main leading causes of hepatocellular carcinoma (HCC) worldwide. However, it remains uncertain how the reverse-transcriptase (rt) gene contributes to HCC progression.
Methods: We enrolled a total of 307 patients with chronic hepatitis B (CHB) and 237 with HBV-related HCC from 13 medical centers. Sequence features comprised multidimensional attributes of rt nucleic acid and rt/s amino acid sequences. Machine-learning models were used to establish HCC predictive algorithms. Model performances were tested in the training and independent validation cohorts using receiver operating characteristic curves and calibration plots.
Results: A random forest (RF) model based on combined metrics (10 features) demonstrated the best predictive performances in both cross and independent validation (AUC, 0.96; accuracy, 0.90), irrespective of HBV genotypes and sequencing depth. Moreover, HCC risk scores for individuals obtained from the RF model (AUC, 0.966; 95% confidence interval, .922-.989) outperformed α-fetoprotein (0.713; .632-.784) in distinguishing between patients with HCC and those with CHB.
Conclusions: Our study provides evidence for the first time that HBV rt sequences contain vital HBV quasispecies features in predicting HCC. Integrating deep sequencing with feature extraction and machine-learning models benefits the longitudinal surveillance of CHB and HCC risk assessment.
Keywords: algorithm; hepatitis B virus (HBV); hepatocellular carcinoma (HCC); machine learning (ML); next-generation sequencing (NGS).
© The Author(s) 2020. Published by Oxford University Press for the Infectious Diseases Society of America. All rights reserved. For permissions, e-mail: journals.permissions@oup.com.