Objective: Recent years have seen an increase in machine learning (ML)-based blood glucose (BG) forecasting models, with a growing emphasis on potential application to hybrid or closed-loop predictive glucose controllers. However, current approaches focus on evaluating the accuracy of these models using benchmark data generated under the behavior policy, which may differ significantly from the data the model may encounter in a control setting. This study challenges the efficacy of such evaluation approaches, demonstrating that they can fail to accurately capture an ML-based model's true performance in closed-loop control settings.
Methods: Forecast error measured under current evaluation approaches was compared with in silico control performance for two forecasters, a machine learning-based model (LSTM) and a rule-based model (Loop), each paired with a model-based controller in a hybrid closed-loop setting.
Results: Under current evaluation standards, LSTM achieved a significantly lower (better) forecast error than Loop, with a root mean squared error (RMSE) of 11.57 ± 0.05 mg/dL vs. 18.46 ± 0.07 mg/dL at the 30-minute prediction horizon. Yet in a control setting, LSTM led to significantly worse control performance, with only 77.14% (IQR 66.57-84.03) time-in-range compared to 86.20% (IQR 78.28-91.21) for Loop.
Conclusion: Prevailing evaluation methods can fail to accurately capture a forecaster's performance when it is used in closed-loop settings.
Significance: Our findings underscore the limitations of current evaluation standards and the need for alternative evaluation metrics and training strategies when developing BG forecasters for closed-loop control systems.
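Illustration: The two metrics contrasted in the Results, forecast RMSE and time-in-range, can be sketched as follows. This is a minimal sketch and not the study's evaluation code: the function names, the toy values, and the 70-180 mg/dL target range are assumptions made for illustration, since the abstract does not state the range bounds used.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between forecast and reference BG values (mg/dL)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def time_in_range(bg_trace, low=70.0, high=180.0):
    """Percentage of BG samples with low <= BG <= high.

    The default bounds (70-180 mg/dL) are an assumption, not taken from the abstract.
    """
    if not bg_trace:
        raise ValueError("trace must be non-empty")
    in_range = sum(1 for g in bg_trace if low <= g <= high)
    return 100.0 * in_range / len(bg_trace)


# Hypothetical toy values, not data from the study.
forecast = [110.0, 150.0, 190.0, 65.0]            # forecasts on benchmark data
reference = [100.0, 160.0, 180.0, 75.0]           # measured BG at the forecast horizon
closed_loop_trace = [100.0, 160.0, 200.0, 60.0]   # BG trace from a closed-loop run

print(f"RMSE: {rmse(forecast, reference):.2f} mg/dL")
print(f"Time-in-range: {time_in_range(closed_loop_trace):.2f}%")
```

The sketch makes the distinction concrete: RMSE is computed against benchmark data collected in advance, whereas time-in-range is computed on the glucose trace that results when the forecaster drives the controller, so a lower RMSE does not by itself imply a higher time-in-range.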