A new method based on physical patterns to impute aerobiological datasets

PLoS One. 2024 Nov 19;19(11):e0314005. doi: 10.1371/journal.pone.0314005. eCollection 2024.

Abstract

Limited research has assessed the accuracy of imputation methods in aerobiological datasets. We conducted a simulation study to evaluate, for the first time, the effectiveness of Gappy Singular Value Decomposition (GSVD), a data-driven approach, comparing it with the moving mean interpolation, a statistical approach. Utilizing complete pollen data from two monitoring stations in northeastern Italy for 2022, we randomly generated missing data considering the combination of various proportions (5%, 10%, 25%) and gap lengths (3, 5, 7, 10 days). We imputed 4800 time series using the GSVD algorithm, specifically implemented for this study, and the moving mean algorithm of the "AeRobiology" R package. We assessed imputation accuracy by calculating the Root Mean Square Error and employed multiple linear regression models to identify factors independently affecting the error (e.g. pollen variability, simulation settings). The results showed that the GSVD was as good as the well-established moving mean method and demonstrated its strong generalization capabilities across different data types. However, the imputation error was primarily influenced by pollen characteristics and location, regardless of the imputation method used. High variability in pollen concentrations and the distribution of missing data negatively affected imputation accuracy. In conclusion, we introduced and tested a novel imputation method, demonstrating comparable performance to the statistical approach in aerobiological data reconstruction. These findings contribute to advancing aerobiological data analysis, highlighting the need for improving imputation methods.

MeSH terms

  • Algorithms*
  • Computer Simulation
  • Environmental Monitoring / methods
  • Humans
  • Italy
  • Pollen*

Grants and funding

A.M. received grants to conduct the MEETOUT study from the European Union through the Italian Ministry of University and Research under the ESF REACT-EU Green and Innovation funding programme (Ministerial Decree 1061/2021) and the NextGenerationEu funding programme (Ministerial Decree 737/2021). Article processing charges were supported by the special fund at the University of Verona dedicated to Open Access publications. S.L.C. and A.C. acknowledge the grants PID2023-147790OB-I00, TED2021-129774B-C21 and PLEC2022-009235 funded by MCIN/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU”/PRTR. The authors acknowledge the MODELAIR and ENCODING projects that have received funding from the European Union’s Horizon Europe research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 101072559 and 101072779, respectively. The results of this publication reflect only the authors view and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them. A.C. acknowledges the support of Universidad Politécnica de Madrid, under the program ‘Programa Propio’. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional external funding received for this study.