Open source and reproducible and inexpensive infrastructure for data challenges and education

Sci Data. 2024 Jan 2;11(1):8. doi: 10.1038/s41597-023-02854-0.

Abstract

Data sharing is necessary to maximize the actionable knowledge generated from research data. Data challenges can encourage secondary analyses of datasets. Data challenges in biomedicine often rely on advanced cloud-based computing infrastructure and expensive industry partnerships. Examples include challenges that use Google Cloud virtual machines and the Sage Bionetworks DREAM Challenges platform. Such infrastructure can be cost-prohibitive for investigators without substantial resources. Given the potential of secondary analyses to advance scientific and clinical knowledge, and the NIH emphasis on data sharing and reuse, there is a need for inexpensive and computationally lightweight methods for sharing data and hosting data challenges. To fill that gap, we developed a workflow that allows for reproducible model training, testing, and evaluation. The workflow relies on public GitHub repositories, open-source programming languages, and Docker containers. In addition, we conducted a data challenge using the infrastructure we developed. In this manuscript, we report on the infrastructure, the workflow, and the data challenge results. The infrastructure and workflow are likely to be useful for data challenges and education.
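
As an illustration only, the evaluation step of such a workflow could be sketched as a small scoring function that compares a participant's predictions with held-out labels. The abstract does not specify file formats or metrics, so everything here is an assumption: the function name `score_submission`, the CSV columns `id` and `label`, and the choice of accuracy as the metric are all hypothetical and not taken from the authors' infrastructure.

```python
import csv
import io


def score_submission(predictions_csv: str, truth_csv: str) -> float:
    """Return the accuracy of predicted labels against held-out labels.

    Both arguments are CSV text with the hypothetical columns `id` and
    `label`; neither the format nor the metric comes from the paper.
    """
    truth = {row["id"]: row["label"]
             for row in csv.DictReader(io.StringIO(truth_csv))}
    preds = {row["id"]: row["label"]
             for row in csv.DictReader(io.StringIO(predictions_csv))}
    if not truth:
        raise ValueError("truth file contains no rows")
    # A case that is missing from the submission is counted as incorrect.
    correct = sum(1 for key, label in truth.items() if preds.get(key) == label)
    return correct / len(truth)
```

Because a function like this depends only on the Python standard library, it could run unchanged inside a container built from a pinned base image, which is one way the same submission might be scored identically on different machines.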