BMT: A Cross-Validated ThinPrep Pap Cervical Cytology Dataset for Machine Learning Model Training and Validation

Sci Data. 2024 Dec 28;11(1):1444. doi: 10.1038/s41597-024-04328-3.

Abstract

In the past several years, a few cervical Pap smear datasets have been published for use in clinical training. However, most publicly available datasets consist of pre-segmented single-cell images, contain on-image annotations that must be manually edited out, or are prepared using the conventional Pap smear method. Multicellular liquid-based Pap image datasets more accurately reflect current cervical screening techniques. Although a multicellular liquid-based SurePath™ dataset has been created, machine learning models struggle to classify test images prepared with a different method from the training images, because the two preparations differ visually. This dataset of multicellular Pap smear images prepared with the more common ThinPrep® protocol is therefore presented as a resource for training and testing artificial intelligence models, particularly for future application in cervical dysplasia diagnosis. The "Brown Multicellular ThinPrep" (BMT) dataset is the first publicly available multicellular ThinPrep® dataset. It consists of 600 clinically vetted images collected from 180 Pap smear slides from 180 patients, classified into three key diagnostic categories.
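Because the 600 images come from 180 slides, some slides contribute more than one image, so a train/test split made per image could place images from the same patient on both sides. The following is a minimal sketch of a slide-level split, not the authors' procedure. The record layout `(image_id, slide_id, label)`, the file names, and the class names `class_a`, `class_b`, `class_c` are hypothetical placeholders; the abstract does not specify the dataset's file structure or label names.

```python
import random
from collections import defaultdict


def split_by_slide(records, test_fraction=0.2, seed=0):
    """Split (image_id, slide_id, label) records so no slide spans both sets."""
    # Group images by the slide (and hence patient) they came from.
    by_slide = defaultdict(list)
    for record in records:
        by_slide[record[1]].append(record)

    # Shuffle slide identifiers reproducibly, then cut at the slide level.
    slide_ids = sorted(by_slide)
    random.Random(seed).shuffle(slide_ids)
    n_test = max(1, round(len(slide_ids) * test_fraction))

    test = [r for s in slide_ids[:n_test] for r in by_slide[s]]
    train = [r for s in slide_ids[n_test:] for r in by_slide[s]]
    return train, test


# Toy data only: 6 hypothetical slides with 3 images each.
# These are placeholders, not the real BMT files or labels.
labels = ["class_a", "class_b", "class_c"]
records = [
    (f"img_{s}_{i}.png", f"slide_{s}", labels[s % 3])
    for s in range(6)
    for i in range(3)
]

train, test = split_by_slide(records, test_fraction=1 / 3, seed=0)
train_slides = {r[1] for r in train}
test_slides = {r[1] for r in test}
```

With this grouping, `train_slides` and `test_slides` never overlap. If scikit-learn is available, its `GroupShuffleSplit` class provides a comparable grouped split.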

Publication types

  • Dataset

MeSH terms

  • Cervix Uteri / cytology
  • Female
  • Humans
  • Machine Learning*
  • Papanicolaou Test*
  • Uterine Cervical Neoplasms / diagnosis
  • Vaginal Smears