Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Nuala A O'Leary; Eric Cox; J Bradley Holmes; W Ray Anderson; Robert Falk; Vichet Hem; Mirian T N Tsuchiya; Gregory D Schuler; Xuan Zhang; John Torcivia; Anne Ketter; Laurie Breen; Jonathan Cothran; Hena Bajwa; Jovany Tinne; Peter A Meric; Wratko Hlavina; Valerie A Schneider

doi:10.1038/s41597-024-03571-y

Sci Data. 2024 Jul 5;11(1):732. doi: 10.1038/s41597-024-03571-y.

Authors

Nuala A O'Leary¹, Eric Cox², J Bradley Holmes², W Ray Anderson², Robert Falk², Vichet Hem², Mirian T N Tsuchiya², Gregory D Schuler², Xuan Zhang², John Torcivia², Anne Ketter², Laurie Breen², Jonathan Cothran², Hena Bajwa², Jovany Tinne², Peter A Meric², Wratko Hlavina², Valerie A Schneider²

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD, 20894, USA. olearyna@ncbi.nlm.nih.gov.
² National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD, 20894, USA.

Abstract

To explore complex biological questions, it is often necessary to access various data types from public data repositories. As the volume and complexity of biological sequence data grow, public repositories face significant challenges in ensuring that the data is easily discoverable and usable by the biological research community. To address these challenges, the National Center for Biotechnology Information (NCBI) has created NCBI Datasets. This resource provides straightforward, comprehensive, and scalable access to biological sequences, annotations, and metadata for a wide range of taxa. Following the FAIR (Findable, Accessible, Interoperable, and Reusable) data management principles, NCBI Datasets offers user-friendly web interfaces, command-line tools, and documented APIs, empowering researchers to access NCBI data seamlessly. The data is delivered as packages of sequences and metadata, thus facilitating improved data retrieval, sharing, and usability in research. Moreover, this data delivery method fosters effective data attribution and promotes further reuse of the data. This paper outlines the current scope of data accessible through NCBI Datasets and explains various options for exploring and downloading the data.

Publication types

Dataset

MeSH terms

Databases, Genetic
Information Storage and Retrieval
Metadata*
United States