Building a FAIR data ecosystem for incorporating single-cell transcriptomics data into agricultural genome to phenome research

Front Genet. 2024 Nov 29:15:1460351. doi: 10.3389/fgene.2024.1460351. eCollection 2024.

Abstract

Introduction: The agriculture genomics community has numerous data submission standards available, but the standards for describing and storing single-cell (SC, e.g., scRNA- seq) data are comparatively underdeveloped.

Methods: To bridge this gap, we leveraged recent advancements in human genomics infrastructure, such as the integration of the Human Cell Atlas Data Portal with Terra, a secure, scalable, open-source platform for biomedical researchers to access data, run analysis tools, and collaborate. In parallel, the Single Cell Expression Atlas at EMBL-EBI offers a comprehensive data ingestion portal for high-throughput sequencing datasets, including plants, protists, and animals (including humans). Developing data tools connecting these resources would offer significant advantages to the agricultural genomics community. The FAANG data portal at EMBL-EBI emphasizes delivering rich metadata and highly accurate and reliable annotation of farmed animals but is not computationally linked to either of these resources.

Results: Herein, we describe a pilot-scale project that determines whether the current FAANG metadata standards for livestock can be used to ingest scRNA-seq datasets into Terra in a manner consistent with HCA Data Portal standards. Importantly, rich scRNA-seq metadata can now be brokered through the FAANG data portal using a semi-automated process, thereby avoiding the need for substantial expert curation. We have further extended the functionality of this tool so that validated and ingested SC files within the HCA Data Portal are transferred to Terra for further analysis. In addition, we verified data ingestion into Terra, hosted on Azure, and demonstrated the use of a workflow to analyze the first ingested porcine scRNA-seq dataset. Additionally, we have also developed prototype tools to visualize the output of scRNA-seq analyses on genome browsers to compare gene expression patterns across tissues and cell populations. This JBrowse tool now features distinct tracks, showcasing PBMC scRNA-seq alongside two bulk RNA-seq experiments.

Discussion: We intend to further build upon these existing tools to construct a scientist-friendly data resource and analytical ecosystem based on Findable, Accessible, Interoperable, and Reusable (FAIR) SC principles to facilitate SC-level genomic analysis through data ingestion, storage, retrieval, re-use, visualization, and comparative annotation across agricultural species.

Keywords: FAANG; FAIR; HCA data portal; data ingestion portal; data resource and analysis; data submission standards; metadata; single-cell RNA-seq.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was funded by a seed grant from the Agricultural Genome to Phenome Initiative (AG2PI), which is funded by USDA-NIFA award 2021-70412-35233, and by USDA NIFA AFRI award 2022-67015-37055.