Benchmarking atlas-level data integration in single-cell genomics

Malte D Luecken; M Büttner; K Chaichoompu; A Danese; M Interlandi; M F Mueller; D C Strobl; L Zappia; M Dugas; M Colomé-Tatché; Fabian J Theis

doi:10.1038/s41592-021-01336-8

Nat Methods. 2022 Jan;19(1):41-50. doi: 10.1038/s41592-021-01336-8. Epub 2021 Dec 23.

Authors

Malte D Luecken¹, M Büttner¹, K Chaichoompu¹, A Danese¹, M Interlandi², M F Mueller¹, D C Strobl¹, L Zappia¹,³, M Dugas⁴, M Colomé-Tatché⁵,⁶,⁷, Fabian J Theis⁸,⁹,¹⁰

Affiliations

¹ Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany.
² Institute of Medical Informatics, University of Münster, Münster, Germany.
³ Department of Mathematics, Technische Universität München, Garching bei München, München, Germany.
⁴ Institute of Medical Informatics, Heidelberg University Hospital, Heidelberg, Germany.
⁵ Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany. maria.colome@bmc.med.lmu.de.
⁶ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany. maria.colome@bmc.med.lmu.de.
⁷ Biomedical Center (BMC), Physiological Chemistry, Faculty of Medicine, Ludwig Maximilian University of Munich, Planegg-Martinsried, Germany. maria.colome@bmc.med.lmu.de.
⁸ Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany. fabian.theis@helmholtz-muenchen.de.
⁹ Department of Mathematics, Technische Universität München, Garching bei München, München, Germany. fabian.theis@helmholtz-muenchen.de.
¹⁰ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany. fabian.theis@helmholtz-muenchen.de.

Abstract

Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.
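
The two preprocessing decisions named in the abstract, highly variable gene selection and scaling, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`select_highly_variable`, `scale_genes`, `preprocess`), the toy matrix, and the use of plain per-gene variance as the selection criterion are all assumptions made for illustration. Real pipelines typically rank genes with dispersion-based, batch-aware criteria instead.

```python
from statistics import mean, pvariance


def select_highly_variable(matrix, n_top):
    """Return sorted column indices of the n_top genes with the highest variance.

    Assumption: plain population variance stands in for a real
    highly-variable-gene criterion.
    """
    n_genes = len(matrix[0])
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])


def scale_genes(matrix):
    """Scale each gene (column) to zero mean and unit variance.

    Genes with zero variance are mapped to 0.0 to avoid division by zero.
    """
    n_genes = len(matrix[0])
    columns = [[row[g] for row in matrix] for g in range(n_genes)]
    scaled_cols = []
    for col in columns:
        mu = mean(col)
        sd = pvariance(col) ** 0.5
        scaled_cols.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    # Transpose back to a cells-by-genes layout.
    return [list(cells) for cells in zip(*scaled_cols)]


def preprocess(matrix, n_top=None, scale=False):
    """Apply optional gene selection, then optional scaling (hypothetical helper)."""
    if n_top is not None:
        keep = select_highly_variable(matrix, n_top)
        matrix = [[row[g] for g in keep] for row in matrix]
    if scale:
        matrix = scale_genes(matrix)
    return matrix


# Toy cells-by-genes expression matrix (4 cells, 3 genes); gene 0 is constant.
toy = [
    [1.0, 5.0, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 5.0, 4.0],
    [1.0, 0.0, 6.0],
]

selected = select_highly_variable(toy, 2)
processed = preprocess(toy, n_top=2, scale=True)
```

Toggling `n_top` and `scale` independently yields the kind of preprocessing variants that an integration method could then be run on and compared across.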

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Animals
Benchmarking
Computational Biology / methods*
Databases, Genetic
Genomics / methods*
Humans
Immune System / cytology
Mice
Sequence Analysis, RNA / methods
Single-Cell Analysis / methods*
Software*

Associated data

figshare/10.6084/m9.figshare.12420968
