The performance of computational methods and software to identify differentially expressed features in single-cell RNA-sequencing (scRNA-seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on their own assumptions to model scRNA-seq expression features. To model the technological variability in cross-platform scRNA-seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA-seq expression profiles across experimental platforms induced by platform- and gene-specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero-inflated Tweedie model that allows zero probability mass to exceed a traditional Tweedie distribution to model zero-inflated scRNA-seq data with excessive zero counts. Using both synthetic and published plate- and droplet-based scRNA-seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state-of-the-art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open-source software (R/Bioconductor package) is available at https://github.com/himelmallick/Tweedieverse.
Keywords: Tweedie distribution; differential expression; exponential dispersion model; generalized linear model; single-cell RNA-sequencing; zero-inflation.
© 2022 John Wiley & Sons Ltd. This article has been contributed to by U.S. Government employees and their work is in the public domain in the USA.