Inferring causality from high-dimensional metagenomic data is difficult because of interference from numerous confounders. Taking twin studies in genetic research as a model, we develop a straightforward method, virtual twins (VTwins), that eliminates confounder effects by transforming the original cohort into a paired cohort of "Twin" samples with distinct phenotypes but matched taxonomic profiles. On both simulated and empirical data, VTwins is more sensitive than the conventional approach in identifying causative features and requires a 10-fold smaller sample size to recall disease-associated microbes or pathways. Benchmarking against 16 other software tools further validates the power and applicability of VTwins for handling high-dimensional compositional datasets and mining causalities in metagenomic research. In conclusion, VTwins is straightforward and effective in handling high-diversity, high-dimensional compositional data, with promising applications in mining causalities for metagenomic and potentially other omics data. VTwins is open access and available at https://github.com/mengqingren/VTwins.
Keywords: Causality; Differential abundance; High-dimensional data; Metagenome; Paired cohort.
Copyright © 2023 Science China Press. Published by Elsevier B.V. and Science China Press. All rights reserved.
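
The pairing idea summarized in the abstract can be illustrated with a minimal sketch. This is not the VTwins implementation: the Bray-Curtis dissimilarity, the `max_dist` threshold, the greedy one-to-one matching, and the exact sign test are all assumptions chosen for illustration, and the function names are hypothetical. The sketch pairs each case sample with a taxonomically similar control and then tests one feature on the within-pair differences.

```python
from math import comb


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(a) + sum(b)
    return num / den if den > 0 else 0.0


def build_twin_pairs(profiles, phenotypes, max_dist=0.3):
    """Greedily pair cases (1) with controls (0) that have similar profiles.

    ASSUMPTION: greedy matching on Bray-Curtis with a fixed threshold;
    the published method may match samples differently.
    Returns a list of (case_index, control_index) tuples.
    """
    cases = [i for i, p in enumerate(phenotypes) if p == 1]
    controls = [i for i, p in enumerate(phenotypes) if p == 0]
    # Every case/control combination, closest pairs first.
    candidates = sorted(
        (bray_curtis(profiles[i], profiles[j]), i, j)
        for i in cases
        for j in controls
    )
    used, pairs = set(), []
    for dist, i, j in candidates:
        if dist > max_dist:
            break  # remaining candidates are even less similar
        if i in used or j in used:
            continue  # each sample joins at most one pair
        pairs.append((i, j))
        used.update((i, j))
    return pairs


def sign_test_p(diffs):
    """Two-sided exact sign test p-value; zero differences are ignored."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_feature_test(profiles, pairs, feature):
    """Test one feature on case-minus-control differences within pairs."""
    diffs = [profiles[i][feature] - profiles[j][feature] for i, j in pairs]
    return sign_test_p(diffs)


# Toy data: two cases (indices 0, 1) and two controls (indices 2, 3).
profiles = [
    [0.50, 0.30, 0.20],
    [0.10, 0.10, 0.80],
    [0.45, 0.35, 0.20],
    [0.15, 0.05, 0.80],
]
phenotypes = [1, 1, 0, 0]
pairs = build_twin_pairs(profiles, phenotypes)
p_value = paired_feature_test(profiles, pairs, feature=0)
```

Because each case is compared only with a control of similar overall composition, differences that remain within a pair are less likely to reflect community-wide confounding. In practice, a paired test with more power than the sign test would likely be preferred.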