EBT: a statistic test identifying moderate size of significant features with balanced power and precision for genome-wide rate comparisons

Bioinformatics. 2017 Sep 1;33(17):2631-2641. doi: 10.1093/bioinformatics/btx294.

Abstract

Motivation: In genome-wide rate comparison studies, objectively identifying an appropriate number of significant features is a major challenge: traditional statistical comparisons without multiple-testing correction can generate a large number of false positives, whereas multiple-testing correction greatly decreases statistical power.
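
The trade-off described above can be illustrated with a minimal Python sketch. It is not part of the EBT package. The p-values are hypothetical and chosen only for illustration, and the two corrections shown (Bonferroni and Benjamini-Hochberg) are standard procedures assumed here as representative examples of multiple-testing correction.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of features significant after Bonferroni correction."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of features significant under Benjamini-Hochberg FDR control."""
    m = len(pvalues)
    # Rank features from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical per-gene p-values from a rate comparison (illustration only).
pvalues = [0.0001, 0.004, 0.012, 0.03, 0.04, 0.2, 0.5, 0.8]

uncorrected = [i for i, p in enumerate(pvalues) if p <= 0.05]
print("uncorrected:", len(uncorrected), "significant features")
print("Benjamini-Hochberg:", len(benjamini_hochberg(pvalues)), "significant features")
print("Bonferroni:", len(bonferroni(pvalues)), "significant features")
```

In this toy input the number of significant features shrinks as the correction becomes stricter, which is the loss of power the motivation refers to.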

Results: In this study, we propose a new exact test based on translating the rate comparison into two binomial distributions. With modeling and real datasets, the exact binomial test (EBT) showed an advantage in balancing statistical precision and power, providing an appropriately sized set of significant features for further study. Both correlation analysis and bootstrapping tests demonstrated that EBT is as robust as typical rate-comparison methods, e.g. the χ² test, Fisher's exact test and the binomial test. A performance comparison among machine learning models built on features identified by different statistical tests further demonstrated the advantage of EBT. The new test was also applied to analyze the genome-wide difference in somatic gene mutation rates between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the two main lung cancer subtypes. A list of new markers was identified that could be lineage-specifically associated with carcinogenesis of LUAD and LUSC, respectively. Interestingly, three cilia genes were found to have selectively high mutation rates in LUSC, possibly implying the importance of cilia dysfunction in carcinogenesis.
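
For orientation, one of the typical rate-comparison methods named above, Fisher's exact test, can be sketched in a few lines of Python. This is not the EBT procedure, whose details are not given in this abstract. The function name and the mutation counts are hypothetical, and the two-sided p-value uses the common convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, n1, b, n2):
    """Two-sided Fisher's exact p-value for comparing rates a/n1 and b/n2.

    a, b   : number of mutated samples in group 1 and group 2
    n1, n2 : total number of samples in group 1 and group 2
    """
    total = a + b                      # mutated samples across both groups
    denom = comb(n1 + n2, total)

    def prob(x):
        # Hypergeometric probability that x of the mutated samples fall in group 1.
        return comb(n1, x) * comb(n2, total - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, total - n2), min(n1, total)
    # Sum over all tables at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts: a gene mutated in 30 of 200 samples of one subtype
# and in 10 of 200 samples of the other.
p_value = fisher_exact_two_sided(30, 200, 10, 200)
print(f"two-sided p-value: {p_value:.4g}")
```

In a genome-wide setting a test of this kind would be run once per gene, which is where the multiple-testing issue raised in the motivation arises.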

Availability and implementation: An R package implementing EBT can be freely downloaded from http://www.szu-bioinf.org/EBT .

Contact: wangyj@szu.edu.cn.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Adenocarcinoma / genetics*
  • Adenocarcinoma of Lung
  • Carcinoma, Squamous Cell / genetics*
  • Databases, Genetic
  • Genome, Human
  • Genomics / methods
  • Humans
  • Lung Neoplasms / genetics*
  • Machine Learning
  • Mutation*
  • Sequence Analysis, DNA / methods*