DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis

Quanhu Sheng; Yu Shyr; Xi Chen

doi:10.1186/1471-2105-15-323

BMC Bioinformatics. 2014 Sep 30;15(1):323. doi: 10.1186/1471-2105-15-323.

Authors

Quanhu Sheng, Yu Shyr, Xi Chen¹

Affiliation

¹ Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, TN 37232, USA. xi.steven.chen@gmail.com.

Abstract

Background: Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns in datasets. However, when publicly available databases are used for meta-analysis, duplication of samples is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false-positive findings, misleading clustering patterns, or model over-fitting in the subsequent data analysis.

Results: We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real data example demonstrates the usage and output of the package.

Conclusions: Researchers may not pay enough attention to checking for and removing duplicated samples, and the resulting data contamination can make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression datasets before any data analysis step.
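
The fingerprinting idea described in the Results can be illustrated with a minimal sketch. This is not DupChecker's code or API (the package itself is written for R/Bioconductor); it is a Python illustration of the general technique, assuming that each sample is stored as one raw data file and that duplicates are byte-identical copies. The function names md5_fingerprint and find_duplicates are hypothetical and were chosen only for this example.

```python
import hashlib
from collections import defaultdict


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large raw data files are never
        # loaded into memory all at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group files by MD5 digest and keep groups with more than one file.

    Hypothetical helper for illustration only; returns a dict mapping
    each shared digest to the sorted list of file paths that have it.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[md5_fingerprint(path)].append(str(path))
    return {md5: sorted(files) for md5, files in groups.items() if len(files) > 1}
```

Because the digest depends only on file contents, a copy of a sample deposited under a different file name or in a different dataset is still flagged. Files that were re-processed or re-exported are no longer byte-identical, so a content hash of this kind would not match them.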

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Cluster Analysis
Data Interpretation, Statistical
Databases, Genetic
Gene Expression Profiling
Genomics / methods*
High-Throughput Nucleotide Sequencing / methods*
Meta-Analysis as Topic*
Software*
