Graph Algorithms for Condensing and Consolidating Gene Set Analysis Results

Mol Cell Proteomics. 2019 Aug 9;18(8 suppl 1):S141-S152. doi: 10.1074/mcp.TIR118.001263. Epub 2019 May 29.

Abstract

Gene set analysis plays a critical role in the functional interpretation of omics data. Although this is typically done for one omics experiment at a time, there is an increasing need to combine gene set analysis results from multiple experiments performed on the same or different omics platforms, such as in multi-omics studies. Integrating results from multiple experiments is challenging, and annotation redundancy between gene sets further obscures clear conclusions. We propose to use a weighted set cover algorithm to reduce redundancy of gene sets identified in a single experiment. Next, we use affinity propagation to consolidate similar gene sets identified from multiple experiments into clusters and to automatically determine the most representative gene set for each cluster. Using three examples from over-representation analysis and gene set enrichment analysis, we showed that weighted set cover outperformed a previously published set cover method and reduced the number of gene sets by 52-77%. Focusing on overlapping genes between the list of input genes and the enriched gene sets in over-representation analysis and leading-edge genes in gene set enrichment analysis further reduced the number of gene sets. A use case combining enrichment analysis results from RNA-Seq and proteomics data comparing basal and luminal A breast cancer samples highlighted the known difference in proliferation and DNA damage response. Finally, we used these algorithms for a pan-cancer survival analysis. Our analysis clearly revealed prognosis-related pathways common to multiple cancer types or specific to individual cancer types, as well as pathways associated with prognosis in different directions in different cancer types. We implemented these two algorithms in an R package, Sumer, which generates tables as well as static and interactive plots for exploration and publication. Sumer is publicly available at https://github.com/bzhanglab/sumer.
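
The redundancy-reduction step described above can be illustrated with the standard greedy heuristic for weighted set cover. The following is a minimal Python sketch under stated assumptions, not the Sumer implementation (Sumer is an R package and its exact weighting is not given in this abstract). The function name, the toy gene sets, and the costs are hypothetical; the convention that a lower cost marks a more significant gene set is also an assumption made for illustration.

```python
def weighted_set_cover(gene_sets, costs):
    """Greedy weighted set cover (illustrative sketch only).

    gene_sets: dict mapping gene set name -> set of gene symbols
    costs:     dict mapping gene set name -> positive cost
               (assumption: lower cost = more significant set)
    Returns the names of the selected gene sets in selection order.
    """
    # Every gene that appears in at least one enriched gene set must be covered.
    uncovered = set().union(*gene_sets.values())
    remaining = dict(gene_sets)
    chosen = []
    while uncovered:
        # Select the set with the lowest cost per newly covered gene,
        # ignoring sets that would add no new genes.
        best = min(
            (name for name in remaining if remaining[name] & uncovered),
            key=lambda name: costs[name] / len(remaining[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= remaining.pop(best)
    return chosen


# Hypothetical toy input: two pairs of redundant gene sets.
gene_sets = {
    "cell cycle": {"CDK1", "CCNB1", "MKI67", "TOP2A"},
    "mitotic cell cycle": {"CDK1", "CCNB1", "MKI67"},
    "DNA repair": {"BRCA1", "RAD51", "TOP2A"},
    "double-strand break repair": {"BRCA1", "RAD51"},
}
costs = {
    "cell cycle": 0.5,
    "mitotic cell cycle": 1.0,
    "DNA repair": 0.5,
    "double-strand break repair": 1.0,
}

selected = weighted_set_cover(gene_sets, costs)
print(selected)  # ['cell cycle', 'DNA repair']
```

In this toy case the two broader sets cover every gene, so the narrower and costlier sets are dropped. This is the kind of reduction the abstract reports, shown here on a much smaller scale.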

Keywords: Algorithms; Bioinformatics software; Breast cancer; Cancer Biology*; Computational Biology; Data evaluation; Networks*; Omics; Pathway Analysis.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Breast Neoplasms / genetics
  • Colorectal Neoplasms / genetics
  • Female
  • Gene Expression Regulation, Neoplastic
  • Genomics / methods*
  • Humans
  • Neoplasm Proteins / genetics
  • RNA-Seq

Substances

  • Neoplasm Proteins