Discovering and Summarizing Relationships Between Chemicals, Genes, Proteins, and Diseases in PubChem

Leonid Zaslavsky; Tiejun Cheng; Asta Gindulyte; Siqian He; Sunghwan Kim; Qingliang Li; Paul Thiessen; Bo Yu; Evan E Bolton

doi:10.3389/frma.2021.689059

Front Res Metr Anal. 2021 Jul 12:6:689059. doi: 10.3389/frma.2021.689059. eCollection 2021.

Authors

Leonid Zaslavsky¹, Tiejun Cheng¹, Asta Gindulyte¹, Siqian He¹, Sunghwan Kim¹, Qingliang Li¹, Paul Thiessen¹, Bo Yu¹, Evan E Bolton¹

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States.

Abstract

The literature knowledge panels developed and implemented in PubChem are described. These panels help uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, summarizing these relationships in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with the benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.

Keywords: PubChem; data mining; information retrieval; knowledge discovery; knowledge graph; knowledge panels; knowledge summarization; natural language processing.
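
The co-occurrence analysis summarized in the abstract can be illustrated with a minimal Python sketch. It assumes that each literature record has already been reduced to a list of matched entity names, and it ranks neighbors by raw co-occurrence count only. The function names (`cooccurrence_counts`, `top_neighbors`) and the toy records are hypothetical, and the raw count stands in for the statistical relevance scoring, which the abstract does not specify.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how many records mention each unordered pair of entities."""
    counts = Counter()
    for entities in records:
        # Deduplicate within a record and sort so that each pair
        # is stored in one canonical order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def top_neighbors(counts, entity, k=3):
    """Return up to k (neighbor, count) pairs co-occurring most often with `entity`."""
    neighbors = Counter()
    for (a, b), n in counts.items():
        if a == entity:
            neighbors[b] += n
        elif b == entity:
            neighbors[a] += n
    return neighbors.most_common(k)


# Hypothetical toy data: entities matched in four abstracts.
records = [
    ["aspirin", "PTGS2", "inflammation"],
    ["aspirin", "PTGS2"],
    ["aspirin", "inflammation", "stroke"],
    ["ibuprofen", "PTGS2", "inflammation"],
]
counts = cooccurrence_counts(records)
```

Calling `top_neighbors(counts, "aspirin")` then lists the entities most often mentioned alongside aspirin in this toy data. A raw count favors entities that are frequent everywhere, which is one reason a relevance score and redundancy removal, as discussed in the abstract, are needed in practice.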