SPLASH: A statistical, reference-free genomic algorithm unifies biological discovery

Cell. 2023 Dec 7;186(25):5440-5456.e26. doi: 10.1016/j.cell.2023.10.028.

Abstract

Today's genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), which directly analyzes raw sequencing data, using a statistical test to detect a signature of regulation: sample-specific sequence variation. SPLASH detects many types of variation and can be efficiently run at scale. We show that SPLASH identifies complex mutation patterns in SARS-CoV-2, discovers regulated RNA isoforms at the single-cell level, detects the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a unifying approach to genomic analysis that enables expansive discovery without metadata or references.
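The core idea of testing raw reads for sample-specific sequence variation, with no alignment step, can be illustrated with a small sketch. It is not the published SPLASH implementation or its statistic. It assumes a simplified setup in which each k-mer (called an "anchor" here) is paired with the k-mer that follows it (a "target"), and it swaps in a Pearson chi-square statistic with a permutation p-value as a stand-in test. The function names, the parameters `k` and `gap`, and the toy reads are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def count_anchor_targets(samples, k=4, gap=0):
    """Return anchor -> Counter of (target, sample) counts from raw reads.

    `samples` maps a sample name to a list of read strings. No reference
    sequence is used: anchors and targets are taken directly from the reads.
    """
    tables = defaultdict(Counter)
    for sample, reads in samples.items():
        for read in reads:
            for i in range(len(read) - 2 * k - gap + 1):
                anchor = read[i:i + k]
                start = i + k + gap
                target = read[start:start + k]
                tables[anchor][(target, sample)] += 1
    return tables


def chi_square(pairs):
    """Pearson chi-square statistic for a target-by-sample count table."""
    n = sum(pairs.values())
    row, col = Counter(), Counter()
    for (target, sample), count in pairs.items():
        row[target] += count
        col[sample] += count
    stat = 0.0
    for target in row:
        for sample in col:
            expected = row[target] * col[sample] / n
            stat += (pairs[(target, sample)] - expected) ** 2 / expected
    return stat


def permutation_pvalue(pairs, n_perm=200, seed=0):
    """Permutation p-value: shuffle sample labels across observed targets."""
    rng = random.Random(seed)
    targets, labels = [], []
    for (target, sample), count in pairs.items():
        targets.extend([target] * count)
        labels.extend([sample] * count)
    observed = chi_square(pairs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if chi_square(Counter(zip(targets, labels))) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Toy data (made up): the same anchor "ACGT" is followed by a different
# target in each sample, which is the sample-specific variation of interest.
varied = {"sample_A": ["ACGTAAAA"] * 20, "sample_B": ["ACGTCCCC"] * 20}
# Control: both samples contain the same mix of targets.
uniform = {
    "sample_A": ["ACGTAAAA"] * 10 + ["ACGTCCCC"] * 10,
    "sample_B": ["ACGTAAAA"] * 10 + ["ACGTCCCC"] * 10,
}

p_varied = permutation_pvalue(count_anchor_targets(varied)["ACGT"])
p_uniform = permutation_pvalue(count_anchor_targets(uniform)["ACGT"])
print(f"varied: p = {p_varied:.3f}, uniform: p = {p_uniform:.3f}")
```

In this toy case the anchor whose targets differ by sample gets a small p-value, while the control anchor does not. A real analysis would also need multiple-testing correction across anchors and a test that scales to billions of reads, neither of which this sketch attempts.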

Keywords: RNA-seq; computational biology; genetics; genomics; reference-free; single-cell RNA-seq; splicing; statistics.

MeSH terms

  • Algorithms*
  • Genome
  • Genomics*
  • HLA Antigens / genetics
  • Humans
  • Sequence Analysis, RNA
  • Single-Cell Analysis

Substances

  • HLA Antigens