Localized assembly for long reads enables genome-wide analysis of repetitive regions at single-base resolution in human genomes

Hum Genomics. 2023 Mar 9;17(1):21. doi: 10.1186/s40246-023-00467-7.

Abstract

Background: Long-read sequencing technologies have the potential to overcome the limitations of short reads and provide a comprehensive picture of the human genome. However, the characterization of repetitive sequences by reconstructing genomic structures at high resolution solely from long reads remains difficult. Here, we developed a localized assembly method (LoMA) that constructs highly accurate consensus sequences (CSs) from long reads.

Methods: We developed LoMA by combining minimap2, MAFFT, and our algorithm, which classifies diploid haplotypes based on structural variants and CSs. Using this tool, we analyzed two human samples (NA18943 and NA19240) sequenced with the Oxford Nanopore sequencer. We defined target regions in each genome based on mapping patterns and then constructed a high-quality catalog of the human insertion solely from the long-read data.

Results: The assessment of LoMA showed a high accuracy of CSs (error rate < 0.3%) compared with raw data (error rate > 8%) and superiority to a previous study. The genome-wide analysis of NA18943 and NA19240 identified 5516 and 6542 insertions (≥ 100 bp), respectively. Most insertions (~ 80%) were derived from tandem repeats and transposable elements. We also detected processed pseudogenes, insertions in transposable elements, and long insertions (> 10 kbp). Finally, our analysis suggested that short tandem duplications are associated with gene expression and transposons.

Conclusions: Our analysis showed that LoMA constructs high-quality sequences from long reads with substantial errors. This study revealed the true structures of the insertions with high accuracy and inferred the mechanisms for the insertions, thus contributing to future human genome studies. LoMA is available at our GitHub page: https://github.com/kolikem/loma .

Keywords: Insertions; Localized assembly; Long reads; Nanopore; ONT; Structural variation; Variant.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • DNA Transposable Elements* / genetics
  • Genome, Human* / genetics
  • Genomics
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Sequence Analysis, DNA / methods

Substances

  • DNA Transposable Elements