Despite meticulous precautions, contamination of genomic DNA samples is not uncommon and can significantly compromise the analysis of microbial whole-genome sequencing data, affecting all subsequent analyses. Thanks to advances in software and bioinformatics techniques, it is now possible to address this issue and avoid losing the entire dataset from a whole-genome sequencing run contaminated with the DNA of another bacterium. In this study, the sequencing reads from Streptomyces sp. BRB040, generated on the HiSeq System platform (Illumina Inc., San Diego, USA), were found to be contaminated with DNA from Bacillus licheniformis. To remove the contamination from the Streptomyces sp. BRB040 data, a combination of tools available on the Galaxy platform and other web-based resources (MeDuSa and BLAST) was used. The contaminated reads were treated as a metagenome in order to isolate the genome of the contaminating organism. They were assembled with metaSPAdes, yielding a large scaffold of 4.187 Mb that was identified as Bacillus licheniformis. Once the contaminating organism had been identified, its genome was used as a filter: sequencing reads that aligned to it with Bowtie 2 were removed. After removal of the contaminating reads, a new assembly was performed with Unicycler, yielding 117 contigs with a total size of 7.9 Mb. The completeness of this genome, assessed with BUSCO, was 95.9%. We also used an alternative tool (BBduk) to remove the contaminating reads; the resulting Unicycler assembly comprised 85 contigs with a total size of 8.3 Mb and a completeness of 99.5%. These results were better than those obtained with SPAdes, which produced less complete genomes than Unicycler (maximum completeness of 97.8%) and was unable to adequately assemble the reads decontaminated with BBduk.
Compared with the uncontaminated BRB040 genome, which has a total size of 8.2 Mb and a completeness of 99.8%, the assembly built from the reads decontaminated with BBduk gave the better result, with a completeness only 0.3 percentage points below the reference. Genome mining of both genomes with antiSMASH 7.0 identified 24 biosynthetic gene clusters (BGCs) in the BBduk-derived assembly as well as in the BRB040 control assembly. The in silico decontamination process therefore allows genome mining of BGCs despite the loss of nucleotides. These findings show that contamination can be effectively removed from a genome using readily available online tools, while preserving a dataset suitable for extracting valuable insights into the secondary metabolism of the target organism. This approach is particularly beneficial in scenarios where resequencing samples is not immediately feasible.
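The decontamination workflow summarized above can be sketched as a sequence of command lines. This is only an illustrative outline: the study ran these tools through the Galaxy platform, so the command-line equivalents, all file names, the index prefix, and the BBduk k-mer length (`k=31`) are assumptions rather than the settings actually used. The sketch builds and prints the commands without executing them, and it assumes the contaminant scaffold has already been extracted from the metaSPAdes output into its own FASTA file.

```python
import shlex
from typing import List

Command = List[str]


def assemble_as_metagenome(r1: str, r2: str, outdir: str) -> Command:
    """metaSPAdes assembly of the contaminated paired-end reads."""
    return ["spades.py", "--meta", "-1", r1, "-2", r2, "-o", outdir]


def filter_with_bowtie2(contaminant_fasta: str, index_prefix: str,
                        r1: str, r2: str, clean_pattern: str) -> List[Command]:
    """Index the contaminant genome, then keep only read pairs that do not
    align concordantly to it. Bowtie 2 replaces '%' in the --un-conc-gz
    path with 1 or 2 for the two mates."""
    return [
        ["bowtie2-build", contaminant_fasta, index_prefix],
        ["bowtie2", "-x", index_prefix, "-1", r1, "-2", r2,
         "--un-conc-gz", clean_pattern, "-S", "/dev/null"],
    ]


def filter_with_bbduk(contaminant_fasta: str, r1: str, r2: str,
                      clean_r1: str, clean_r2: str, k: int = 31) -> Command:
    """Alternative k-mer based filter; reads matching the reference are
    discarded and the rest are written to out/out2. k=31 is an assumption."""
    return ["bbduk.sh", f"in={r1}", f"in2={r2}",
            f"out={clean_r1}", f"out2={clean_r2}",
            f"ref={contaminant_fasta}", f"k={k}"]


def assemble_with_unicycler(r1: str, r2: str, outdir: str) -> Command:
    """Short-read assembly of the decontaminated reads."""
    return ["unicycler", "-1", r1, "-2", r2, "-o", outdir]


def build_plan(r1: str, r2: str, contaminant_fasta: str) -> List[Command]:
    """Ordered commands for the Bowtie 2 route (all paths are hypothetical)."""
    plan = [assemble_as_metagenome(r1, r2, "metaspades_out")]
    plan += filter_with_bowtie2(contaminant_fasta, "contaminant_idx",
                                r1, r2, "clean_R%.fastq.gz")
    plan.append(assemble_with_unicycler("clean_R1.fastq.gz",
                                        "clean_R2.fastq.gz",
                                        "unicycler_out"))
    return plan


if __name__ == "__main__":
    for cmd in build_plan("BRB040_R1.fastq.gz", "BRB040_R2.fastq.gz",
                          "B_licheniformis_scaffold.fasta"):
        print(shlex.join(cmd))
```

Swapping `filter_with_bowtie2` for `filter_with_bbduk` in `build_plan` would give the second route described above; the resulting assemblies would then be assessed with BUSCO and mined with antiSMASH as in the study.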
Keywords: Bioinformatics; Genomics; NGS; Natural products.
© 2024. The Author(s) under exclusive licence to Sociedade Brasileira de Microbiologia.