Databases and software for the comparison of prokaryotic genomes

This mini-review examines the rapidly expanding field of comparative prokaryotic genomics, focusing on databases and software tools developed to analyze complete genome sequences. The authors note that public databases now contain over 2,500 genomes from bacteria, viruses, plasmids, eukaryotes, and organelles. The field has evolved from first-generation tools analyzing single genomes to next-generation comparative tools enabling simultaneous analysis of multiple genomes. Key resources include primary international databases (NCBI, EBI, DDBJ), specialized databases focused on pathogenic or specific organisms, and comparative genomic databases like KEGG and IMG. The review highlights emerging challenges including genome annotation, sequence alignment of millions of base pairs, and visualization of complex multi-genome comparisons. Future directions emphasize integrating population-level data, incorporating phylogenetic frameworks, developing better tools for identifying functions of hypothetical proteins, and creating standardized metadata for genomes. The authors stress that effective comparative genomics requires interdisciplinary collaboration between bioinformaticians, ecologists, and evolutionary biologists to fully exploit the biological insights available from expanding genome collections.

Key findings

Public databases contain over 2,500 complete genomes; comparative genomics tools have evolved from single-genome analysis to multi-genome comparison platforms
Available resources span primary international databases, specialized pathogen-focused databases, and comparative platforms offering functional annotation, pathway analysis, and ortholog identification
Critical future needs include improved annotation quality, better alignment algorithms for large genomic sequences, integration of population-level and phylogenetic data, and standardized organismal metadata
Multiple strains within a single bacterial species show surprising genetic diversity (e.g., only ~40% of genes are common to three E. coli strains), requiring population genomics approaches
Developing new statistical methods and visualization tools to detect genome mosaicism, recombination events, and evolutionary patterns across whole-genome datasets remains a major challenge

This summary was generated automatically from the article PDF and is not part of the original publication. Refer to the PDF for the authoritative text.

Abstract

The explosion in the number of complete genomes over the past decade has spawned a new and exciting discipline, that of comparative genomics. Exploiting the full potential of this approach requires the development of novel algorithms, databases and software that are sophisticated enough to draw meaningful comparisons between complete genome sequences and are widely accessible to the scientific community at large. This article reviews progress towards the development of computational tools and databases for organizing and extracting biological meaning from the comparison of large collections of genomes.
