Two novel methods for using genome sequences to infer taxonomy

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Next-generation sequencing (NGS) is catalysing a host of new developments across microbiology. Two papers recently published in Microbiology (Jolley et al., 2012; Bennett et al., 2012) describe methods that exploit NGS genome data to classify bacterial genomes based on core gene sequences. In general, these methods agree with 16S rRNA phylogenetic trees, but they have the added advantage of providing strain resolution within a given species. Furthermore, these approaches scale to large numbers of genomes, do not depend on a reference genome and accept input genomes in different forms, whether finished sequences or assemblies in multiple contigs. Both papers rely on a defined set of ‘core genes’: the 53 genes encoding ribosomal proteins in the case of ribosomal multi-locus sequence typing (rMLST; Jolley et al., 2012), or a set of core genes defined through comparative genomics (Bennett et al., 2012).

The introduction of MLST for strain identification (Maiden et al., 1998) provided the first sequence-based approach to strain resolution for many bacterial species. However, inexpensive whole-genome sequencing technologies now make it possible to sequence a bacterial genome for close to the price of sequencing the seven or so genes used for MLST. Many have therefore wondered about expanding the set of genes used, and indeed which genes might be optimal (for an example see Leekitcharoenphon et al., 2012). The rMLST method (Jolley et al., 2012) uses 53 genes encoding bacterial ribosomal proteins, which are found in nearly all bacteria. rMLST provides taxonomic and typing data from the same set of genes, which has obvious advantages. The authors conclude that ‘the ribosome occupies the interface between genotype and phenotype that is a required focus of microbiology in the post-genomic era of research’, and hence the choice of ribosomal proteins is a logical extension for typing.
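As a rough illustration of how MLST-style typing data can be compared, the sketch below treats each isolate as an allele profile (one allele number per locus) and counts the loci at which two profiles differ. It is a minimal sketch, not the rMLST implementation: the function name `compare_profiles`, the three locus names (standing in for the 53 ribosomal protein genes) and the allele numbers are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# A profile maps a locus name to an allele number; None marks a locus that
# could not be called (for example, a gene split across contigs).
Profile = Dict[str, Optional[int]]


def compare_profiles(a: Profile, b: Profile) -> Tuple[int, int]:
    """Return (differing loci, loci compared) for two allele profiles."""
    # Only loci called in both isolates are compared.
    compared = [
        locus for locus in a
        if locus in b and a[locus] is not None and b[locus] is not None
    ]
    differing = sum(1 for locus in compared if a[locus] != b[locus])
    return differing, len(compared)


# Hypothetical isolates and allele numbers, for illustration only.
isolate_1 = {"rpsA": 1, "rplB": 4, "rpmA": 2}
isolate_2 = {"rpsA": 1, "rplB": 7, "rpmA": None}

print(compare_profiles(isolate_1, isolate_2))  # prints (1, 2)
```

Skipping uncalled loci rather than counting them as differences is one possible design choice; it keeps draft assemblies in multiple contigs comparable with finished genomes.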

The second approach (Bennett et al., 2012) focussed on the genus Neisseria, which contains some members that can be difficult to classify using 16S rRNA. A set of 246 genes was found to be conserved across all 55 Neisseria genomes in the database, and these core genes were used to construct a tree for the sequenced Neisseria strains. The tree resolved seven groups, consistent with other known data. The authors also propose that, in some cases, the current names of the organisms are not consistent with the distances between their genome sequences. The resulting core gene tree is robust and is similar to that obtained using only the 53 ribosomal protein genes, as in the rMLST method. However, knowing the set of ‘core genes’ as well as the variable ‘accessory genes’ for a given set of organisms provides additional information that can be quite useful for understanding their underlying biology. The advantage of both methods is that they allow rapid, reproducible classification of bacterial groups based on genome sequences.
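The distinction between core and accessory genes can be sketched with simple set operations: core genes are those present in every genome, and accessory genes are those present in only some. This is a minimal sketch under stated assumptions, not the comparative genomics procedure used by Bennett et al.; the function name `split_core_accessory`, the strain labels and the gene identifiers are hypothetical, and gene presence is assumed to have been determined beforehand.

```python
from typing import Dict, Set, Tuple


def split_core_accessory(
    gene_content: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str]]:
    """Split genes into (core, accessory) given each genome's gene set."""
    if not gene_content:
        return set(), set()
    gene_sets = list(gene_content.values())
    core = set.intersection(*gene_sets)   # present in every genome
    pan = set.union(*gene_sets)           # present in at least one genome
    return core, pan - core


# Hypothetical gene content for three genomes, for illustration only.
genomes = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2"},
}

core, accessory = split_core_accessory(genomes)
print(sorted(core))       # prints ['g1', 'g2']
print(sorted(accessory))  # prints ['g3', 'g4']
```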

These methods offer novel ways to study the ever-increasing breadth and depth of bacterial genome data available to us; an obvious application is to metagenomic analyses of bacterial communities. The underpinning infrastructure provided by the Bacterial Isolate Genome Sequence Database (BIGSdb) provides a ready platform for users to interrogate allele definitions and strain data (Jolley & Maiden, 2010). The rMLST and core gene approaches will be useful tools for mining the bacterial genome data mountain.

References