Abstract
Abbreviations: ER, evolution rate; HGT, horizontal gene transfer; MAC, Mycobacterium avium complex; ML, maximum-likelihood; MLSA, multilocus/multigene sequence analysis; MTBC, Mycobacterium tuberculosis complex; NJ, neighbour-joining; PCA, principal component analysis; RG, rapidly growing mycobacteria; SDM, super distance matrix; SG, slow-growing mycobacteria
The strain numbers and GenBank sequence accession numbers of the type strains of the 125 species of the genus Mycobacterium used in this study and the details of the primers used for the amplification of the smpB gene are available in supplementary tables with the online version of this paper. A neighbour-joining phylogenetic tree and two additional figures constructed by multi-tree graphical analysis using TREEDYN are also available as supplementary figures.
The type strains of 125 species of the genus Mycobacterium were provided by Collection de l'Institut Pasteur, the Deutsche Sammlung von Mikroorganismen und Zellkulturen and the Culture Collection of the University of Göteborg (strain numbers and GenBank sequence accession numbers are available as Supplementary Table S1 in IJSEM Online). The DNA of Mycobacterium leprae was extracted from a clinical sample from a patient diagnosed with leprosy. The hsp65, tuf, tmRNA and smpB gene sequences were obtained from this clinical sample and the rpoB, sodA and 16S rRNA sequences were extracted from the complete genome of M. leprae (GenBank accession no. AL450380). We also included five strains provided by the Institut Pasteur of Mycobacterium kansasii that were previously studied by Picardeau (Picardeau et al., 1997). Mycobacterial DNA extraction was performed as described by Afghani & Stutman (1996).
PCR primers and conditions.
PCR was used to amplify segments of hsp65, rpoB, smpB, sodA, tuf, tmRNA and 16S rRNA, corresponding to 371 bp, 365 bp, 312 bp, 378 bp, 650 bp, 354 bp and 495 bp fragments, respectively. PCR protocols for hsp65, rpoB, sodA and 16S rRNA have been described previously (Devulder et al., 2005). PCR protocols for tuf and tmRNA were used as described previously (Mignard & Flandrois, 2007). For smpB, six primers were used successively (details of the primers are available in Supplementary Table S2 in IJSEM Online). The entire collection of strains was first tested with S5 and S6 primers. Any strains that were not amplified successfully were then tested with the S6 and S8 primers. Finally, any strains that had still failed to amplify were tested with the S2 and S3 primers. A 4 µl sample of DNA was added to the reaction mixtures that contained (at the final concentrations shown): Master Mix Fermentas (Fermentas Life Sciences; 1.25 U Taq polymerase, 200 µM each dNTP and 2 mM MgCl2), 1 mM MgCl2 (Roche Diagnostics) and 500 nM (800 nM when degenerated primers were used) of each forward and reverse primer (MWG) and 4 % DMSO. The PCR program used consisted of an initial denaturing step at 95 °C for 3 min, 30 denaturing cycles at 95 °C for 30 s, annealing at 61 °C for S5/S6, 65 °C for S6/S8 and at 53 °C for S2/S3 for 30 s, extension at 72 °C for 30 s and a final extension step at 72 °C for 7 min. A sample of complete mix plus DNA-free water was used as negative control. PCR products were electrophoresed in a 1.5 % agarose gel and visualized with ethidium bromide (Eurogentec) under UV light to check that the amplified product was the right size. The sequencing step, during which the amplicons were purified, was carried out by a service provider (Biofidal).
Sequence analyses and phylogenetic comparisons.
Out of the whole set of 132 strains (including 125 type strains), only 87 strains provided sequences for all seven genes (Table 1). The sequences for each gene from Nocardia farcinica IFM 10152 were extracted from GenBank (accession no. NC_006361) to constitute the external outgroup for each of the single-gene databases. The MUSCLE program (Edgar, 2004) was used for multiple alignments of the nucleotide sequences. The unmatched ends were deleted to obtain a homogeneous matrix of characters and thus increase the reliability of the tree obtained. Alignments were then concatenated for each strain as a supermatrix of characters. A first supermatrix was restricted to the 87 strains for which all the gene sequences were known (R set). A second supermatrix was then constructed including all 132 strain sequences, any missing data being left as gaps (Whole or W set).
Table 1. Evaluation of congruence and support quality of single gene trees and of the concatenated tree calculated by maximum-likelihood Values are given as percentages. Congruence was assessed from the proportion of nodes conserved in the two trees being compared. Support quality was assessed from the proportion of nodes with an aLRT statistic of >75 %.
Phylogenies were estimated by maximum-likelihood (ML) for each of the genes and for the two concatenated sets using PhyML-aLRT (Guindon & Gascuel, 2003). The substitution model was HKY85 (Hasegawa et al., 1985), which allows for different base frequencies. The transition/transversion ratios, the proportion of invariable sites and the Gamma distribution parameter that defines the shape of a gamma distribution were estimated by maximizing the likelihood of the phylogeny. A discrete-gamma distribution was used to allow for differing substitution rates between the sites and the substitution rate category was set to four. The starting tree(s) to be refined by the maximum-likelihood algorithm was a BIONJ (Gascuel, 1997) distance-based tree. This starting tree was optimized for topology, branch lengths and rate parameters (transition/transversion ratio, proportion of invariant sites, gamma distribution parameter). The aLRT statistics were used to check whether the branch being studied provided a significant gain in likelihood compared with the null hypothesis that involves collapsing that branch but leaving the topology of the rest of the tree unchanged (Anisimova & Gascuel, 2006). The aLRT statistics were interpreted using the non-parametric Shimodaira–Hasegawa-like procedure. The implementation of aLRT in PHYML was conducted according to J.-F. Dufayard from the original PHYML ().
The congruency of the single gene phylogenies was computed using CONSENSE from the PHYLIP package (Felsenstein, 2004). A supertree was computed from the seven individual gene phylogenies using the Super Distance Matrix (SDM) method (Criscuolo et al., 2006). The SDM method distorts the distance matrices of the source trees, without modifying their topological message, in order to bring them as close as possible to each other, and then integrates the evolutionary distances obtained from each gene into a single distance supermatrix. This new supermatrix was used for tree construction using FITCH with the PHYLIP package (Felsenstein, 2004). SDM was also used to compute an estimation of the evolution rates for each gene in the tree collection.
For each of the seven genes, the individual evolution rate of each strain was estimated from the distance between each leaf and the previous common node. These evolution rates were submitted to principal component analysis (PCA) using the R package. This analysis was only performed for the R set of 87 strains.
Phylogeny of mycobacteria using the concatenation of seven genes with maximum-likelihood and supertree SDMWe tested each of the seven single gene trees created with ML for intertree congruence and also against the concatenated tree. Congruence between individual genes made it possible to reconstruct the phylogenies by concatenating the genes as it provided a guarantee of the orthology of the genes and thus of the absence of HGT (Bapteste et al., 2005; Hamady et al., 2006). Congruence was evaluated by comparing the clusters present in each tree. The results reported in Table 1 show an acceptable level of congruency.
Concatenation is considered to be a low-level (or total evidence) method of supertree making. The technique can be hindered by missing data (phylogenetic accuracy is reduced), and by differences between the evolution rates (Daubin et al., 2001; Criscuolo et al., 2006). Missing data are therefore often excluded, which is what we did in the first part of this study, but this is not unavoidable if ML techniques are used for the phylogenetic reconstruction (Philippe et al., 2004). The total length of the concatenation was 2925 bp. A phylogenetic tree using neighbour-joining (NJ) and bootstrapping was constructed as a prerequisite (available as Supplementary Figure S1 in IJSEM online). By comparing the NJ and ML methods, the trees of the concatenated R set were found to be very similar. The usual clusters were globally respected as were the pathogen clusters. The first concatenated tree created was for the 87 strains in the R set (Fig. 1). The discriminating power was greater than that of the individual trees. The clusters that were supported by aLRT statistics of greater than 0.75 were identified. Clusters that were unresolved to the individual gene level were split into smaller clades, apart from the Mycobacterium tuberculosis complex (MTBC), the Mycobacterium avium complex (MAC), the Mycobacterium fortuitum subspecies and Mycobacterium piscinum–Mycobacterium smegmatis. Combined analysis increases the phylogenetic signal and reduces the discrepancies between individual trees, leading to improved accuracy of the phylogenetic reconstruction (Bapteste et al., 2005; Cantarel et al., 2006). Furthermore, nodes that were weakly supported by statistical analysis showed increasing levels of support in the concatenated tree, as previously noted by Brown et al. (2001). Support was enhanced, with 74 % of data having aLRT statistics greater than 0.75 (Table 1). The resulting phylogeny established a clear distinction between the RG and SG strains, with an early division into deep nodes that was well supported by aLRT statistics. The Mycobacterium terrae complex (M. terrae, Mycobacterium nonchromogenicum, Mycobacterium triviale and Mycobacterium hiberniae) was located within the SG group, where it was the most external cluster in the group.
|
We were not prepared to accept a situation in which only 87 of the 132 strains could be used for the phylogenetic analysis. We therefore studied the strains of the W set. We found that the phylogeny based on all 132 strains was no different from that for the 87 strains (Fig. 2). The clades inferred from the R set were also found in the W set-based tree, with only minor differences in their exact cluster positions. The SG/RG separation was identified in early nodes with far better support from aLRT statistics. Likewise, pathogen clusters were well conserved, such as Mycobacterium bolletii, Mycobacterium chelonae, Mycobacterium abscessus and Mycobacterium immunogenum, all of which remained grouped together. The phylogenetic positioning of the M. terrae complex was not clearly defined. The new position was outside both groups. M. triviale did not cluster with the other members of the M. terrae group. This new analysis maintained the previous topology, but it also made it possible to include species for which only incomplete data were available, which was a significant advantage. Overall, the aLRT statistics were better.
|
The second supertree method using SDM offered a good compromise for phylogenetic reconstruction involving a large number of species, notably with regard to execution time. Moreover, this method is said to allow for variable evolution rates and it yields phylogenies in which individual phylogenies are better conserved (Criscuolo et al., 2006). The supertree created with the complete data alone (R set) resulted in a phylogeny similar to that obtained by concatenation (Fig. 3). The SG were separated from the RG early in the nodes. The M. terrae complex occupied an unusual position, being the most outlying member of the RG group. The pathogenic species were clearly clustered, for instance the M. kansasii group and Mycobacterium haemophilum that was close to M. leprae. As expected from the individual trees, the resulting tree clearly separated species, but did not separate members of the MTBC, MAC or the subspecies of M. fortuitum. The SDM tree for the W set was different (Fig. 4). This is the first tree that really places the SG as a subgroup in the evolution of RG, as the separating node is located later. The tree proposes two groups of RG and also that the SG group is the final outcome of the evolution of the second RG group. The SDM tree cannot be bootstrapped or statistically tested as it is the best topological compromise between the individual trees. Its interpretation is slightly different from that of the tree resulting from concatenation. Its main strengths are that overall it preserves the individual trees and also takes into consideration the rate of evolution. The M. terrae complex was correctly positioned within the SG group by this last version of the analysis.
|
|
The congruence between the supertrees (concatenated and SDM) and sets R and W was checked using TREEDYN and the clusters were coloured by multitree graphical analysis (Chevenet et al., 2006) (graphs are available as Supplementary Figs S2 and S3 in IJSEM Online). Using the R set, a comparison of the concatenated tree and the SDM tree revealed few differences. The M. terrae complex was not stable, as it shifted from the RG to the SG cluster. When considering the species of the SG, the groups were well conserved by both methods, apart from Mycobacterium triplex and Mycobacterium conspicuum. In contrast, when considering the species of the RG for the eight clusters created in the concatenated tree, we found that three were modified in the SDM tree. The analysis of the W set led to further modifications. Overall, SG clusters were well conserved in both trees, but RG groups were split up more often. The major difference was the situation of the whole SG group in the SDM tree. At the beginning of the study, we had no evidence in favour of any specific method. The major advantage of concatenation, which seems to be the method most often used for bacterial phylogeny, is that additional data, even if incomplete (as shown here), do not modify the global phylogeny. The M. terrae complex is unusual because of its versatile position. This position did not seem to be due to missing data as only one gene was missing for two species and the two remaining species were closely linked in both trees. The unusual positions were also visible in individual trees (data not shown).
Concatenation therefore made it possible to position novel species with an acceptable degree of confidence. The supertree method using SDM appeared to be one of the best programs at the bioinformatic level, but its main advantage was topological as it conserved the global topology of individual phylogenies. This method has previously been studied and used on genomic data with larger genomic distances than bacteria (i.e. mammals) (Criscuolo et al., 2006). In the present study, we examined data from a bacterial genus consisting of 132 strains (125 type strains) that were genomically closer than mammalian strains. To be able to assess this method and find out whether it really does provide a true representation of the history of evolution, we need additional data for the bacteria. SDM also provided a new scenario for the evolution of mycobacteria and it has not yet been possible to work out which scenario is the most accurate.
In the course of this study we performed a multilocus analysis as recommended by the ad hoc committee, using the sequences of five housekeeping genes. Adding two highly conserved ribosomal genes did not hinder the phylogenetic analysis. We also tested the hypothesis using only the five housekeeping genes and obtained results that were similar (data not shown). This matches the results observed by Daubin et al. (2001), when two independent trees were built up from informational and operational genes and no differences were demonstrated between the two trees.
We observed that the two reconstruction methods did not take the variability of ER properly into account. The concatenation attenuated the ER information, whereas SDM conserved it better.
Different evolution rates between genes and between species for a given gene
The ER was estimated by SDM, 1/αp providing a particularly good estimation. From the results presented in Table 2, it is apparent that smpB and sodA are the most rapidly evolving genes, whereas 16S rRNA is the slowest evolving gene. We represented the ER of each strain for the seven genes in the principal component analysis (PCA). Axes 1 and 2 were used to explore 67 % of the variance (data not shown). The 16S rRNA and smpB genes were in opposition along axis 1. This is consistent with the fact that the ERs of most of the species show opposite tendencies for these two genes. The tmRNA gene made the greatest contribution to axis 3. The other genes did not make any significant contribution to any axis. When displaying the strains along axes 1 or 2, i.e. with smpB and 16S rRNA participation, we clearly identified two groups, one including RG mycobacteria was in one area, with SG mycobacteria in another. The RG group was homogeneous and the ER for the 16S rRNA gene was higher in the RG group than for the SG group. The reverse was true for the smpB gene (SG evolved more rapidly than RG). On the basis of the ER, the M. terrae complex was included with RG, whereas these strains had been classified (based on their phenotypic characteristics) as SG mycobacteria. The SG group was more scrambled than the RG group. The slowest-evolving species for the smpB gene were the M. kansasii cluster, the MTBC and M. leprae/M. haemophilum. The most rapidly evolving were the MAC and M. intracellulare/M. chimaera. The other species formed a continuum between these two groups.
Table 2. Evaluation of the ER of each gene calculated for sets R and W ER was calculated (1/αp) using SDM for both sets of data. Set R=87 species, set W=132 species.
This clear division of mycobacteria into two groups was observed only for these two genes on the basis of species ER. We were not able to link this observation to the contrasting ERs of the two genes. We cannot explain why these two genes with differing ER values clearly separated the two groups of mycobacteria. As suggested by Ventura et al. (2006), there could be a relationship between bacterial gene evolution and gene function when differences in ER are observed.
The genes examined in this study were independent. It could be useful for a global representation of phylogeny to know that two rapidly evolving genes (smpB and sodA) were buffered by two slowly evolving ones (16S rRNA and tmRNA), but this also had the drawback of reducing the phylogenetic signal. Choosing the best predictor genes for bacterial phylogeny remains a major problem for biologists. The criteria proposed by Zeigler (2003) seem to be essential. We have added the notion of evolution rate to the criteria and have shown that there is no absolute need for a complete set of data in MLSA.
This study examined an important aspect of using genomic sequences to address evolutionary questions. We studied the whole collection of mycobacterial type strains (as available at 1 January 2005) by a combined analysis of seven genes and two reliable phylogenetic reconstruction methods. Concatenation resulted in a well-supported phylogeny in which the missing data did not influence the topology. We can therefore conclude that this method will tolerate additional data well, even if they are incomplete, and still provide a reliable positioning of novel strains. The SDM supertree method was better at preserving both the individual tree topology and the variability of ER. When incomplete data were added, a different topology was obtained. We conclude that the genes studied had differing ERs and noticed that SG and RG strains tend to have contrasting ERs. The phylogenetic approach must take this variability into account. Gene concatenation is the commonest method used in phylogenetic reconstruction, but this is unfortunate as this method does not take the ER of the genes into account. Further studies are required before any conclusions can be drawn about the variability of ER or about the usefulness of SDM for bacterial phylogenetic reconstruction.
We are grateful to G. Fardel for technical assistance. We thank V. Guerin for her help with PCA.References
Afghani, B. & Stutman, H. R. (1996). Polymerase chain reaction for diagnostic of M. tuberculosis: comparison of simple boiling and a conventional method for DNA extraction. Biochem Mol Med 57, 14–18.[CrossRef][Medline]
Anisimova, M. & Gascuel, O. (2006). Approximate likelihood-ratio test for branches: a fast, accurate and powerful alternative. Syst Biol 55, 539–552.[CrossRef][Medline]
Bapteste, E., Susko, E., Leigh, J., MacLeod, D., Charlebois, R. L. & Doolittle, W. F. (2005). Do orthologous gene phylogenies really support tree-thinking? BMC Evol Biol 5, 33[CrossRef][Medline]
Blackwood, K. S., He, C., Gunton, J., Turenne, C. Y., Wolfe, J. & Kabani, A. M. (2000). Evaluation of recA sequences for identification of Mycobacterium species. J Clin Microbiol 38, 2846–2852.
Brown, J. R., Douady, C. J., Italia, M. J., Marshall, W. E. & Stanhope, M. J. (2001). Universal trees based on large combined protein sequence data sets. Nat Genet 28, 281–285.[CrossRef][Medline]
Cantarel, B. L., Morrison, H. G. & Pearson, W. (2006). Exploring the relationships between sequences similarity and accurate phylogenetic trees. Mol Biol Evol 23, 2090–2100.
Charles, L., Carbone, I., Davies, K. G., Bird, D., Burke, M., Kerry, B. R. & Opperman, C. H. (2005). Phylogenetic analysis of Pasteuria penetrans by use of multiple genetic loci. J Bacteriol 187, 5700–5708.
Chevenet, F., Brun, C., Banuls, A. L., Jacq, B. & Christen, R. (2006). TreeDyn: towards dynamic graphics and annotations for analyses of trees. BMC Bioinformatics 7, 439[CrossRef][Medline]
Criscuolo, A., Berry, V., Douzery, E. J. & Gascuel, O. (2006). SDM: a fast distance-based approach for (super) tree building in phylogenomics. Syst Biol 55, 740–755.[CrossRef][Medline]
Daubin, V., Gouy, M. & Perriere, G. (2001). Bacterial molecular phylogeny using supertree approach. Genome Informatics 12, 155–164.[Medline]
Devulder, G., Pérouse de Montclos, M. & Flandrois, J. P. (2005). A multigene approach to phylogenetic analysis using the genus Mycobacterium as a model. Int J Syst Evol Microbiol 55, 293–302.
Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with accuracy and high throughput. Nucleic Acids Res 32, 1792–1797.
Felsenstein, J. (2004). PHYLIP (phylogeny inference package), version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, USA.
Galtier, N., Gouy, M. & Gautier, C. (1996). SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput Appl Biosci 12, 543–548.
Gascuel, O. (1997). BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 14, 685–695.[Abstract]
Guindon, S. & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52, 696–704.
Hamady, M., Betterton, M. D. & Knight, R. (2006). Using the nucleotide substitution rate matrix to detect horizontal gene transfer. BMC Bioinformatics 7, 476[CrossRef][Medline]
Hasegawa, M., Kishino, H. & Yano, R. A. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22, 160–174.[CrossRef][Medline]
Kim, B. J., Lee, S. H., Lyu, M. A., Kim, S. J., Bai, G. H., Him, S. J., Chae, G. T., Kim, E. C., Cha, C. Y. & Kook, Y. H. (1999). Identification of mycobacterial species by comparative sequence analysis of the RNA polymerase gene (rpoB). J Clin Microbiol 37, 1714–1720.
Kim, H., Kim, S. H., Shim, T. S., Kim, M., Bai, G. H., Park, Y. G., Lee, S. H., Chae, G. T., Cha, C. Y. & other authors (2005). Differentiation of Mycobacterium species by analysis of the heat-shock protein 65 gene (hsp65). Int J Syst Evol Microbiol 55, 1649–1656.
Mignard, S. & Flandrois, J. P. (2007). Identification of Mycobacterium using the EF-tu encoding (tuf) gene and the tmRNA encoding (ssrA) gene. J Med Microbiol 56, 1033–1041.
Philippe, H., Snell, E. A., Bapteste, E., Lopez, P., Holland, P. W. H. & Casane, D. (2004). Phylogenomics of Eukaryotes: impact of missing data on large alignments. Mol Biol Evol 21, 1740–1752.
Picardeau, M., Prod'hom, G., Raskine, L., Lepennec, M. P. & Vincent, V. (1997). Genotypic characterization of five subspecies of Mycobacterium kansasii. J Clin Microbiol 35, 25–32.
Stackebrandt, E., Frederiksen, W., Garrity, G.M., Grimont, P. A., Kämpfer, P., Maiden, M. C., Nesme, X., Rosselló-Mora, R., Swings, J. & other authors (2002). Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52, 1043–1047.[Abstract]
Staley, J. T. (2006). The bacterial species dilemma and the genomic-phylogenetic species concept. Phil Trans R Soc B 361, 1899–1909.[CrossRef][Medline]
Tortoli, E. (2003). Impact of genotypic studies on Mycobacteria taxonomy: the new Mycobacteria of the 1990s. Clin Microbiol Rev 16, 319–354.
Ventura, M., Canchaya, C., Del Casale, A., Dellaglio, F., Neviani, E., Fitzgerald, G. F. & Van Sinderen, D. (2006). Analysis of bifidobacterial evolution using a multilocus approach. Int J Syst Evol Microbiol 56, 2783–2792.
Zeigler, D. R. (2003). Gene sequences useful for predicting relatedness of whole genomes in bacteria. Int J Syst Evol Microbiol 53, 1893–1900.
Zolg, J. W. & Philippi-Schulz, S. (1994). The superoxide dismutase gene as a target for detection and identification of mycobacteria by PCR. J Clin Microbiol 32, 2801–2812.