Genomic analysis of diverse rubella virus genotypes

Abstract

Based on the sequence of the E1 glycoprotein gene, two clades and ten genotypes of Rubella virus have been distinguished; however, genomic sequences have been determined for viruses in only two of these genotypes. In this report, genomic sequences for viruses in an additional six genotypes were determined. The genome was found to be well conserved. The viruses in all eight of these genotypes had the same number of nucleotides in each of the two open reading frames (ORFs) and the untranslated regions (UTRs) at the 5' and 3' ends of the genome. Only the UTR between the ORFs (the junction region) exhibited differences in length. Of the nucleotides in the genome, 78 % were invariant. The greatest observed distance between viruses in different genotypes was 8.74 % and the maximum calculated genetic distance was 14.78 substitutions in 100 sites. This degree of variability was similar among regions of the genome with two exceptions, both within the P150 non-structural protein gene: the N-terminal region that encodes the methyl/guanylyltransferase domain was less variable, whereas the hypervariable domain in the middle of the gene was more divergent. Comparative phylogenetic analysis of different regions of the genome was done, using sequences from 43 viruses of the non-structural protease (near the 5' end of the genome), the junction region (the middle) and the E1 gene (the 3' end). Phylogenetic segregation of sequences from these three genomic regions was similar with the exception of genotype 1B viruses, among which a recombinational event near the junction region was identified.

Supplementary tables are available in JGV Online.

Rubella virus is an important human pathogen that causes an acute, contagious disease known as rubella, 3-day measles or German measles, and severe birth defects (known as congenital rubella syndrome) when infection occurs during the first trimester of pregnancy (Chantler et al., 2001). Rubella virus is the single member of the genus Rubivirus in the family Togaviridae and is an enveloped, single-stranded, positive-polarity RNA virus with a genome of approximately 10 kb. The genome contains two long open reading frames (ORFs): the 5'-proximal ORF (NSP-ORF) encodes two non-structural proteins, P150 and P90, that function in RNA replication, and the 3'-proximal ORF (SP-ORF) encodes three structural proteins: the capsid protein, C, and two envelope glycoproteins, E1 and E2. The SP-ORF is translated from a subgenomic RNA synthesized in infected cells (Frey, 1994). The genome also contains untranslated regions (UTRs) at its 5' and 3' ends and between the ORFs (known as the junction region).

Although rubella occurs worldwide, vaccination efforts with live-attenuated vaccines have been concentrated in developed countries. Currently, approximately 50 % of countries have national vaccination efforts against rubella (Robertson et al., 2003). Isolation and genetic sequencing of rubella viruses has been most thorough in countries pursuing elimination (Bosma et al., 1996; Frey et al., 1998; Icenogle et al., 2006; Katow, 2004; Katow et al., 1997a, b; Reef et al., 2002; Saitoh et al., 2006); however, collections have recently been assembled from other regions of the world (Donadio et al., 2003; Katow, 2004; Zheng et al., 2003a, c). Recently, a standard taxonomy for rubella viruses was adopted based on sequences of a standard window within the E1 gene and supported by sequencing of the SP-ORF of selected viruses (WHO, 2005). The taxonomy consists of two clades [corresponding to the previous genotypes I and II (Frey et al., 1998; Zheng et al., 2003a)] containing a total of ten genotypes, seven in clade 1 (1a, 1B, 1C, 1D, 1E, 1F and 1g) and three in clade 2 (2A, 2B and 2c); the genotypes designated in lower case are provisional. Within the E1 gene, maximal variation among clade 1 viruses is 5.8 %, that among clade 2 viruses is 8.0 %, and it is 8.2 % between the two clades (Zheng et al., 2003a). Geographically, clade 1 viruses circulate worldwide, whilst clade 2 viruses thus far have been restricted to Eurasia.

Thus far, ten complete genomic sequences of Rubella virus have been reported, which represent only two of the ten genotypes (eight sequences of genotype 1a viruses and two sequences of genotype 2A viruses). Among these sequences, genomic genetic variability is similar to that in the E1 gene, with the exception of a hypervariable region (HVR) of greater variability in the middle of the P150 non-structural protein gene (Hofmann et al., 2003; Zheng et al., 2003b). Given the lack of representation of the majority of the genotypes in the current genomic database, the first goal of this study was to expand the number of genomic sequences, using viruses in our collection from six additional genotypes (1B, 1C, 1D, 1E, 2B and 2c). The second goal of this study was to extend phylogenetic analysis to 5' regions of the genome, which had not previously been done. To this end, the sequence of the non-structural protease-encoding region within the P150 gene was determined and compared phylogenetically with the sequences of the junction region and the E1 gene from 43 viruses representing eight genotypes.

Viruses, cells, RNA extraction, cDNA amplification and DNA sequencing.
The viruses analysed in this study are listed in Table 1 (genomic sequences) and Supplementary Table S1, available in JGV Online (genomic regions). The Cba strain was provided by Dr Marta Zapata, University of Cordoba, Cordoba, Argentina. A monolayer of Vero cells (25 cm² T-flask or 60 mm plate) was infected with each virus. Three to five days post-infection, the culture medium was removed and total cellular RNA was extracted by using Tri-Reagent (Molecular Research Center) according to the manufacturer's protocol. The RNA extracted from one 25 cm² T-flask or one 60 mm plate was resuspended in 50 µl double-distilled H₂O and stored at −80 °C until use. cDNA was synthesized in a 20 µl total reaction volume containing 5 µl denatured (95 °C, 5 min) RNA template, 4 µl 5x Reverse Superscript buffer (Invitrogen), 4 µl 2.5 mM dNTPs, 1 µl 0.1 M dithiothreitol, 1 µl 4 µM 3'E1/808 reverse primer (5'-TTTTTTTTTCTATACAGCAAC-3'; T₉ followed by the complement of nt 9751–9762 of the rubella virus genome), 1 µl (40 units) RNasin (Promega) and 1 µl (200 units) Superscript reverse transcriptase III (Invitrogen). The reaction was incubated at 55 °C for 60 min and then stored at −20 °C prior to use in PCR. Each 50 µl PCR contained 25 µl 2x GC buffer I&II (TaKaRa), 8 µl 2.5 mM dNTPs, 3 µl cDNA template, 1 µl 40 µM appropriate forward and reverse primers and 0.5 µl (2.5 units) LA Taq polymerase (TaKaRa). Cycling parameters were determined according to the manufacturer's (TaKaRa) recommendations. For genomic sequencing, 10–11 overlapping fragments encompassing the entire genome were amplified by using appropriate primer pairs. The primers used to amplify the genomic regions are listed in Supplementary Table S2, available in JGV Online. Amplified fragments were purified following agarose-gel electrophoresis by using a QIAquick gel extraction kit (Qiagen). Sequencing reactions were performed bidirectionally by using appropriate primers and cycle-sequencing kits (ABI PRISM BigDye Terminator v. 3.1; PE Applied Biosystems) and resolved by using a 3100 Genetic Analyzer (Applied Biosystems). The 5' and 3' ends of the genome were determined by using a 5'/3' FirstChoice RLM-RACE kit (Ambion Inc.).

Table 1. Rubella virus genomic sequences used in this study

Sequence analysis.
For cataloguing and storage, sequences were input into free online sequence-alignment software (ALIGN Query, GENESTREAM SEARCH network server IGH, Montpellier, France). The assembled nucleotide sequences were aligned by using the CLUSTAL_W multiple sequence-alignment program version 1.8 (Henikoff & Henikoff, 1994) and the PileUp program in the GCG software package (Genetics Computer Group, version 11.0; Accelrys Inc.). The TN93 substitution model (Tamura & Nei, 1993) with discrete gamma-distributed rate heterogeneity with eight gamma rate categories (Yang, 1994) (TN93+γ model) was used for phylogenetic reconstruction, as it was found statistically to be the best fit for our datasets. Maximum-likelihood (ML) phylogenetic analysis was performed by using the TREE-PUZZLE program version 5.2 (Strimmer & von Haeseler, 1996). ML genetic distances and nucleotide-substitution statistical parameters were estimated under the selected TN93+γ substitution model with an initial neighbour-joining tree and then the best ML tree was reconstructed with these optimized parameters by using the quartet-puzzling method in the TREE-PUZZLE program. Fifty thousand and one hundred thousand quartet-puzzling steps were performed in constructing trees from the 19 genomic and 43 genomic region sequences, respectively. Sequence similarities and observed distances were calculated by using the Old Distance program in the GCG software package. A nucleotide sequence similarity plot across the genome (100 nt window) was generated by using the PLOTSIMILARITY program in the GCG software package. As the sequences of genotype 1a viruses were over-represented, the plot was generated by using six sequences from each clade, including members from each genotype. Nucleotide sequence substitution-rate analysis was carried out with PILEUP (GCG package), fastDNAml (version 1.2.2) and DNArates (version 1.1.0), employing default parameters.
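As a rough illustration of the observed distance and the windowed similarity profile described above, the following Python sketch computes both from an alignment held as equal-length strings. It is not the GCG Old Distance or PLOTSIMILARITY implementation; the function names, the gap handling and the toy sequences are assumptions made purely for illustration.

```python
from itertools import combinations


def observed_distance(a, b):
    """Percentage of aligned, ungapped positions at which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    diffs = sum(1 for x, y in pairs if x != y)
    return 100.0 * diffs / len(pairs)


def window_similarity(alignment, window=100, step=1):
    """Mean pairwise identity (0 to 1) for each window along the alignment.

    Returns a list of (1-based window start, mean identity) tuples.
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("sequences must be aligned to equal length")
    profile = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in alignment]
        scores = [1.0 - observed_distance(a, b) / 100.0
                  for a, b in combinations(chunk, 2)]
        profile.append((start + 1, sum(scores) / len(scores)))
    return profile


# Toy alignment (hypothetical data, not rubella virus sequence).
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
profile = window_similarity(toy, window=5, step=5)
```

With real data, the alignment would hold the genomic sequences and `window=100` would match the window size used for the plot in this study.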
To detect recombination, phylogenetic analysis of sequences on either side of putative break points was conducted by using TREE-PUZZLE with the same parameter settings as were used in the genomic sequence analysis. Recombination was also analysed by using the sequence recombination-detection programs TOPALi (Milne et al., 2004), RIP 2.0 (Recombination Identification Program) and the four-cluster likelihood mapping analysis in the TREE-PUZZLE program.

Genomic sequences and comparisons
A representative rubella virus phylogenetic tree based on the standard E1 gene window recommended by the WHO (nt 8291–9469) and containing the reference viruses for each genotype and the ten viruses for which the genomic sequence has been determined is shown in Fig. 1. As the previously reported genomic sequences were from genotype 1a and 2A viruses only, the genomic sequences of nine representative viruses from six additional genotypes were determined (Table 1). Among these 19 viruses, with three exceptions, the genomes were 9762 nt in length and consisted (5'→3') of a 40 nt 5' UTR, a 6351 nt NSP-ORF, a 120 nt junction region, a 3192 nt SP-ORF and a 59 nt 3' UTR. All three exceptions were in the junction region: the genome of one of the genotype 1B viruses (GUZ_GER92) was 9760 nt in length because it had a deletion of 2 nt at positions 6480–6481 (between the end of the NSP-ORF and the SG RNA start site) and the genomes of both genotype 2B viruses were 9761 nt in length because they had a deletion of 1 nt at position 6422 (between the SG RNA start site and the start of the SP-ORF). Fig. 2 shows a similarity plot across 12 genomic sequences proportionally representing all eight genotypes. Overall variability averaged approximately 7 % and was roughly comparable across the genome, with the exceptions of the 5'-terminal approximately 400 nt, a region encoding the methyl/guanylyltransferase (MT) domain within the P150 gene that exhibited variability of approximately 4 %, and a region of the P150 gene encompassing nt 2100–2400, the HVR, in which local variability peaked at up to 18 %. Pairwise observed genomic distances between viruses in different genotypes (see Supplementary Table S3, available in JGV Online) ranged from 2.0 to 8.7 %. The range of pairwise observed distances for genomic regions (the five genes and domains within P150 and P90) and the 3' cis-acting element (3'CAE) is given in Table 2. Maximal observed distances of these regions ranged from 8.29 to 11.44 %, with the exception of the MT domain (5.24 %) and the HVR (21.18 %). Given this limited degree of variability across most of the genome, it was not surprising that 78 % of the nucleotides in the genome were invariant across the 19 sequences (Table 2).
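The genome layout given above can be checked with a few lines of arithmetic. The Python sketch below is only a worked example: the segment lengths are those reported in the text, whereas the derived start and end coordinates assume that the segments are contiguous and numbered from nt 1, and the names `SEGMENTS` and `segment_coordinates` are hypothetical.

```python
# Segment lengths (nt) as reported for the 9762 nt genomes.
SEGMENTS = [
    ("5' UTR", 40),
    ("NSP-ORF", 6351),
    ("junction region", 120),
    ("SP-ORF", 3192),
    ("3' UTR", 59),
]


def segment_coordinates(segments):
    """Return {name: (start, end)} in 1-based inclusive coordinates,
    assuming the segments follow one another without gaps."""
    coords = {}
    position = 1
    for name, length in segments:
        coords[name] = (position, position + length - 1)
        position += length
    return coords


COORDS = segment_coordinates(SEGMENTS)
GENOME_LENGTH = sum(length for _, length in SEGMENTS)

# Genome lengths of the three exceptions, from the junction-region deletions.
LENGTH_WITH_2NT_DELETION = GENOME_LENGTH - 2  # one genotype 1B virus
LENGTH_WITH_1NT_DELETION = GENOME_LENGTH - 1  # both genotype 2B viruses
```

Under these assumptions the segment lengths sum to 9762 nt and the SP-ORF starts at nt 6512, which agrees with the start of the C gene stated in the recombination analysis.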

Fig. 1. Rubella virus E1-based phylogenetic tree, constructed from sequences available in GenBank using the E1 gene window (nt 8291–9469) with MrBayes software (version 3.1; 600 000 ngen, 100 samplefreq, 4 nchains, 250 burnin) recommended in the WHO Rubella Nomenclature Report (WHO, 2005). Clades and genotypes are shown (two sub-branches within genotype 1B are indicated by dashed lines). Reference strains of each genotype designated in the WHO report are indicated by Ref, strains whose genomic sequence has been reported previously are indicated by stars and strains whose genomic sequences were determined in this study are indicated by arrows.

Fig. 2. Rubella virus genome nucleotide-similarity plot. The plot was generated from 12 genome sequences proportionally representing the two clades and eight genotypes by using the PLOTSIMILARITY program in the GCG software package with a 100 nt window. A genomic map is overlaid above the plot that shows the two ORFs, five genes, regions/domains within the NS-ORF (MT, methyl/guanylyltransferase; HVR, hypervariable region; XD, X domain; NP, non-structural protease; HEL, helicase; RdRp, RNA-dependent RNA polymerase or replicase) and the 3' cis-acting elements (3'CAE). The regions from which sequences were determined from 43 viruses for comprehensive phylogenetic analysis (see Fig. 5) are also denoted (NP, non-structural protease; JR, junction region; E1, WHO standard E1 window).

Table 2. Maximum-likelihood parameters calculated for regions of the rubella virus genome

The additional sequences contributed by this report greatly expand information on the genomic diversity of rubella viruses and, therefore, we took the opportunity to calculate a number of evolutionary parameters (Table 2). The maximum pairwise genetic distance (see Supplementary Table S3, available in JGV Online) was 14.78 substitutions in 100 sites, greater than the largest observed distance (8.74 observed substitutions in 100 sites). The maximal genetic distances for the genomic regions (the five genes, domains within P150 and P90 and the 3'CAE) ranged from 13.97 to 23.0 substitutions in 100 sites, with the exception of the MT domain (6.77 substitutions in 100 sites) and the HVR (35.85 substitutions in 100 sites). The transition/transversion site parameter (K) was 7.04 for the entire genome and ranged among genomic regions from 4.5 (HVR) to 13.35 (3'CAE). The pyrimidine/purine (Y/R) transition parameter of the entire genome was 2.7 and varied among the genomic regions from 0.93 (X domain of the P150 gene) to 3.7 (C gene). To test whether the observed distance simply underestimates the genetic distance or whether substitution saturation has been reached, the pairwise numbers of transitions and transversions were plotted as a function of the calculated genetic distance (by using DAMBE; Xia & Xie, 2001), with the result that both transitions and transversions increased linearly with genetic distance, with the number of transitions being higher than that of transversions (data not shown). Neither reached a plateau, indicating that substitution saturation had not occurred.
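To make the substitution categories behind these parameters concrete, the following Python sketch tallies purine transitions, pyrimidine transitions and transversions between two aligned sequences. These are raw pairwise counts, not the ML estimates of K and Y/R obtained under the TN93+γ model, and the function name and toy sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitutions(a, b):
    """Count substitution types between two aligned nucleotide sequences.

    Positions with gaps or ambiguous bases are skipped; U is read as T.
    """
    counts = {
        "compared": 0,
        "purine_transitions": 0,      # A <-> G
        "pyrimidine_transitions": 0,  # C <-> T
        "transversions": 0,           # purine <-> pyrimidine
    }
    seq_a = a.upper().replace("U", "T")
    seq_b = b.upper().replace("U", "T")
    for x, y in zip(seq_a, seq_b):
        if x not in "ACGT" or y not in "ACGT":
            continue
        counts["compared"] += 1
        if x == y:
            continue
        if x in PURINES and y in PURINES:
            counts["purine_transitions"] += 1
        elif x in PYRIMIDINES and y in PYRIMIDINES:
            counts["pyrimidine_transitions"] += 1
        else:
            counts["transversions"] += 1
    return counts


# Toy pair (hypothetical data): one A/G change, two C/T changes, one G/T change.
counts = classify_substitutions("ACGTACGTAC", "GCGCACTTAT")
transitions = counts["purine_transitions"] + counts["pyrimidine_transitions"]
```

Plotting such transition and transversion counts against the model-based genetic distance for every sequence pair is the kind of saturation check described above; a plateau in either curve would suggest saturation.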

Among the 19 genomic sequences, 78 % of the nucleotides were invariant. Not surprisingly, the parameter of rate heterogeneity, α, was 0.22 for the entire genome and varied between 0.19 and 0.33 across the genomic regions, with the exception of the HVR, within which α=1.35. These small α values indicated a strong substitution-rate heterogeneity among nucleotide sites across most of the genome (i.e. more than three-quarters of the nucleotides remained constant, whilst fewer than one-quarter exhibited variability). Within the HVR, 46 % of the nucleotides were variable.
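The meaning of a small α can be illustrated by evaluating the gamma density of relative substitution rates directly. The Python sketch below assumes the usual parameterization with shape α and mean 1, and uses a continuous density rather than the eight discrete rate categories applied in the analysis; the function name is hypothetical.

```python
import math


def gamma_rate_density(rate, alpha):
    """Density at a relative rate of a gamma distribution with shape alpha and mean 1."""
    if rate <= 0 or alpha <= 0:
        raise ValueError("rate and alpha must be positive")
    # Work on the log scale to stay numerically stable for small alpha.
    log_density = (alpha * math.log(alpha)
                   + (alpha - 1.0) * math.log(rate)
                   - alpha * rate
                   - math.lgamma(alpha))
    return math.exp(log_density)


# alpha values reported for the whole genome (0.22) and for the HVR (1.35).
genome_profile = [gamma_rate_density(r, 0.22) for r in (0.01, 0.5, 2.0)]
hvr_profile = [gamma_rate_density(r, 1.35) for r in (0.01, 0.26, 2.0)]
```

For α=0.22 the density falls steadily as the rate increases, so most sites sit at rates near zero and a few evolve quickly. For α=1.35 the density instead peaks at an intermediate rate, near (α−1)/α≈0.26.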

Phylogenetic analysis
ML phylogenetic trees constructed from the complete genomic sequences, as well as from the NSP- and SP-ORFs, are displayed in Fig. 3. As in the E1-based tree, the six clade 2 sequences formed a clear, consistent branching pattern with high support values in all three trees, indicating that genotypes 2A and 2B are related more closely to each other than to genotype 2c. In clade 1, the groupings of genotypes 1B, 1C, 1D and 1E on the three trees were consistent: the two genotype 1B sequences formed a branch, as did the individual genotype 1D and 1E sequences, whilst the individual genotype 1C sequence extended from the baseline with no relative relationship to other genotypes. On the genomic and SP-ORF trees, the eight genotype 1a sequences grouped into four pairs, indicated in Fig. 3 as a1 (TO-w and TO-v; a wild-type parent and the attenuated vaccine derived from it), a2 (Fth and RA27/3; both isolated in the north-eastern USA in 1964), a3 (CEN and M33; isolated from Europe and the USA in 1961–1962) and a4 (SUR and ULR; both isolated from eastern Europe in 1974 and 1984). Interestingly, on the NSP-ORF tree, M33 separated from CEN (a3) and clustered with the a4 grouping. We also constructed trees by using the sequences of the genes and regions within the NSP-ORF and SP-ORF (data not shown), with the result that they had the same general topology as the ORF-generated trees. The exception was the HVR-generated tree, on which each of the clade 1 viruses formed an individual branch, apart from the a1 and M33-SUR groupings, which were preserved.

Fig. 3. Phylogenetic trees of genomic, NS-ORF and SP-ORF sequences. Trees were constructed by using TREE-PUZZLE (version 5.2) with 50 000 puzzle steps; reliability values are indicated on each node. Genotypes are denoted, including four subclusters of genotype 1a (1a-1 to 1a-4). Genetic distance (substitutions in 100 nt) calculated by using the TN93+γ substitution model is indicated by the bar below each tree.

Extensive phylogenetic analysis has not been done previously using sequences from 5' regions of the genome. To do so, trees were constructed from the sequences of the non-structural protease (NP) region (nt 3035–3973; Fig. 2), the junction region and adjacent sequences (JR; nt 6351–6829, which includes the 3' end of the P90 gene, the UTR between the ORFs and the 5' end of the C gene) and the E1 gene [nt 8731–9469, the recommended window for routine genotyping (WHO, 2005)] of 43 viruses representing eight genotypes. ML phylogenetic trees constructed from these sequences are shown in Fig. 4(a). Clustering of viruses on the three trees was similar, with the exception of a group of seven genotype 1B viruses that formed a single branch on the NP tree and two branches (one of five and one of two viruses) on the JR tree, but did not form a cluster on the E1 tree. The single-nucleotide deletion at nt 6422 in the junction region detected in the genome sequence of the two genotype 2B viruses was confirmed in the JR sequences determined from three additional genotype 2B viruses, and the 2 nt deletion at nt 6480–6481 detected in the genome sequence of one of the two genotype 1B viruses was found in the JR sequences determined from four of the five additional genotype 1B viruses. Interestingly, this 2 nt deletion did not co-segregate with the JR sequence of these viruses, as it was present in three members of the five-virus genotype 1B branch and both members of the two-virus genotype 1B branch [marked by an asterisk in Fig. 4(a)]. Evolutionary parameters calculated from these larger and more genotypically representative sequence sets, shown in Table 2, were similar to those calculated by using the smaller genome sequence set.

Fig. 4. Phylogenetic trees based on sequence of genomic regions. (a) Trees were constructed using the sequence of the (i) non-structural protease (NP, nt 3035–3973), (ii) junction region (JR, nt 6351–6829, including the 3' end of the NS-ORF, the UTR between the ORFs and the 5' end of the C gene) and (iii) WHO standard E1 window (nt 8731–9469). On all three trees, genotype 1B viruses are indicated by dots (filled or empty for the clusters of five or two viruses, respectively, on the JR-based tree), and in (ii), genotype 1B viruses with a 2 nt deletion at nt 6480–6481 are indicated by stars. (b) Trees were constructed by using nt 5720–6554 (i) or 6555–6814 (ii), on either side of the putative recombination break point. Differential segregation of two of the genotype 1B viruses is indicated by arrows. All trees were constructed with TREE-PUZZLE (version 5.2; 100 000 puzzle steps). Reliability values are indicated on each node and genetic distance (substitutions in 100 nt) calculated by using the TN93+γ substitution model is indicated by the bar below each tree.

Detection of genomic recombination among genotype 1B viruses
The lack of co-segregation of the genotype 1B JR sequences and the 2 nt deletion led us to hypothesize that a recombinational event had occurred in this region of the genome of these viruses, at or downstream of the deletion. To test this hypothesis, we expanded the sequence determined upstream into the 3' end of the P90 gene (the RdRp domain) and employed several software programs designed to detect recombination events, as well as secondary phylogenetic analysis, to detect putative break points. The RIP program predicted a break point at nt 6555, within the 5' end of the C gene (which begins at nt 6512). As shown in Fig. 4(b), this prediction is supported by trees of sequences up- and downstream of this break point. Similar to the NP tree in Fig. 4(a), all seven genotype 1B sequences form a branch on the tree constructed by using sequences upstream of this break point (nt 5720–6554). On the tree constructed by using sequences downstream of the break point (nt 6555–6814), two sequences (TOM_UNK86 and 0754_GER92) were on a branch distinct from the other genotype 1B sequences, similar to the JR tree in Fig. 4(a). We thus conclude that a recombinational event occurred at or near this site during the evolution of these viruses.

The goal of this study was to expand the rubella virus genomic sequence database to include viruses in the majority of the currently defined genotypes. Whilst ten genomic sequences had been reported previously, only two of the ten currently defined genotypes were represented. This study added nine genomic sequences representing an additional six genotypes, encompassing the most widely divergent genotypes. The most striking finding was the genomic uniformity of rubella viruses, as it was discovered that 78 % of the nucleotides in the genomes of the viruses from the eight genotypes were invariant and these viruses preserved identical genomic dimensions across the two ORFs and two of the three UTRs. Only in the junction region (the UTR between the ORFs) of genotype 2B viruses, which had a 1 nt deletion, and a subset of genotype 1B viruses, which had a 2 nt deletion, was any plasticity observed. Such strict uniformity of genomic topology is highly unusual among RNA viruses (Huang et al., 2004; Kang et al., 2004; Kinney et al., 1998; Saleh et al., 2003; Takahashi et al., 2003; Tarbatt et al., 1997; van Cuyck et al., 2003; Yang et al., 2004). Sequence diversity was also low; across the eight genotypes, the maximum observed distance was <9 % and the maximum calculated genetic distance was 14.8 substitutions in 100 sites. Regardless of this difference, substitution saturation had not occurred, indicating that, despite the limited sequence diversity among rubella viruses, sufficient phylogenetic signal was retained to support the groupings observed (Salemi & Vandamme, 2003; Xia, 2000; Xia & Xie, 2001).

A sequence-similarity profile revealed, with two exceptions, a comparable pattern across the genome with local windows of similarity and dissimilarity varying about a relatively uniform mean, indicating that most genomic regions, including both virion protein and replicase protein genes, were equally divergent. Both observed and genetic distances between these genomic regions were comparable. The two exceptions were both within the P150 gene, with the N-terminal MT domain exhibiting less variability and the internal HVR exhibiting greater variability. Although the MT domain was predicted to encode both methyl- and guanylyltransferase activities (Rozanov et al., 1992; neither activity has been demonstrated experimentally), the fact that 90 % of the nucleotide residues within this region are conserved raises the possibility that this region serves as a CAE in addition to encoding protein sequence. Consistent with this possibility, the phenotype of a cell culture-potentiating mutation discovered at nt 164 of the RA27/3 genome was found to be due to the nucleotide itself rather than to the encoded amino acid (Pugachev et al., 2000). Conservation of the MT domain sequence has also been observed in other alpha-like family viruses (Gouvea et al., 1998). The HVR encodes a proline- and arginine-rich domain of P150 termed the proline hinge (Koonin et al., 1992), although this domain contains several adaptor motifs that could serve to facilitate the association of P150 with other proteins. If this domain serves as a structural hinge between functional domains within the P150 protein, this could explain the lower constraint on sequence conservation within the HVR in comparison with the rest of the genome. On the other hand, Hofmann et al. (2003) reported data suggesting that the HVR among clade 1 viruses was under positive selection at the amino acid level. 
It should be pointed out that hypervariable region is a relative term, in that HVRs in the genomes of other viruses are often more variable than the rubella virus HVR. For example, in the hepatitis E virus HVR, variability is >50 % (Arankalle et al., 1999; Gouvea et al., 1998; Nishizawa et al., 2003; van Cuyck et al., 2003).

With the exception of the HVR, nucleotide residues or sites across the genome showed a strong heterogeneity in rate of divergence, as indicated by the low value of the rate-heterogeneity parameter α. Sequence collections with low α values exhibit an L-shaped distribution on a graph of number of sites versus rate of divergence, rather than the bell-shaped curve generated when α is 1 (or >1). The low α value reflects the fact that roughly 80 % of the residues in the rubella virus genome were invariant in this collection of genomic sequences. The percentage of invariant residues at first and second codon positions was 93 %, compared with 48 % at third codon positions (Y. Zhou, unpublished data), and thus maintenance of amino acid sequence is a substantial component of the conservation of nucleotide sequence. Among third codon positions, the G+C content was 81 mol%, compared with 63 mol% among first and second codon positions (Y. Zhou, unpublished data), and thus there was selection for G and C residues. This selection was also evident in the HVR, within which the G+C content was 81 mol%, compared with 70 mol% for the genome.
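The comparison of invariant residues by codon position can be sketched as follows. This Python example is illustrative only and does not reproduce the unpublished analysis cited above: it assumes an in-frame, gap-free alignment of coding sequences, and the function name and toy data are hypothetical.

```python
def invariant_fraction_by_codon_position(alignment):
    """Fraction of invariant alignment columns at codon positions 1, 2 and 3.

    The alignment must be a list of equal-length, in-frame coding sequences.
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("sequences must be aligned to equal length")
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")
    invariant = {1: 0, 2: 0, 3: 0}
    total = {1: 0, 2: 0, 3: 0}
    for i in range(length):
        position = i % 3 + 1
        column = {s[i] for s in alignment}
        total[position] += 1
        if len(column) == 1:  # every sequence carries the same base here
            invariant[position] += 1
    return {p: invariant[p] / total[p] for p in (1, 2, 3)}


# Toy alignment (hypothetical): all variation falls at third codon positions.
toy_cds = ["ATGGCTAAA", "ATGGCCAAG", "ATGGCAAAA"]
fractions = invariant_fraction_by_codon_position(toy_cds)
```

In the toy alignment, first and second positions are fully invariant whereas only one of three third positions is, mirroring in miniature the contrast reported above (93 % versus 48 %).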

Among the nucleotide substitutions at the 20 % of genomic sites that exhibited variability, transitions were strikingly more abundant than transversions; across the genome, the transition to transversion ratio, K, was 7.0 and varied among genomic regions from 4.5 to 13.4. Thus, the rubella virus genome exhibited the transition over transversion preference that has been well documented in DNA genomes (Meyer et al., 1999; Salemi & Vandamme, 2003). This preference has been attributed to the facts that transitions are more likely to lead to silent mutations in amino acid sequence and that, during replication, it is more likely that a mutation to a nucleotide of equal size (transition) will occur than to a nucleotide of different size (transversion). In RNA genomes, the possibility of both GC and GU pairing would also favour transitions in the replication process. Interestingly, pyrimidine transitions were favoured over purine transitions by a ratio (Y/R) of 2.7 across the entire genome; Y/R varied from 0.9 to 3.7 in genomic regions. Within the HVR, the most variable region of the genome, both K and Y/R were lower than for the entire genome and most of the other genomic regions, indicating that the variability in this region was generated in part by relaxing of the genomic preference for pyrimidine transitions over transversions and purine transitions.

Phylogenetic analysis of rubella viruses has traditionally been done on the basis of E1 gene or subE1 gene sequences and a standard taxonomy was proposed recently, based on a window within the E1 gene, that was substantiated by using complete SP-ORF sequences (WHO, 2005). The second goal of this study was to extend phylogenetic analysis to the 5' region of the genome and we found that generally comparable trees, in terms of both overall variability and phylogenetic clustering, were generated with sequence windows in the NSP-ORF. The exception was a group of seven genotype 1B viruses that formed a branch in a tree based on NP sequence, but formed two branches on the basis of JR sequence. Intriguingly, a deletion in the junction region of five of these seven viruses did not segregate with the two phylogenetic branches on the JR tree. Analysis revealed a recombinational event, putatively near the 5' end of the C gene, that led to the generation of the two branches on the JR tree. There was one previous report of a natural recombination event in Rubella virus (in the E1 gene; Zheng et al., 2003a), but the origin of the recombinant strain was in doubt because one of the parents was related closely to a commonly used laboratory strain. Thus, this was the first conclusive evidence of rubella virus recombination in nature.

Interestingly, the E1 sequences of the seven genotype 1B viruses did not cluster on the E1-based tree and, in comparison with the NP- and JR-based trees, this could be due to divergence or additional recombinational events. As can be seen in the tree in Fig. 1, genotype 1B consists of two sub-branches that would not necessarily appear to be related if fewer sequences were employed (e.g. the E1-based tree in Fig. 4). It is also to be noted that all of the WHO reference strains are on one of these sub-branches. Thus, for this genotype, phylogenetic analysis using sequences from the NSP-ORF region of the genome could be useful in assessing relatedness.

This research was supported by a grant from the National Institutes of Health (AI21389). We thank Xianfeng Chen for software assistance, Duping Zheng, Suganthi Suppiah and Hui Zhao for preliminary sequence determinations and Ping Jiang for processing sequencing reactions and gels.

References

Arankalle, V. A., Paranjape, S., Emerson, S. U., Purcell, R. H. & Walimbe, A. M. (1999). Phylogenetic analysis of hepatitis E virus isolates from India (19761993). J Gen Virol 80, 16911700.[Abstract]

Bosma, T. J., Best, J. M., Corbett, K. M., Banatvala, J. E. & Starkey, W. G. (1996). Nucleotide sequence analysis of a major antigenic domain of the E1 glycoprotein of 22 rubella virus isolates. J Gen Virol 77, 25232530.[Abstract/Free Full Text]

Chantler, J. K., Wolinsky, J. S. & Tingle, A. (2001). Rubella virus. In Fields Virology, 4th edn, pp. 963990. Edited by D. M. Knipe & P. M. Howley. Philadelphia, PA: Lippincott Williams & Wilkins.

Clarke, D. M., Loo, T. W., Hui, I., Chong, P. & Gillam, S. (1987). Nucleotide sequence and in vitro expression of rubella virus 24S subgenomic messenger RNA encoding the structural proteins E1, E2 and C. Nucleic Acids Res 15, 30413057.[Abstract/Free Full Text]

Dominguez, G., Wang, C. Y. & Frey, T. K. (1990). Sequence of the genome RNA of rubella virus: evidence for genetic rearrangement during togavirus evolution. Virology 177, 225238.[CrossRef][Medline]

Donadio, F. F., Siqueira, M. M., Vyse, A., Jin, L. & Oliveira, S. A. (2003). The genomic analysis of rubella virus detected from outbreak and sporadic cases in Rio de Janeiro state, Brazil. J Clin Virol 27, 205209.[CrossRef][Medline]

Frey, T. K. (1994). Molecular biology of rubella virus. Adv Virus Res 44, 69-160.

Frey, T. K., Abernathy, E. S., Bosma, T. J., Starkey, W. G., Corbett, K. M., Best, J. M., Katow, S. & Weaver, S. C. (1998). Molecular analysis of rubella virus epidemiology across three continents, North America, Europe, and Asia, 1961-1997. J Infect Dis 178, 642-650.

Gouvea, V., Snellings, N., Popek, M. J., Longer, C. F. & Innis, B. L. (1998). Hepatitis E virus: complete genome sequence and phylogenetic analysis of a Nepali isolate. Virus Res 57, 21-26.

Henikoff, S. & Henikoff, J. G. (1994). Position-based sequence weights. J Mol Biol 243, 574-578.

Hofmann, J., Renz, M., Meyer, S., von Haeseler, A. & Liebert, U. G. (2003). Phylogenetic analysis of rubella virus including new genotype I isolates. Virus Res 96, 123-128.

Huang, F. F., Sun, Z. F., Emerson, S. U., Purcell, R. H., Shivaprasad, H. L., Pierson, F. W., Toth, T. E. & Meng, X. J. (2004). Determination and analysis of the complete genomic sequence of avian hepatitis E virus (avian HEV) and attempts to infect rhesus monkeys with avian HEV. J Gen Virol 85, 1609-1618.

Icenogle, J. P., Frey, T. K., Abernathy, E., Reef, S. E., Schnurr, D. & Stewart, J. A. (2006). Genetic analysis of rubella viruses found in the United States between 1966 and 2004: evidence that indigenous rubella viruses have been eliminated. Clin Infect Dis 43 (Suppl. 3), S133-S140.

Kakizawa, J., Nitta, Y., Yamashita, T., Ushijima, H. & Katow, S. (2001). Mutations of rubella virus vaccine TO-336 strain occurred in the attenuation process of wild progenitor virus. Vaccine 19, 2793-2802.

Kang, S. Y., Yun, S. I., Park, H. S., Park, C. K., Choi, H. S. & Lee, Y. M. (2004). Molecular characterization of PL97-1, the first Korean isolate of the porcine reproductive and respiratory syndrome virus. Virus Res 104, 165-179.

Katow, S. (2004). Molecular epidemiology of rubella virus in Asia: utility for reduction in the burden of diseases due to congenital rubella syndrome. Pediatr Int 46, 207-213.

Katow, S., Minahara, H., Fukushima, M. & Yamaguchi, Y. (1997a). Molecular epidemiology of rubella by nucleotide sequences of the rubella virus E1 gene in three East Asian countries. J Infect Dis 176, 602-616.

Katow, S., Minahara, H., Ota, T. & Fukushima, M. (1997b). Identification of strain-specific nucleotide sequences in E1 and NS4 genes of rubella virus vaccine strains in Japan. Vaccine 15, 1579-1585.

Kinney, R. M., Pfeffer, M., Tsuchiya, K. R., Chang, G. J. & Roehrig, J. T. (1998). Nucleotide sequences of the 26S mRNAs of the viruses defining the Venezuelan equine encephalitis antigenic complex. Am J Trop Med Hyg 59, 952-964.

Koonin, E. V., Gorbalenya, A. E., Purdy, M. A., Rozanov, M. N., Reyes, G. R. & Bradley, D. W. (1992). Computer-assisted assignment of functional domains in the nonstructural polyprotein of hepatitis E virus: delineation of an additional group of positive-strand RNA plant and animal viruses. Proc Natl Acad Sci U S A 89, 8259-8263.

Lund, K. D. & Chantler, J. K. (2000). Mapping of genetic determinants of rubella virus associated with growth in joint tissue. J Virol 74, 796-804.

Meyer, S., Weiss, G. & von Haeseler, A. (1999). Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 152, 1103-1110.

Milne, I., Wright, F., Rowe, G., Marshall, D. F., Husmeier, D. & McGuire, G. (2004). TOPALi: software for automatic identification of recombinant sequences within DNA multiple alignments. Bioinformatics 20, 1806-1807.

Nishizawa, T., Takahashi, M., Mizuo, H., Miyajima, H., Gotanda, Y. & Okamoto, H. (2003). Characterization of Japanese swine and human hepatitis E virus isolates of genotype IV with 99 % identity over the entire genome. J Gen Virol 84, 1245-1251.

Pugachev, K. V., Abernathy, E. S. & Frey, T. K. (1997). Genomic sequence of the RA27/3 vaccine strain of rubella virus. Arch Virol 142, 1165-1180.

Pugachev, K. V., Galinski, M. S. & Frey, T. K. (2000). Infectious cDNA clone of the RA27/3 vaccine strain of Rubella virus. Virology 273, 189-197.

Reef, S. E., Frey, T. K., Theall, K., Abernathy, E., Burnett, C. L., Icenogle, J., McCauley, M. M. & Wharton, M. (2002). The changing epidemiology of rubella in the 1990s: on the verge of elimination and new challenges for control and prevention. JAMA 287, 464-472.

Robertson, S. E., Featherstone, D. A., Gacic-Dobo, M. & Hersh, B. S. (2003). Rubella and congenital rubella syndrome: global update. Rev Panam Salud Publica 14, 306-315.

Rozanov, M. N., Koonin, E. V. & Gorbalenya, A. E. (1992). Conservation of the putative methyltransferase domain: a hallmark of the Sindbis-like supergroup of positive-strand RNA viruses. J Gen Virol 73, 2129-2134.

Saitoh, M., Shinkawa, N., Shimada, S., Segawa, Y., Sadamasu, K., Hasegawa, M., Kato, M., Kozawa, K., Kuramoto, T. & other authors (2006). Phylogenetic analysis of envelope glycoprotein (E1) gene of rubella viruses prevalent in Japan in 2004. Microbiol Immunol 50, 179-185.

Saleh, S. M., Poidinger, M., Mackenzie, J. S., Broom, A. K., Lindsay, M. D. & Hall, R. A. (2003). Complete genomic sequence of the Australian south-west genotype of Sindbis virus: comparisons with other Sindbis strains and identification of a unique deletion in the 3'-untranslated region. Virus Genes 26, 317-327.

Salemi, M. & Vandamme, A.-M. (2003). The Phylogenetic Handbook: a Practical Approach to DNA and Protein Phylogeny. Cambridge: Cambridge University Press.

Strimmer, K. & von Haeseler, A. (1996). Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964-969.

Takahashi, K., Kang, J. H., Ohnishi, S., Hino, K., Miyakawa, H., Miyakawa, Y., Maekubo, H. & Mishiro, S. (2003). Full-length sequences of six hepatitis E virus isolates of genotypes III and IV from patients with sporadic acute or fulminant hepatitis in Japan. Intervirology 46, 308-318.

Tamura, K. & Nei, M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 10, 512-526.

Tarbatt, C. J., Glasgow, G. M., Mooney, D. A., Sheahan, B. J. & Atkins, G. J. (1997). Sequence analysis of the avirulent, demyelinating A7 strain of Semliki Forest virus. J Gen Virol 78, 1551-1557.

van Cuyck, H., Juge, F. & Roques, P. (2003). Phylogenetic analysis of the first complete hepatitis E virus (HEV) genome from Africa. FEMS Immunol Med Microbiol 39, 133-139.

WHO (2005). Standardization of the nomenclature for genetic characteristics of wild-type rubella viruses. Wkly Epidemiol Rec 80, 126-132.

Xia, X. (2000). Data Analysis in Molecular Biology and Evolution. Boston: Kluwer Academic Publishers.

Xia, X. & Xie, Z. (2001). DAMBE: software package for data analysis in molecular biology and evolution. J Hered 92, 371-373.

Yang, Z. (1994). Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol 39, 306-314.

Yang, D. K., Kim, B. H., Kweon, C. H., Kwon, J. H., Lim, S. I. & Han, H. R. (2004). Molecular characterization of full-length genome of Japanese encephalitis virus (KV1899) isolated from pigs in Korea. J Vet Sci 5, 197-205.

Zheng, D. P., Frey, T. K., Icenogle, J., Katow, S., Abernathy, E. S., Song, K. J., Xu, W. B., Yarulin, V., Desjatskova, R. G. & other authors (2003a). Global distribution of rubella virus genotypes. Emerg Infect Dis 9, 1523-1530.

Zheng, D. P., Zhou, Y. M., Zhao, K., Han, Y. R. & Frey, T. K. (2003b). Characterization of genotype II rubella virus strains. Arch Virol 148, 1835-1850.

Zheng, D. P., Zhu, H., Revello, M. G., Gerna, G. & Frey, T. K. (2003c). Phylogenetic analysis of rubella virus isolated during a period of epidemic transmission in Italy, 1991-1997. J Infect Dis 187, 1587-1597.

Received 24 August 2006; accepted 20 November 2006.