Screening genomes of Gram-positive bacteria for double-glycine-motif-containing peptides

Secreted peptides fulfil major functions in the physiology of eukaryotes as well as prokaryotes. Yet in many genome-sequencing projects, small peptides either remain unannotated or are classified as hypothetical open reading frames with no associated function. Identifying signal peptide sequences would therefore help in finding novel peptide genes and in assigning functions to automatically annotated sequences.

In Gram-positive bacteria, the double-glycine (GG) motif plays a key role in many peptide secretion systems involved in quorum sensing and bacteriocin production. Competence-stimulating peptides and class II bacteriocins, produced by streptococci and lactic acid bacteria, respectively, are generally synthesized as inactive prepeptides containing a conserved GG-type leader sequence. This leader sequence, typically between 15 and 30 aa in length, is recognized and proteolytically removed during secretion by its cognate ABC-transporter, resulting in the release and activation of the peptide. Processed peptides vary in length from 17 to over 80 aa. The GG-type leader sequence is well conserved and has the following consensus: LSX₂ELX₂IXGG (Havarstein et al., 1994). Apart from this conserved leader sequence, GG-motif-containing peptides share no common sequence similarities. Their cognate transporters contain a specific domain of about 150 aa that is responsible for the proteolytic removal of the GG-type leader peptide and that, on the basis of its sequence, has been classified as the Peptidase C39 protein family domain. The Peptidase C39 domain contains two conserved motifs, called the cysteine and the histidine motifs (Havarstein et al., 1995).
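To make the consensus concrete, the following minimal sketch reads LSX₂ELX₂IXGG literally as a regular expression, with X standing for any residue, and splits a prepeptide into leader and mature parts at the end of the match. It is an illustration only: natural leaders deviate from the consensus, so a strict pattern of this kind would miss many genuine peptides. The function name `split_at_gg_leader`, the `max_leader` default of 30 (taken from the typical leader length quoted above) and the example sequence are assumptions made for this sketch.

```python
import re

# Literal reading of the consensus LSX2ELX2IXGG, where X is any residue.
# Assumption: real leaders are more variable than this strict pattern allows.
GG_CONSENSUS = re.compile(r"LS..EL..I.GG")


def split_at_gg_leader(prepeptide, max_leader=30):
    """Split a prepeptide after the first consensus match.

    Returns (leader, mature) if the match ends within the first
    `max_leader` residues, otherwise None.
    """
    match = GG_CONSENSUS.search(prepeptide)
    if match is None or match.end() > max_leader:
        return None
    cleavage_site = match.end()  # cleavage occurs directly after the GG pair
    return prepeptide[:cleavage_site], prepeptide[cleavage_site:]


# Hypothetical prepeptide, invented for illustration only.
example = "MNTKLSKEELKKIVGGKYYGNGVTC"
print(split_at_gg_leader(example))
```

A sequence without a match, or with a match ending beyond `max_leader`, yields `None`.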

Our aim was to detect new GG-motif-containing peptides in the fully sequenced genomes of Gram-positive bacteria. Since many peptides containing such a motif are small, many of them have probably not been annotated in genome-sequencing projects or have not been recognized as secreted peptides. Therefore, an in silico strategy was designed and applied at the nucleotide level. The 45 fully sequenced genomes of Gram-positive bacteria [situation on 15 September 2003; for a complete list, see Dirix et al. (2004)] were screened both for the presence of GG-motifs and for Peptidase C39 domains. For the latter screening, a motif was available (Pfam accession number PF03412) (Bateman et al., 2002); for the GG-motif search, a new model was built based on already known GG-motif peptides (Dirix et al., 2004; Michiels et al., 2001). Based on our knowledge of characterized GG-motif-containing peptides, several restrictions were imposed on the GG-motif candidate genes. First, the GG-motif was required to end with a Gly-Gly or a Gly-Ala pair. Secondly, only those peptides were selected whose coding region was located less than 10 kb from the coding region of a Peptidase C39 domain-containing gene. Finally, the length of the leader sequence and the total peptide length were limited to a maximum of 30 and 150 aa, respectively. Because of these restrictions, we cannot exclude the possibility that some GG-motif peptides were missed during the screening process.
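The three restrictions can be expressed as a simple filter, sketched below. This is not the pipeline used in the study, which relied on a purpose-built GG-motif model and a Pfam domain search; it only shows how the stated cut-offs combine. The names `Gene`, `gap_bp` and `passes_restrictions` are hypothetical, coordinates are assumed to be 1-based and inclusive on a linear sequence (wrap-around on circular chromosomes is ignored), and the distance between two coding regions is assumed to mean the number of bases separating them.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def gap_bp(a, b):
    """Bases separating two coding regions; 0 if they touch or overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def passes_restrictions(leader, peptide, peptide_gene, c39_genes,
                        max_leader=30, max_total=150, max_gap=10_000):
    """Apply the three restrictions to one candidate.

    `peptide` is the full prepeptide (leader included); `c39_genes` lists
    the Peptidase C39 domain-containing genes of the same genome.
    """
    # 1. The leader must end with a Gly-Gly or Gly-Ala pair.
    if not leader.endswith(("GG", "GA")):
        return False
    # 2. Length limits on the leader and on the whole peptide.
    if len(leader) > max_leader or len(peptide) > max_total:
        return False
    # 3. A C39 gene must lie less than 10 kb away.
    return any(gap_bp(peptide_gene, g) < max_gap for g in c39_genes)


# Hypothetical candidate and coordinates, invented for illustration.
leader = "MNTKLSKEELKKIVGG"
peptide = leader + "KYYGNGVTC"
print(passes_restrictions(leader, peptide, Gene(5000, 5077), [Gene(7000, 9100)]))
```

Loosening any of the keyword defaults shows how sensitive the candidate list is to each cut-off.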

A search for the Peptidase C39 domain in the 45 fully sequenced Gram-positive genomes resulted in a total of 29 hits. These hits were found in the genera Bacillus, Clostridium, Enterococcus, Lactobacillus, Lactococcus, Mycoplasma, Streptococcus, Streptomyces and Ureaplasma, but not in the genera Bifidobacterium, Corynebacterium, Deinococcus, Listeria, Mycobacterium, Oceanobacillus, Staphylococcus or Tropheryma. Interestingly, all of the screened lactic acid bacteria, with the exception of Streptococcus agalactiae (strains 2603V/R and NEM316) and Bifidobacterium longum NCC2705, contain a Peptidase C39 domain. In several strains belonging to the genera Streptococcus and Enterococcus, more than one protein containing the C39 domain was found. Apart from two protein hits that are truncated in their Peptidase C39 domain, all hits contain the conserved cysteine and histidine motifs involved in GG-motif recognition and peptidase activity (Havarstein et al., 1995), suggesting that those domains have peptidase activity.

The screening for peptides containing a GG-motif resulted in a total of 48 candidate peptides. Although only 12 of the 45 screened bacterial genomes were from lactic acid bacteria, 92 % of all GG-motif-containing hits were found in lactic acid bacteria (80 % of which belong to streptococcal strains). The size of the peptides ranges from 29 to 126 aa, or in the mature form (i.e. without the leader peptide) from 11 to 103 aa. A list of the possible GG-motif-containing peptides, their cognate transport proteins, their length, amino acid context, theoretical pI and molecular mass is given in Table 1. If available, the gene name of the GG-peptide-encoding sequence was taken from the genome annotation data and included in Table 1. Sixty-seven per cent of the candidate peptides have a high glycine content (>10 % Gly), whereas in 63 % of the peptides more than half of the amino acids are hydrophobic. Also, half of the hits have two or more cysteine residues and, in 56 % of the peptides, the theoretical pI is higher than 8. These data are consistent with the properties of previously described GG-motif-containing peptides (Ennahar et al., 2000; Jack et al., 1995). Among the 48 hits, three were not annotated in the corresponding genome-sequencing project. Seventeen hits, annotated as hypothetical proteins, did not display similarity to any known protein or peptide. The remaining hits are bacteriocins (n=15) or bacteriocin homologues (n=10), a conserved domain protein (n=1), a plantaricin biosynthesis protein (n=1) and a phage-related protein (n=1).
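The composition criteria mentioned above (glycine content, hydrophobic fraction and cysteine count) are straightforward to compute per sequence, as the following sketch shows. The text does not state which residues were counted as hydrophobic, so the set used here (A, V, L, I, M, F, W) is an assumption, as are the function name `composition_summary` and the example sequence. Theoretical pI and molecular mass are omitted because they depend on the pK and mass tables chosen.

```python
# Assumption: one common choice of hydrophobic residues; the source does
# not specify which set was used.
HYDROPHOBIC = set("AVLIMFW")


def composition_summary(sequence):
    """Return simple composition statistics for a peptide sequence."""
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    glycine_fraction = sequence.count("G") / n
    hydrophobic_fraction = sum(res in HYDROPHOBIC for res in sequence) / n
    cysteines = sequence.count("C")
    return {
        "glycine_fraction": glycine_fraction,
        "hydrophobic_fraction": hydrophobic_fraction,
        "cysteines": cysteines,
        # Thresholds as quoted in the text.
        "high_glycine": glycine_fraction > 0.10,
        "mostly_hydrophobic": hydrophobic_fraction > 0.5,
        "two_or_more_cysteines": cysteines >= 2,
    }


# Hypothetical sequence, invented for illustration only.
print(composition_summary("GGCCAVLIKD"))
```

Whether such statistics are computed on the prepeptide or on the mature peptide changes the values, so the choice should be stated alongside any reported percentages.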
