Systematic comparisons of genes, proteins and their interactions between different organisms are beginning to yield insights into the generation of diversity. At the sequence level, global studies are beginning to reveal a wide spectrum of sequence homology. At one extreme, sequences may provide such fundamental biological functionality (e.g. histones) that they are highly conserved across a majority of organisms. At the other end of the spectrum, sequences may be specific to a single species, imparting unique properties to that organism. The identification of these relationships will improve our understanding of how organisms have adapted and evolved to fill particular biological niches. Our lab is particularly interested in applying such findings to the study of infectious diseases: How do new pathogens arise? What adaptations mediate the survival of parasites within their human hosts? Can we exploit this knowledge for the design of novel therapeutics?
To aid in this research, we have developed a number of published tools and databases which are now being implemented as standards by other international genomic projects (e.g. the Natural Environment Research Council, UK environmental genomics initiative). We have continued to develop these tools, the culmination of which has been the creation of a centralized database resource, PartiGeneDB (http://www.partigenedb.org). With representation from over 450 species, it provides the most comprehensive source of genes associated with eukaryotes. In collaboration with groups from Edinburgh and the Genome Sequencing Center in St. Louis, we have exploited these data to complete a global survey of the genes identified by the parasitic nematode EST project, the largest comparative investigation of a single phylum. The study revealed a spectrum of sequence diversity, from species-specific to pan-nematode and beyond. 4,228 nematode-specific protein families were identified, which likely underpin species- and higher-level taxonomic disparity and provide a valuable source of novel drug and vaccine targets. Building on these analyses, we have performed a more comprehensive study of genetic diversity and metabolic networks across the three domains of life (see Figure 1).

Figure 1. Taxonomic distribution of sequences from eukaryotic partial genomes. On the basis of its phylogenetic profile, each sequence from each partial genome is assigned to a single evolutionary group. A schematic detailing the phylogenetic relationships of the defined eukaryotic groups is provided in the lower left of the figure. For each taxonomic group, the numbers represent: the number of genomes analysed (white text on black); the percentage of sequences which are species specific (black text on white); the percentage of sequences which are taxon specific, i.e. which share sequence similarity only with sequence(s) from a species in the same taxon (light gray background); and the total number of sequences (blue or orange text). Numbers in dark gray boxes indicate the percentage of sequences with similarity to sequence(s) from the neighboring taxon, but not to any other taxon, and may thus represent lineage-specific sequences. The numbers in the triangle represent the percentage of sequences from each of three major taxonomic groups (protists, plants and fungi/metazoa) with sequence similarity to each of the other groups. The numbers in the middle of the triangle indicate the percentage of genes from each group (protists, fungi/metazoa, plants, top to bottom) which have sequence similarity to both of the other two groups.
Comparisons with prokaryotic datasets reveal that the rate of new sequence discovery in the eukaryotic datasets is much greater, suggesting a higher level of genetic diversity. Mapping this diversity within a phylogenetic perspective (Figure 1) reveals that the majority (40-60%) of eukaryotic sequences are specific to individual or closely related species. On the other hand, ~20% of eukaryotic sequences are highly conserved and are associated with basic housekeeping functions. Between these two extremes, several evolutionary ‘hotspots’ consisting of large numbers of sequences conserved within specific taxonomic groups were identified. For example, 8% of sequences derived from metazoan species are specific to, and conserved within, the metazoan lineage. These sequences likely underlie metazoan-specific processes such as cell-cell communication and cell differentiation. This is the first study to capitalize on partial genome datasets to perform a detailed exploration of sequence diversity across a broad sample of taxonomic groups within Eukarya.
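The assignment of each sequence to a single evolutionary group, as described in the Figure 1 caption, can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names (`classify`, `summarize`), the category labels, the `NEIGHBOR` table and the example species are all assumptions made for illustration, and the similarity search that would produce the list of hits is not shown.

```python
from collections import Counter

# Hypothetical table of neighboring taxa (illustration only; the real
# relationships are those shown in the Figure 1 schematic).
NEIGHBOR = {"Nematoda": "Arthropoda", "Arthropoda": "Nematoda"}


def classify(species, taxon, hits, neighbor=NEIGHBOR):
    """Assign one sequence to an evolutionary group from its phylogenetic profile.

    species, taxon: the source organism of the query sequence and its taxon.
    hits: iterable of (species, taxon) pairs for every sequence judged
          significantly similar to the query (cutoff chosen upstream).
    """
    # Ignore hits within the query's own species.
    other = {(s, t) for s, t in hits if s != species}
    if not other:
        return "species-specific"
    taxa = {t for _, t in other}
    if taxa == {taxon}:
        return "taxon-specific"
    # Similar to the neighboring taxon (and possibly its own), nothing else.
    if taxa <= {taxon, neighbor.get(taxon)}:
        return "lineage-specific"
    return "broadly conserved"


def summarize(labels):
    """Return the percentage of sequences falling in each group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}


if __name__ == "__main__":
    profiles = [
        [],                                   # no similarity anywhere
        [("B. malayi", "Nematoda")],          # same taxon only
        [("D. melanogaster", "Arthropoda")],  # neighboring taxon only
        [("A. thaliana", "Viridiplantae")],   # distant taxon
    ]
    labels = [classify("C. elegans", "Nematoda", p) for p in profiles]
    print(summarize(labels))
```

Testing the categories from most to least restrictive means every sequence receives exactly one label, which is what allows the per-group percentages to be compared directly.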
Future studies are aimed at further exploiting these datasets, together with sequence data from organisms with fully sequenced genomes, in the following contexts:
- Deriving robust phylogenies and molecular clocks (in collaboration with Dr Davide Pisani, National University of Ireland)
- Exploring the diversity and specificity of protein domains
- Identifying novel parasitic adaptations (in collaboration with Dr Mike Grigg, NIH)
- Mapping the evolution of protein-interaction networks (in collaboration with other researchers within Toronto)



