Jong's Ph.D. thesis Summary
Complete genomes can provide bioinformaticians with unique opportunities to analyse a large number of sequences in a genomic context.
We investigated the evolutionary relationships of protein sequences by examining the level of gene duplication with sequence search algorithms.
The degree of duplication was explored for each genome individually and for all six genomes together. An all-against-all comparison of the 12,013 putative protein sequences from six genomes, of which two were from relatives of Gram-positive bacteria, three from Gram-negative bacteria and one from an archaebacterium, showed that the proportion of sequences produced by duplication is 22% for the smallest and 44% for the largest genome.
Two thirds of the sequences, including half of those in the archaebacterium, have one or more inter-genome matches. Pairwise sequence comparisons underestimate the true extent of evolutionary relationships. Thus the results show that bacterial genomes are made up, in large part, of sequences drawn from a relatively small common repertoire of gene families.
For the study, we used well-established sequence search methods such as SSEARCH and FASTA. However, the sensitivity of those methods was less than 16% in a test done with protein sequences of known structure. To raise the sensitivity, a new method called the intermediate sequence search procedure was devised, and we have shown that its sensitivity at a 1% error rate was 25% (a 70% increase over a single pairwise search by FASTA). The idea of using intermediate sequences is not new.
Profile, template and hidden Markov models also utilise information derived from additional sequences, as intermediate sequence search does. However, all the profile-related approaches rely on multiple sequence alignment programs, and there is no accurate alignment algorithm for very distant sequences. Erroneous alignments, and the profiles and models built from them, result in a decrease in sensitivity.
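The general idea behind an intermediate sequence search can be illustrated with a minimal sketch. It assumes that the significant direct pairwise hits (for example, from FASTA runs filtered at some threshold) are already available as pairs of sequence identifiers. The function name and the input format are hypothetical, and the sketch is not the procedure implemented in the thesis: it only shows how two sequences with no direct match can be related through a third sequence that matches both.

```python
from collections import defaultdict


def intermediate_matches(direct_hits):
    """Infer pairs related through exactly one intermediate sequence.

    direct_hits: iterable of (id_a, id_b) pairs that matched directly.
    Returns a set of (id_a, id_c) pairs, with id_a < id_c, that have no
    direct hit but share at least one common direct neighbour.
    """
    # Build a symmetric neighbour table from the direct hits.
    neighbours = defaultdict(set)
    for a, b in direct_hits:
        neighbours[a].add(b)
        neighbours[b].add(a)
    neighbours = dict(neighbours)  # freeze: no accidental key creation

    inferred = set()
    for linked in neighbours.values():
        # Every pair of sequences linked to the same intermediate is a
        # candidate, unless the pair already matches directly.
        for a in linked:
            for c in linked:
                if a < c and c not in neighbours[a]:
                    inferred.add((a, c))
    return inferred


if __name__ == "__main__":
    hits = [("A", "B"), ("B", "C"), ("C", "D")]
    print(sorted(intermediate_matches(hits)))
```

In practice the thresholds applied to each direct hit determine how many false relationships are introduced through the intermediate, so any real use would need to control the error rate on a test set of sequences with known relationships.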
One reason why distant sequences are difficult to align is that when their compositions are similar, the discrimination ability of the programs drops. Exploiting this observation, we developed a program (SC_rate.pl) that tries to predict the reliability of sequence alignments by comparing the sequence and composition identities. It provides a more systematic and objective judgement when dealing with twilight zone sequences.
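To make the two quantities concrete, the following sketch computes a sequence identity and a composition identity for a pair of aligned sequences. The definitions used here are assumptions chosen for illustration: sequence identity is the fraction of ungapped alignment columns with identical residues, and composition identity is the overlap of the two residue frequency distributions. The summary does not state how SC_rate.pl defines or combines these values, so this is not a reimplementation of that program.

```python
from collections import Counter


def sequence_identity(aligned_a, aligned_b):
    """Fraction of ungapped alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def composition_identity(seq_a, seq_b):
    """Overlap of residue frequencies (assumed definition, range 0 to 1)."""
    a = seq_a.replace("-", "")
    b = seq_b.replace("-", "")
    if not a or not b:
        return 0.0
    count_a, count_b = Counter(a), Counter(b)
    residues = set(count_a) | set(count_b)
    # For each residue type, take the smaller of the two frequencies.
    return sum(min(count_a[r] / len(a), count_b[r] / len(b))
               for r in residues)


if __name__ == "__main__":
    a, b = "AACC", "CCAA"
    print(sequence_identity(a, b), composition_identity(a, b))
```

The example pair has identical composition but no identical aligned positions, which is the kind of case where composition alone could make an alignment look more meaningful than it is.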
After the initial sequence comparison, all the genome sequences were clustered. However, as a significant portion of protein sequences are multidomain, simple linkage generated wrong clusters. To tackle this, it was necessary to break down the simple linkage clusters according to their domains. An algorithm was developed for this purpose. It tracks the linkage within the original simple linkage clusters and splits a cluster where the linkage is broken, producing smaller clusters.
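The problem with simple linkage on multidomain proteins can be shown with a small sketch. It assumes each hit records the matched region on both sequences as (sequence, (start, end), sequence, (start, end)); this input format and all function names are hypothetical. The region-based variant is one simple way to keep unrelated domains apart and is not the subclustering algorithm developed in the thesis, whose details are not given in this summary.

```python
def find(parent, x):
    """Return the representative of x, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(nodes, edges):
    """Single-linkage clusters: connected components of the hit graph."""
    parent = {n: n for n in nodes}
    for a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for n in nodes:
        groups.setdefault(find(parent, n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


def whole_sequence_clusters(hits):
    """Simple linkage on whole sequences, ignoring where the match lies."""
    nodes = {h[0] for h in hits} | {h[2] for h in hits}
    return cluster(nodes, [(h[0], h[2]) for h in hits])


def region_clusters(hits):
    """Linkage on matched regions; regions of one sequence are joined
    only when they overlap, so separate domains stay separate."""
    regions, edges = set(), []
    for seq_a, span_a, seq_b, span_b in hits:
        region_a, region_b = (seq_a, span_a), (seq_b, span_b)
        regions.update([region_a, region_b])
        edges.append((region_a, region_b))
    ordered = sorted(regions)
    for i, (s1, (b1, e1)) in enumerate(ordered):
        for s2, (b2, e2) in ordered[i + 1:]:
            if s1 == s2 and b1 <= e2 and b2 <= e1:
                edges.append(((s1, (b1, e1)), (s2, (b2, e2))))
    return cluster(regions, edges)


if __name__ == "__main__":
    # M is a two-domain protein; P1 and P2 each match a different domain.
    hits = [("P1", (1, 100), "M", (1, 100)),
            ("M", (150, 250), "P2", (1, 100))]
    print(whole_sequence_clusters(hits))
    print(region_clusters(hits))
```

With whole sequences as nodes, the unrelated proteins P1 and P2 end up in one cluster through M. With matched regions as nodes, the two domains of M fall into separate clusters.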
By this subclustering, it was possible to calculate more accurate duplication levels in genomes.