Introduction

Comparative genome analysis is a powerful method for aiding gene identification, inferring the function of a gene's product, and identifying novel functional elements such as those involved in transcriptional control. To identify the functionally important units it may be necessary to compare genome sequences from a variety of organisms, although any organism-specific features will not be detected by this strategy. More distantly related organisms are likely to show sequence conservation in coding regions alone; this may also be the case for distantly related vertebrates such as fish and human. More closely related organisms, such as two mammals or two species of worm, are likely to be conserved not only in coding regions but also in other functional elements such as regulatory sequences. However, the closer the evolutionary relationship between the two organisms being compared, the more "sequence noise" is likely to arise, where non-functional sequence appears similar because insufficient time has elapsed for the two sequences to diverge. For very closely related organisms, e.g. human and chimp, it is the differences rather than the similarities that become informative. The most extreme example of this is human sequence variation.

Comparative Analysis of Genes

Orthologous genes are homologous genes in different organisms that derive from the same ancestral gene and were separated by speciation. When inferring the function of one gene from the function of a predicted orthologue, it is important to distinguish, where possible, between:

  1. Genes that are true functional orthologues, i.e. genes that have the same function in two or more organisms.
  2. Genes that were originally derived from the same gene in a common ancestor, but have evolved different functions during subsequent independent evolution.
  3. Genes that have duplicated within a single species (paralogues), where at least one copy may have evolved a novel function.

When analysing potential orthologous sequences, it is important to be confident that you are dealing with true orthologues. The following information can be used to assess this:

  1. Percentage identity at both the nucleotide and protein level
  2. Exon/Intron structure comparison
  3. Positional information (analysis of neighbouring genes)
  4. Evolutionary analysis of nucleotide/protein sequence (phylogeny)
  5. Absence of any other similar sequence in the rest of the genome
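The first criterion in the list above, percentage identity, can be sketched in a few lines of Python. This is a minimal illustration rather than the method used by any particular tool: the function name `percent_identity`, the choice to ignore gapped columns, and the two short sequences are all assumptions made for the example. Different programs use different denominators (alignment length, shorter sequence length, and so on), so reported values can differ.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap.

    Both inputs must come from the same pairwise alignment, so they have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Made-up aligned fragments, for illustration only
mouse = "ATGGC-TTAGC"
human = "ATGGCATTGGC"
print(f"{percent_identity(mouse, human):.1f}% identity")
```

The same function works for aligned protein sequences, which is useful because the list asks for identity at both the nucleotide and the protein level.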

Comparative Genome Analysis

When two species diverge from a common ancestor, those sequences that maintain their original function are likely to remain conserved in both species throughout their subsequent independent evolution. Comparing sequences in different species is therefore a powerful tool for increasing confidence in a predicted functional unit, or for identifying novel functional units.

Click here for a table showing the major genomes sequenced or in progress, their web addresses, and useful sites for viewing the data. This table is not exhaustive. Check Ensembl, UCSC and NCBI first to see what data they provide, as these three sites make data available in a uniform and linked fashion.

The whole-genome shotgun (WGS) method is being used to generate large numbers of sequences for many genomes, and these unassembled sequence traces are made available through two trace archives: the Ensembl trace archive (http://trace.ensembl.org) and the NCBI trace archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?).

Aims

  • Review the information available from different organisms
  • Identify potential orthologues
  • Compare genome sequence from different organisms and identify conserved sequences

Exercises - Comparing Genomic DNA

Exercise 1 - Comparative gene analysis: Using the mouse Hoxc9 gene in Ensembl (or your favourite gene)

  1. Identify all potential paralogues. Which chromosomes are they located on?
  2. Identify all potential orthologues. What is the total number of sequences in the "family"?
  3. Export the sequences for further analysis (e.g. phylogenetic analysis)
  4. View the multiple sequence alignment in Jalview

Exercise 2 - Comparative genome analysis: Generate genomic sequence alignments using PIP and VISTA to look for conserved features.

Using the mouse Cpeb2 gene:

  • View the comparative sequence analysis shown in the UCSC genome browser

Now try the process yourself, using the mouse Cpeb2 gene in Ensembl (save the exported sequence files as plain text files, not Word files):

  1. Export the cDNA sequence
  2. Export the genomic sequence including 5 kb of upstream and downstream sequence
  3. Export the equivalent sequence in human (include 5 kb upstream and downstream)
  4. Align the mouse genomic and cDNA sequence using SIM4 to generate an "annotation file" for VISTA and PIP
  5. For PIP you may want to repeat-mask your mouse sequence. In this case, the mouse genomic sequence can be exported from UCSC (if your sequence is not in a genome browser, you can repeat-mask it yourself at the RepeatMasker website).
  6. Also for PIP, if you want colour, a "sequence underlay" file is needed. Use the "annotation file" as the basis for this.
  7. You should now have all the information required to run PIP and VISTA (mVISTA), so have a go. Any errors you have made will be reported when you try to submit the job. If the job is successful, the results are emailed to you, but results files have also been generated in advance for you to view.
  8. For VISTA, now turn on the regulatory element section and run again.
  9. If all that is not enough, go back and export the equivalent genomic sequence for a third organism (e.g. zebrafish, chimp, worm) and re-run PIP or VISTA.
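
To get a feel for what the plots in this exercise are showing, the sketch below scores percent identity in a sliding window along a pairwise alignment and reports the windows that pass a threshold. It is a simplified illustration only and is not the algorithm used by PIP or VISTA. The function names, the window and step sizes, the 70% threshold, the decision to count gapped columns as mismatches, and the toy sequences are all assumptions made for the example.

```python
def window_identity(aligned_a, aligned_b, window=100, step=20, gap="-"):
    """Return (start, percent_identity) for each window along a pairwise alignment.

    Assumption: a column scores 1 only if both residues are identical and not gaps,
    so gapped columns count as mismatches within a window.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aligned_a)
    if n == 0:
        return []

    same = [
        1 if a == b and a != gap else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]

    profile = []
    # If the alignment is shorter than one window, a single shorter window is scored.
    for start in range(0, max(n - window, 0) + 1, step):
        chunk = same[start:start + window]
        profile.append((start, 100.0 * sum(chunk) / len(chunk)))
    return profile


def conserved_windows(profile, threshold=70.0):
    """Start positions of windows whose identity meets the (illustrative) threshold."""
    return [start for start, pct in profile if pct >= threshold]


# Toy alignment: 20 identical columns followed by 10 mismatched columns
seq_a = "ACGT" * 5 + "A" * 10
seq_b = "ACGT" * 5 + "C" * 10
profile = window_identity(seq_a, seq_b, window=10, step=10)
print(profile)
print(conserved_windows(profile))
```

With real data the two inputs would be the aligned mouse and human genomic sequences, and the window starts could be compared against the exon coordinates in the annotation file to see which conserved windows fall outside known exons.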

Exercise 2 Files