Introduction

The sequence of the human genome is essentially complete. Sequences of other vertebrate genomes are available in various stages of completion, the most advanced being that of the mouse. Sequences of a variety of non-vertebrate genomes, from fly to worm to bacteria to virus, are being produced almost daily.

The challenge the genome community now faces is to use this vast amount of sequence to identify all the functional components encoded within it and to understand their roles. Genes are a major subset of these functional units, so identifying all the genes is critically important if the genome sequences are to be fully utilized.

Genes are currently identified within genome sequences using several methods, listed below in decreasing order of confidence. The terms used for each category are not consistent across the annotation community, but the general principles still apply. The key is the supporting evidence: in general, the most confidently annotated genes have the best supporting evidence.

  1. Known genes - as catalogued in LocusLink/RefSeq
  2. Novel genes based on similarity to known genes - as annotated e.g. by Ensembl
  3. Novel genes based on the presence of an EST match - including EST genes in Ensembl, and the DOTs genes
  4. Putative or predicted genes, as identified by gene prediction programs such as Genscan and TWINSCAN
  5. Pseudogenes - a minefield!
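
The ranking above can be written down as a simple lookup, which is handy when categorising a list of genes by hand. The following Python sketch is only an illustration: the category labels (`known`, `similarity`, `est`, `ab_initio`, `pseudogene`), the function name and the gene names are hypothetical and do not correspond to any browser's own vocabulary or API.

```python
# Hypothetical category labels, ranked from most (1) to least (5) confident,
# mirroring the five categories in the list above.
CATEGORY_RANK = {
    "known": 1,       # catalogued in LocusLink/RefSeq
    "similarity": 2,  # novel, based on similarity to a known gene
    "est": 3,         # novel, based on an EST match
    "ab_initio": 4,   # predicted by a program such as Genscan or TWINSCAN
    "pseudogene": 5,  # treat with caution
}


def sort_by_confidence(genes):
    """Sort (gene_name, category) pairs from most to least confident."""
    return sorted(genes, key=lambda gene: CATEGORY_RANK[gene[1]])


# Placeholder gene names, made up for illustration only.
genes = [("geneX", "ab_initio"), ("geneY", "known"), ("geneZ", "est")]
ordered = sort_by_confidence(genes)
```

Sorting by rank puts the best-supported annotations first, so the genes that most need a closer look at their evidence end up at the bottom of the list.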

Aims

  • To understand the accuracy of the genes presented in the various genome browsers, including Ensembl and UCSC.
  • Using Ensembl or UCSC, to analyse a "region of interest" for all the genes being annotated. At the end of this module, participants should be able to categorise genes based on the supporting evidence.

Exercise - Analyze genes annotated within a given region of the mouse genome

  1. Go to mouse Ensembl from the Ensembl home page (http://www.ensembl.org)
  2. Click on the chromosome 5 ideogram
  3. Enter the region between markers D5Mit128 and D5Mit107 (this is your "region of interest")
  4. At this stage the Ensembl contigview page is shown, but the region is too large for the genes to be displayed in detail, so only the overview appears. Click on part of your region to bring up the "detailed view"
  5. Working from one end of the region to the other, identify the different types of genes within the region
    1. What is the total number of known and novel genes?
    2. Are any of the genes catalogued in RefSeq? If so, what is their status?
    3. For the known genes, find their "RefSeq" status (e.g. provisional)
    4. Can you see EST genes? What data do they add to the annotation?
    5. Take a look at the ab initio gene prediction programs such as Genscan and TWINSCAN - do they add any more information?
  6. Now go to the same region in UCSC (http://genome.ucsc.edu/cgi-bin/hgGateway). (At the time of printing, it was not possible to view a region between two Mit markers in UCSC)
    1. Do a count similar to the one you did in Ensembl. What do you notice about the differences between the Ensembl annotation and the UCSC annotation?
    2. Which do you prefer?
  7. Go back to Ensembl and see if you can export all the known and novel genes in the region using EnsMart. Export the file in Excel format to view. See what other features might be useful to include.
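
Once the genes have been exported, the counts asked for in steps 5 and 6 can be tallied with a short script instead of by eye. The sketch below is a minimal Python example under stated assumptions: it expects a tab-delimited file with a header row, and the column names `gene_id` and `category`, the function names and the sample rows are all hypothetical. The actual columns depend on what you choose to export.

```python
import csv
import io
from collections import Counter


def count_by_category(handle, category_column="category"):
    """Count rows per gene category in a tab-delimited export with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[category_column] for row in reader)


def count_differences(counts_a, counts_b):
    """Return {category: count_a - count_b} for every category seen in either source."""
    categories = set(counts_a) | set(counts_b)
    return {c: counts_a.get(c, 0) - counts_b.get(c, 0) for c in sorted(categories)}


# Made-up placeholder rows, only to show the assumed file shape.
ensembl_export = io.StringIO(
    "gene_id\tcategory\n"
    "geneA\tknown\n"
    "geneB\tknown\n"
    "geneC\tnovel\n"
)
ucsc_export = io.StringIO(
    "gene_id\tcategory\n"
    "geneA\tknown\n"
    "geneD\tnovel\n"
    "geneE\tnovel\n"
)

ensembl_counts = count_by_category(ensembl_export)
ucsc_counts = count_by_category(ucsc_export)
differences = count_differences(ensembl_counts, ucsc_counts)
```

To use it on a real export, replace the `io.StringIO` objects with files opened via `open(path, newline="")`. A positive value in `differences` means the first source annotates more genes in that category than the second.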

Transcription Factor Binding Sites

MatInspector | TRANSFAC | Mirror

TRANSFAC The Transcription Factor Database

Vertebrate homeotic Hox proteins nomenclature and index (ExPASy)

DPInteract A database on DNA-protein interactions

TFSEARCH Searching Transcription Factor Binding Sites (Yutaka Akiyama)

Transcription factor binding site clusters in the human genome David J. States

Transcription Factor Database a database of the DNA recognition sequences for eukaryotic and prokaryotic sequence-specific transcription factors

Identifying Transcription Factor Binding Sites MatInspector, ModelInspector & Signalscan

Transcription Factor Imaging with the Atomic Force Microscope AFM imaging Functional Genomics Initiative at ORNL

Evaluation of computer tools for the prediction of transcription Emmanuelle Roulet et al.

Heat Shock Transcription Factor Cho, H.S et al.
