Introduction
The sequence of the human genome is essentially complete. Sequences of other vertebrate genomes are available in various stages of completion, the most advanced being that of the mouse. Sequences of a variety of non-vertebrate genomes, from fly to worm to bacteria to virus, are being produced almost daily!
The challenge the genome community now faces is to use this vast amount of sequence to identify all the functional components encoded within it and to understand their roles. Genes are a major subset of these functional units, so identifying all the genes is of critical importance if the genome sequences are to be fully utilized.
Genes are currently identified within genome sequences using different methods, listed below in decreasing order of confidence. The terms used for each category are not consistent throughout the annotation community, but the general principles still apply. The key is the supporting evidence: in general, the most confidently annotated genes have the best supporting evidence.
- Known genes, as catalogued in LocusLink/RefSeq
- Novel genes based on similarity to known genes, as annotated e.g. by Ensembl
- Novel genes based on the presence of an EST match, including EST genes in Ensembl and the DOTs genes
- Putative or predicted genes, as identified by gene prediction programs such as Genscan and TWINSCAN
- Pseudogenes: a minefield!
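The ordering above can be sketched in a few lines of Python. This is only an illustration: the category names, the numeric ranks, the helper functions and the gene names are all assumptions made for the example, not terms or data used by Ensembl, UCSC or RefSeq.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Hypothetical evidence categories; a lower value means higher confidence."""
    KNOWN = 1       # catalogued in LocusLink/RefSeq
    SIMILARITY = 2  # novel, similar to a known gene
    EST = 3         # novel, supported by an EST match
    PREDICTED = 4   # gene prediction program only (e.g. Genscan, TWINSCAN)
    PSEUDOGENE = 5  # treat with caution


def most_confident(evidence):
    """Return the best-supported category among the evidence for one gene."""
    if not evidence:
        raise ValueError("a gene needs at least one line of evidence")
    return min(evidence)


def rank_genes(genes):
    """Sort (name, evidence list) pairs from most to least confidently annotated."""
    return sorted(genes, key=lambda gene: most_confident(gene[1]))


# Placeholder gene names, not real annotation.
genes = [
    ("gene_A", [Evidence.PREDICTED]),
    ("gene_B", [Evidence.EST, Evidence.KNOWN]),
    ("gene_C", [Evidence.SIMILARITY, Evidence.PREDICTED]),
]
for name, evidence in rank_genes(genes):
    print(name, most_confident(evidence).name)
```

A gene supported by several kinds of evidence is assigned its strongest category here; other conventions are possible.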
Aims
- To understand the accuracy of the genes presented at the various genome browsers, including Ensembl and UCSC.
- Using Ensembl or UCSC, to analyse a "region of interest" for all the genes annotated within it. At the end of this module, participants should be able to categorise genes based on the supporting evidence.
Exercise: Analyze genes annotated within a given region of the mouse genome
- Go to mouse Ensembl from the Ensembl home page (http://www.ensembl.org)
- Click on the chromosome 5 ideogram
- Enter the region between D5Mit128 and D5Mit107 (this is your "region of interest")
- At this stage the Ensembl contigview page is shown, but the region is too large to view genes in detail, so only the overview is shown. Click on part of your region to bring up the "detailed view"
- Now, working from one end of the region to the other, identify the different types of genes within the region
- What is the total number of known and novel genes?
- Are any of the genes catalogued in RefSeq? If so, what is their status?
- For the known genes, find their RefSeq status (e.g. provisional)
- Can you see EST genes? What data do they add to the annotation?
- Take a look at the ab initio gene prediction programs such as Genscan and TWINSCAN: do they add any more information?
- Now go to the same region in the UCSC browser (http://genome.ucsc.edu/cgi-bin/hgGateway). (At the time of printing, it was not possible to view a region between two Mit markers in UCSC.)
- Do a count similar to the one you did in Ensembl. What differences do you notice between the Ensembl annotation and the UCSC annotation?
- Which do you prefer?
- Go back to Ensembl and see if you can export all the known and novel genes in the region using EnsMart. Export the file in Excel format to view. See what other features might be useful to include.
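Once the genes have been exported, the counts asked for in the exercise can be tallied with a short script. The sketch below assumes the export was saved as tab-delimited text with a header row. The column names `gene_id` and `category`, the function name and the sample rows are hypothetical placeholders, not the actual EnsMart output format.

```python
import csv
import io
from collections import Counter


def count_by_category(handle, column="category"):
    """Count rows per value of `column` in a tab-delimited file with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column].strip().lower() for row in reader)


# Placeholder rows standing in for an exported file; not real annotation.
sample = io.StringIO(
    "gene_id\tcategory\n"
    "gene_A\tknown\n"
    "gene_B\tnovel\n"
    "gene_C\tKnown\n"
)
counts = count_by_category(sample)
print("known:", counts["known"], "novel:", counts["novel"])
```

To run it on a real export, open the file with `open(path, newline="")` and pass the handle in place of `sample`. Running the same tally on lists taken from two browsers gives a quick way to compare their annotations.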
Transcription Factor Binding Sites
- MatInspector | TRANSFAC | Mirror
- TRANSFAC: The Transcription Factor Database
- Vertebrate homeotic Hox proteins nomenclature and index (ExPASy)
- DPInteract: A database on DNA-protein interactions
- TFSEARCH: Searching Transcription Factor Binding Sites (Yutaka Akiyama)
- Transcription factor binding site clusters in the human genome (David J. States)
- Transcription Factor Database: a database of the DNA recognition sequences for eukaryotic and prokaryotic sequence-specific transcription factors
- Identifying Transcription Factor Binding Sites: MatInspector, ModelInspector & SignalScan
- Transcription Factor Imaging with the Atomic Force Microscope: AFM imaging, Functional Genomics Initiative at ORNL
- Evaluation of computer tools for the prediction of transcription (Emmanuelle Roulet et al.)
- Heat Shock Transcription Factor (Cho, H.S et al.)