It is well known both that the speed of sequencing far outstrips the speed at which sequence can be experimentally analyzed to identify new genes and classify them according to function, and that combining computational and experimental evidence significantly speeds this analysis. More accurate algorithms will be economically as well as scientifically advantageous, since the decision to perform an experiment is often based in part on computational results. This proposal is to significantly improve the state of the art of computational methods for gene identification and classification in each of three areas:

1. Benchmarking. Current algorithms often fail to use the best methods, primarily because most methods are of unknown accuracy. Recently, Fickett & Tung made the first comprehensive assessment of coding region detection measures. This study showed that while many packages still base coding region detection on codon counts, in-phase hexamer counts can give better accuracy. It also showed that merely combining the six best coding measures with a linear discriminant improves on the already impressive Coding Recognition Module of GRAIL. Further assessment will be done for decision methods; for transcription, splicing, and translation signal detection; and for characterization of overall gene syntax. The best methods will be refined.

2. Biology. Current algorithms incorporate a number of elegant computational and statistical techniques, but none incorporates a model of transcription, splicing, and translation that is current with biological understanding. The Kozak rules for location of the translation initiation codon provide one clear example. Another is that, while it is not yet possible to describe eukaryotic promoters in detail, the current norm of always requiring a simple consensus CAAT and TATA box can be improved upon. Also, it can be shown that taking the domain structure of genomes into account reduces prediction errors by 20%.

3. Integration. Most investigators currently gather information independently (to a first approximation) from experiment, from database searches, and from gene identification algorithms, and afterwards mentally integrate it to arrive at tentative locations and possible functions of genes in a sequence. However, data from each of these sources can influence not only the interpretation of data from the others, but even which data are brought to one's attention. Under this proposal, algorithms will be developed that can take voluminous low-level data from the three sources and give an overall summary consistent (insofar as possible) with all the information.
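
As an illustration of the in-phase hexamer measure mentioned under Benchmarking, the following is a minimal sketch, not the procedure assessed by Fickett & Tung. It assumes a simple log-odds score with pseudocounts, trained on example coding and noncoding sequences; the function names (`inphase_hexamer_counts`, `train_log_odds`, `score_window`) and the pseudocount value are hypothetical choices made for this example.

```python
import math
from collections import Counter

BASES = set("ACGT")
N_HEXAMERS = 4 ** 6  # number of possible hexamers over A, C, G, T


def inphase_hexamer_counts(seq, frame=0):
    """Count hexamers that start on codon boundaries of the given frame."""
    seq = seq.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 5, 3):
        hexamer = seq[i:i + 6]
        if set(hexamer) <= BASES:  # skip hexamers with ambiguous bases
            counts[hexamer] += 1
    return counts


def all_hexamer_counts(seq):
    """Count hexamers at every position (noncoding sequence has no frame)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 5):
        hexamer = seq[i:i + 6]
        if set(hexamer) <= BASES:
            counts[hexamer] += 1
    return counts


def train_log_odds(coding_seqs, noncoding_seqs, pseudocount=1.0):
    """Return a function mapping a hexamer to log(P_coding / P_noncoding).

    Assumption: coding_seqs are given in frame 0 (first base = codon position 1).
    """
    coding, noncoding = Counter(), Counter()
    for s in coding_seqs:
        coding.update(inphase_hexamer_counts(s, frame=0))
    for s in noncoding_seqs:
        noncoding.update(all_hexamer_counts(s))
    c_total = sum(coding.values()) + pseudocount * N_HEXAMERS
    n_total = sum(noncoding.values()) + pseudocount * N_HEXAMERS

    def log_odds(hexamer):
        p_coding = (coding[hexamer] + pseudocount) / c_total
        p_noncoding = (noncoding[hexamer] + pseudocount) / n_total
        return math.log(p_coding / p_noncoding)

    return log_odds


def score_window(seq, log_odds, frame=0):
    """Mean log-odds over in-phase hexamers; positive suggests coding."""
    counts = inphase_hexamer_counts(seq, frame)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum(log_odds(h) * c for h, c in counts.items()) / n


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    log_odds = train_log_odds(["ATGGCC" * 20], ["ATATAT" * 20])
    print(score_window("ATGGCCATGGCC", log_odds))
    print(score_window("ATATATATATAT", log_odds))
```

In practice such a score would be evaluated in all three frames on both strands, and, as the abstract notes, combined with other coding measures through a discriminant rather than used alone.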

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG000981-01A1
Application #: 2209225
Study Section: Genome Study Section (GNM)
Project Start: 1994-08-01
Project End: 1997-05-31
Budget Start: 1994-08-01
Budget End: 1995-05-31
Support Year: 1
Fiscal Year: 1994
Total Cost:
Indirect Cost:
Name: Los Alamos National Lab
Department:
Type: Organized Research Units
DUNS #:
City: Los Alamos
State: NM
Country: United States
Zip Code: 87545
Wasserman, W W; Palumbo, M; Thompson, W et al. (2000) Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26:225-8
Wasserman, W W; Fickett, J W (1998) Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 278:167-81
Guigo, R (1998) Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol 5:681-702
Fickett, J W (1998) Predictive methods using nucleotide sequences. Methods Biochem Anal 39:231-45
Fickett, J W; Hatzigeorgiou, A G (1997) Eukaryotic promoter recognition. Genome Res 7:861-78
Guigo, R (1997) Computational gene identification: an open problem. Comput Chem 21:215-22
Fickett, J W (1996) Coordinate positioning of MEF2 and myogenin binding sites. Gene 172:GC19-32
Fickett, J W (1996) Finding genes by computer: the state of the art. Trends Genet 12:316-20
Fickett, J W (1996) Quantitative discrimination of MEF2 sites. Mol Cell Biol 16:437-41
Burset, M; Guigo, R (1996) Evaluation of gene structure prediction programs. Genomics 34:353-67