It is well known both that the speed of sequencing far outstrips the speed at which sequence can be experimentally analyzed to identify new genes and classify them according to function, and that combining computational and experimental evidence significantly speeds this analysis. More accurate algorithms will be economically as well as scientifically advantageous, since the decision to perform an experiment is often based in part on computational results. This proposal is to significantly improve the state of the art of computational methods for gene identification and classification in each of three areas:

1. Benchmarking. Current algorithms often fail to use the best methods, primarily because the accuracy of most methods is unknown. Recently, Fickett & Tung made the first comprehensive assessment of coding region detection measures. This study showed that, while many packages still base coding region detection on codon counts, in-phase hexamer counts can give better accuracy. It also showed that merely combining the six best coding measures with a linear discriminant improves on the already impressive Coding Recognition Module of GRAIL. Further assessment will be done for decision methods; for transcription, splicing, and translation signal detection; and for characterization of overall gene syntax. The best methods will be refined.

2. Biology. Current algorithms incorporate a number of elegant computational and statistical techniques, but none incorporates a model of transcription, splicing, and translation that is current with biological understanding. The Kozak rules for location of the translation initiation codon provide one clear example. Another is that, while it is not yet possible to describe eukaryotic promoters in detail, the current norm of always requiring a simple consensus CAAT and TATA box can be improved upon. Also, it can be shown that taking the domain structure of genomes into account reduces prediction errors by 20%.

3. Integration. Most investigators currently gather information independently (to a first approximation) from experiment, from database searches, and from gene identification algorithms, and afterwards mentally integrate it to arrive at tentative locations and possible functions of genes in a sequence. However, data from each of these sources can influence not only the interpretation of data from the others, but even which data are brought to one's attention. Under this proposal, algorithms will be developed that can take voluminous low-level data from the three sources and give an overall summary consistent (insofar as possible) with all the information.
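
As an illustration of the in-phase hexamer counts mentioned in the first area above, the following is a minimal sketch, not the measure as assessed by Fickett & Tung. It assumes a log-likelihood-ratio formulation: hexamer frequencies are tabulated at codon boundaries in known coding sequence and at every position in noncoding sequence, and a window is scored by summing the log ratio over its in-phase hexamers. All function names, the pseudocount smoothing, and the toy training sequences are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def hexamers(seq, start=0, step=3):
    """Yield the 6-mers of seq that begin at start, start+step, ..."""
    seq = seq.upper()
    for i in range(start, len(seq) - 5, step):
        yield seq[i:i + 6]


def hexamer_model(seqs, step, pseudocount=1.0):
    """Return a function giving the smoothed relative frequency of a hexamer.

    step=3 counts only hexamers at codon boundaries (in phase);
    step=1 counts hexamers at every position.
    """
    counts = Counter()
    for s in seqs:
        counts.update(hexamers(s, 0, step))
    # Assumed smoothing: add a pseudocount for each of the 4**6 hexamers.
    total = sum(counts.values()) + pseudocount * 4 ** 6
    return lambda h: (counts[h] + pseudocount) / total


def coding_score(window, coding, noncoding, frame=0):
    """Sum of log(coding / noncoding) over in-phase hexamers of the window."""
    return sum(math.log(coding(h) / noncoding(h))
               for h in hexamers(window, frame, 3))


def best_frame(window, coding, noncoding):
    """Return the reading frame (0, 1, or 2) with the highest score."""
    return max(range(3),
               key=lambda f: coding_score(window, coding, noncoding, f))


# Toy training data, invented purely for illustration.
coding = hexamer_model(["ATGGCC" * 10], step=3)
noncoding = hexamer_model(["ATATAT" * 10], step=1)

print(coding_score("ATGGCCATGGCC", coding, noncoding))  # positive: coding-like
print(coding_score("ATATATATATAT", coding, noncoding))  # negative: noncoding-like
print(best_frame("ATGGCCATGGCC", coding, noncoding))
```

In practice such a score would be one of several coding measures; the proposal's point about combining measures with a linear discriminant would apply to a vector of scores like this one, computed per window.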