The long-term objective is to develop the computer technology needed to accomplish the objectives of the Human Genome Project and to apply that technology to the analysis and management of sequencing data. Currently, a database search for sequence similarities represents the most direct computational approach to the analysis of genomic information. However, such searches are becoming ever more forbidding because of the accelerating growth of sequencing data. The goal of the proposed research is to further develop and enhance a software tool for speedy classification of unknown sequences, and to make it available to the genome community. The research will build upon a pilot system, designed and developed by the principal investigator, that has shown great promise.
The specific aims are (1) to enhance the tool for speedy identification of PIR superfamilies and ProSite patterns, (2) to develop a pilot DNA/RNA classification system, (3) to distribute the tool, and (4) to aid the organization of the PIR protein database and the RDP ribosomal RNA database. Unlike other search methods, whose search time grows linearly with the number of entries in the database, the search time of the proposed tool grows with the number of families, which is likely to remain low. The tool would automate family assignment, which is especially important for managing the influx of new data in a timely manner. The proposed research applies neural network technology to the database search and organization problem. The major design principles are an encoding schema that extracts sequence information and a modular architecture that scales up backpropagation networks. The encoding algorithm is a hashing function similar to the k-tuple method. A pilot system has been implemented on a Cray supercomputer to classify electron transfer proteins and enzymes. The system achieves about 90% accuracy and about 50 times the speed of other search methods. The speed advantage may reach 1000-fold within a decade if the database continues to grow at the current rate. In the proposed research, the sensitivity of the tool would be improved and a full-scale system would be developed. The automated software tool would be portable at the source code, user interface, and hardware levels. The system would be updated in accordance with database releases and distributed to the research community via anonymous ftp. The tool would be used to classify PIR sequences according to superfamilies and to classify ribosomal RNA sequences according to phylogenetic relations.
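The abstract describes the encoding only as a hashing function similar to the k-tuple method, so the details are not given here. As a rough illustration of the general idea, the sketch below maps a sequence of any length to a fixed-length vector of k-tuple frequencies, which is the kind of input a backpropagation network could accept. The function name `ktuple_encode`, the 20-letter amino acid alphabet, the default `k=2`, and the normalization to frequencies are all assumptions made for this example, not the project's actual scheme.

```python
from itertools import product

# Assumption: the 20 standard amino acids; the project's alphabet is not stated.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ktuple_encode(sequence, k=2, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of k-tuple frequencies for a sequence.

    Each possible k-tuple over the alphabet is hashed to one vector position,
    so the output length is len(alphabet) ** k regardless of sequence length.
    """
    # Hypothetical hashing step: enumerate every k-tuple and assign it a slot.
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)

    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # skip k-tuples with letters outside the alphabet
            counts[pos] += 1

    # Normalize to frequencies so long and short sequences are comparable.
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


if __name__ == "__main__":
    protein_vec = ktuple_encode("MKVLAAGIVGL", k=2)
    rna_vec = ktuple_encode("ACGUACGGU", k=3, alphabet="ACGU")
    print(len(protein_vec), len(rna_vec))
```

Because the vector length depends only on the alphabet size and `k`, a classifier trained on such vectors does one fixed-size evaluation per query rather than one comparison per database entry, which is consistent with the abstract's claim that search time tracks the number of families rather than the number of entries.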

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: First Independent Research Support & Transition (FIRST) Awards (R29)
Project #: 5R29LM005524-05
Application #: 2445394
Study Section: Genome Study Section (GNM)
Program Officer: Bean, Carol A
Project Start: 1993-07-01
Project End: 1999-09-30
Budget Start: 1997-07-01
Budget End: 1999-09-30
Support Year: 5
Fiscal Year: 1997
Total Cost:
Indirect Cost:

Name: University of Texas Health Center at Tyler
Department: Public Health & Prev Medicine
Type: Other Domestic Higher Education
DUNS #:
City: Tyler
State: TX
Country: United States
Zip Code: 75708

Wu, C H; Shivakumar, S; Huang, H (1999) ProClass Protein Family Database. Nucleic Acids Res 27:272-4
Wu, C H; Shivakumar, S (1998) ProClass protein family database: new version with motif alignments. Pac Symp Biocomput :719-30
Wu, C H (1997) Artificial neural networks for molecular sequence analysis. Comput Chem 21:237-56
Wu, C H; Chen, H L; Lo, C J et al. (1996) Motif identification neural design for rapid and sensitive protein family search. Pac Symp Biocomput :674-85
Wu, C H; Zhao, S; Chen, H L et al. (1996) Motif identification neural design for rapid and sensitive protein family search. Comput Appl Biosci 12:109-18
Wu, C H; Zhao, S; Chen, H L (1996) A protein class database organized with ProSite protein groups and PIR superfamilies. J Comput Biol 3:547-61
Wu, C; Shivakumar, S (1994) Back-propagation and counter-propagation neural networks for phylogenetic classification of ribosomal RNA sequences. Nucleic Acids Res 22:4291-9