Determining motifs that classify entries in a database is important, especially when little is known about the motifs that define a specific family. This project, a form of data mining, centers on computationally finding the largest approximately identical substructures in a set of data objects. The discovered substructures, or motifs, are then tested against the data to see how well they characterize it, that is, how well they serve as classifiers. The database of objects can consist of entities such as sequences, trees, graphs, or records. We have applied these classification techniques to biological databases in the following three areas: 1) 3-D graphs representing biomolecules. We showed a 91% precision rate in determining motifs that classify three different families of molecules (Z01 BC 10045-02 LMMB to LECB). 2) Tree structures representing RNA secondary structure. We discovered tree motifs that classified three families of RNA structures. 3) Strings representing protein sequences. Using five methods for protein sequence classification, one of them our own, we found that the methods give complementary information. Thus, by using the five methods together, one can obtain high-confidence classifications or suggest alternative hypotheses.
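
The general idea of discovering approximate motifs in a family and then scoring them as classifiers can be illustrated with a minimal Python sketch for the string case. It is not the project's algorithm: it assumes fixed-length motifs, allows only substitutions (Hamming distance), and takes precision to mean true positives divided by all sequences the motif matches. The function names, parameters, and toy sequences are all hypothetical.

```python
def approx_occurs(motif, seq, max_mismatches):
    """Return True if motif matches some window of seq with at most
    max_mismatches substitutions (no insertions or deletions)."""
    m = len(motif)
    for i in range(len(seq) - m + 1):
        window = seq[i:i + m]
        mismatches = sum(1 for a, b in zip(motif, window) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def discover_motifs(family, length, max_mismatches, min_support):
    """Take every substring of the given length from the family as a
    candidate, and keep those that approximately occur in at least
    min_support family members."""
    candidates = {
        s[i:i + length]
        for s in family
        for i in range(len(s) - length + 1)
    }
    return sorted(
        c for c in candidates
        if sum(approx_occurs(c, s, max_mismatches) for s in family) >= min_support
    )


def precision(motif, positives, negatives, max_mismatches):
    """Fraction of sequences matched by the motif that belong to the family."""
    tp = sum(approx_occurs(motif, s, max_mismatches) for s in positives)
    fp = sum(approx_occurs(motif, s, max_mismatches) for s in negatives)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy data, invented for illustration only.
family = ["ACGTGCA", "TTCGTGA", "GCGAGCC"]
others = ["TTTTTTT", "AAACCCA"]

motifs = discover_motifs(family, length=4, max_mismatches=1, min_support=3)
for m in motifs:
    print(m, precision(m, family, others, max_mismatches=1))
```

Raising `max_mismatches` makes motifs match more family members but also more outsiders, so precision on sequences outside the family is what separates a useful motif from a merely frequent one.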

Agency
National Institutes of Health (NIH)
Institute
National Cancer Institute (NCI)
Type
Intramural Research (Z01)
Project #
1Z01BC010045-02
Application #
6161135
Study Section
Special Emphasis Panel (LECB)
Project Start
Project End
Budget Start
Budget End
Support Year
2
Fiscal Year
1997
Total Cost
Indirect Cost
Name
National Cancer Institute Division of Basic Sciences
Department
Type
DUNS #
City
State
Country
United States
Zip Code