This work involves a form of data mining: the determination of motifs that may exist in a database of objects, and the fast determination of distances between objects, which may be used for clustering and data visualization. This is especially significant when little is known in advance about the motifs. Ultimately, one would like to determine whether the motifs and clusters discovered can act as good classifiers. The data objects may be sequences, trees, graphs or records. Examples of the use of these methods include: 1) the determination of 3D motifs in bio-molecules. The motifs that the algorithms find are rigid substructures which may occur in a graph after allowing for an arbitrary number of rotations and translations, as well as a small number of node insert/delete operations in the motifs or graphs. By combining a "geometric hashing" technique with "block detection" algorithms for undirected graphs, we are able to find approximate motifs in a set of graphs; 2) the determination of the largest approximately common substructures of two trees, based on an edit distance metric. Using a method known as "selective memorization", the algorithm was used to discover motifs in multiple RNA secondary structures, which can be represented as trees; 3) pattern discovery in sequence data. Protein sequences were classified with a 98% precision rate; 4) more recently, a new index structure has been developed that takes a set of objects and a distance metric and maps those objects into a k-dimensional space in such a way that the distances are approximately preserved. This index structure is a useful tool for clustering and visualization in data-intensive applications. Clustering of large databases can thus be made practical, as for example in the clustering of RNA conformations.
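
The abstract does not say how the index structure in item 4 computes its mapping, so the following is only an illustrative sketch of the general idea: given objects and a distance function, assign each object k coordinates so that distances are approximately preserved. It uses a FastMap-style pivot projection, a different, well-known technique with the same goal, and not necessarily the method developed in this project. The names `fastmap` and `residual_sq` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def fastmap(objects: Sequence[T],
            dist: Callable[[T, T], float],
            k: int) -> List[List[float]]:
    """Map each object to k coordinates that approximately preserve `dist`.

    Illustrative FastMap-style sketch (an assumption, not the project's
    own index structure).
    """
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i: int, j: int, dim: int) -> float:
        # Squared distance left over after subtracting what the first
        # `dim` coordinates already account for (clamped at zero).
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Heuristic pivot choice: take the object farthest from object 0,
        # then the object farthest from that one.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, dim))
        a = max(range(n), key=lambda j: residual_sq(b, j, dim))
        dab2 = residual_sq(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to explain; remaining axes stay at 0
        dab = math.sqrt(dab2)
        # Project every object onto the line through the two pivots
        # using the law of cosines.
        for i in range(n):
            coords[i][dim] = (residual_sq(a, i, dim) + dab2
                              - residual_sq(b, i, dim)) / (2.0 * dab)
    return coords


if __name__ == "__main__":
    points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
    embedded = fastmap(points, math.dist, k=2)
    for p, e in zip(points, embedded):
        print(p, "->", [round(x, 3) for x in e])
```

Once objects have coordinates, standard vector-space clustering and plotting tools can be applied to them, which is the practical benefit the abstract describes for large databases.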

Agency
National Institutes of Health (NIH)
Institute
Division of Basic Sciences - NCI (NCI)
Type
Intramural Research (Z01)
Project #
1Z01BC010045-07
Application #
6762691
Study Section
(LECB)
Project Start
Project End
Budget Start
Budget End
Support Year
7
Fiscal Year
2002
Total Cost
Indirect Cost
Name
Basic Sciences
Department
Type
DUNS #
City
State
Country
United States
Zip Code