The wide use of the internet coupled with the steadily decreasing cost in computing and storage has led to an expansion of the data that users expect to retrieve from simple numeric and alphanumeric, to include images,audio, video, where the retrieval criterion is one of similarity. An inherent difficulty with similarity retrieval is deciding on a criterion for the similarity.
Intellectual Merit
This proposal explores issues involved in retrieval that is based on several criteria of similarity:
(1) In the spatial data context, similarity is usually defined in terms of proximity in spatial position. Instead, this research will examine similarity in relative spatial orientation. Thus there will be a focus on groups of objects, In particular, attention will be paid to position-independent indexes that find use in applications involving pictorially-specified queries to symbolic image databases. (2) Issues involved in finding the k nearest types of objects rather than the k nearest objects will be explored. At present, it is the type of the objects that is of interest rather than their identity. This differentiation is similar to the classic distinction made in spatial databases between location-based and feature-based queries. (3) Also of interest will be finding similarity between sets of objects where the similarity measure is the maximum of the minimum distance between objects in the two sets (the Hausdorff distance). Of particular interest will be methods that are incremental so that data is obtained as quickly as possible without waiting for the algorithms to complete.
Broad Impacts
Position-independent indexing has application to pictorially-specified image databases as well as to databases of moving objects which can be used for searches that include traffic data. The Hausdorff distance is applicable to multimedia databases as is the focus on retrieval by type rather than individual objects. Via a collaboration with the National Cancer Institute, the research team will investigate what makes a good similarity measure for bioinformatics applications.
The goal of this project was to investigate issues in similarity retrieval over a wide spectrum of data types in applications including, but not limited to, document clustering, point set matching, and bioinformatics. The emphasis was on finding nearest neighbors using different similarity criteria. The key outcomes of the project can be divided into several parts which are differentiated by the corresponding problem domains. Our primary focus was on similarity searching in the document domain where our domain was streaming news articles and our goal was to automatically group news articles into sets of documents, termed ``clusters'', such that each cluster should only contain the documents encountered in the input stream seen so far pertaining to a specific topic. This problem is a variant of similarity searching in a very high-dimensional vector space, where dimensions correspond to the unique words in English language. Our work was motivated by our desire to display the clusters on a map associated with their principal geographic location. This requirement differentiated our approach from conventional ones in that it had to be an ``online'' approach which means that the clusters are formed and updated as the date is encountered rather than having to wait until all the data has been input. The reason we needed the online approach is that the news articles are constantly streaming in and thus must be placed in the appropriate cluster so that the results can be seen immediately. We also investigated the problem of similarity search over a collection of point sets. The ability to identify similar point sets with respect to a reference point set Q is useful in a number of applications. For example, given a geographical distribution of a current disease outbreak represented as a location set Q, an epidemiologist may wish to find k occurrences of outbreaks (from a set D of historical outbreak distributions) that are most similar to Q. These results can then be used to help identify correlations between the outbreak in question and other outbreaks. As another example, letting Q denote a location set of warehouses of a logistics company and D denote a collection of location sets of gas stations where each location set contains locations of gas stations within a single oil company, to form a partnership with an oil company, the logistics company may wish to find the oil company whose location set S minimizes the maximum distance from each warehouse in Q to the nearest station in S. For both of these problems we used the Hausdorff distance , which is a measure commonly used to determine the maximum discrepancy between two point sets. Intuitively, a set X is considered similar to another point set Y iff every point in X is close to at least one point in Y. Hausdorff distance HausDist(X,Y) can be computed as the Max-Min distance from X to Y, i.e., find the maximum of the distance from an element in X to its nearest neighbor (NN) in Y. We decomposed this problem into two subproblems: derivation of a Hausdorff distance computation algorithm, and formulation of a similarity search algorithm. We also started work on the problem of similarity search over an RNA junction database. Querying a collection of biomolecular structures such as DNA, protein and RNA structures is a complex task due to a large amount of data to process. For example, one RNA/DNA strand may contain up to thousands of base-pairs in length. The high potential of reducing the query processing cost as well as the large spectrum of associated applications have attracted considerable research attention from the bioinformatics and database communities. In our research, we investigated the problem of identifying similar biomolecular structures based on certain geometric properties. Applications that may benefit from similarity search over biomolecular structures include nano-structure design, drug design and drug discovery. Overall our work showed the wide domain of problems to which similarity searching techniques find use, and the benefit of expending more work on it.