An increasing amount of data is stored in an interconnected manner. Such data range from the Web (hyperlinked pages), to bibliographical data (graphs of citations), to biological data (associations between proteins, genes, and publications), to clinical data (associations between patients, hospitalizations, exams, and diagnoses). A critical need in leveraging the available data is enabling information discovery: given a question (query), find pieces of data, or associations between them in the data graph, that are "good" (relevant, authoritative, and specific) for the query, and rank them according to their "goodness". Submitting such queries should not require knowledge of a complex query language (e.g., SQL) or of the details of the data (e.g., the schema). Unfortunately, little has been done to provide high-quality information discovery on data graphs in domains other than the Web, where search engines have been successful. This project is expected to have the following broader impacts: (a) Promote the participation of minority students at FIU (one of the largest Hispanic institutes in the country) in the research process, in the form of independent or senior class projects. (b) Facilitate effective information discovery on biological and clinical data, which can lead to cost savings and increased research productivity in these domains. The results will be disseminated through publications, public Web demo systems, and the project Web site (http://dblab.cs.ucr.edu/projects/DGID/).
The research performed in this project makes searching complex data more effective. In particular, this work focuses on interconnected data such as patents, e-commerce products, social networks, and health records. For instance, each patient health record consists of data items such as the patient's age and diseases, and is linked to other patient records that have the same diseases. Algorithms were developed to decide how best to rank the results of queries on complex interconnected data, so that the most relevant results are displayed first. Further, methods were developed to let users navigate the results so that they can satisfy their search need (for instance, finding a desired set of used cars) in the minimum expected time. In addition, techniques were developed to decide which attributes of each result to display, so that users can judge whether a result is relevant without being overwhelmed by too much information. All developed techniques were experimentally evaluated in terms of both response time and result quality. The timing experiments show that the algorithms have fast response times, whereas the quality experiments show that users prefer a system that uses our algorithms over state-of-the-art systems. This project has produced two PhD dissertations and involved three MS and ten undergraduate students in the research process. The potential impact of this project on society is that users who need to search complex data may improve their productivity by finding their desired information faster and more effectively. For instance, biologists can discover existing knowledge more effectively and hence spend more time on their research.