Knowledge bases organize information into graphs of entities, and data exploration algorithms can leverage mathematical properties of these graphs to discover useful insights about the entities and their relationships. For example, data exploration algorithms can use the graph of Google Knowledge Base to identify people who have common interests, and can discover genes with similar behavior by analyzing the graph of the Genome Knowledge Base. Currently, data exploration tools tend to be sensitive to the details of how information is represented in these graphs: a tool may be highly effective over some choices of representation and much less effective over others, even when the underlying information is the same. As a result, data exploration has largely remained the province of experts and data scientists. This project seeks to overcome this dependency and enable a new generation of more general data exploration tools that ordinary users can use to explore data on their own, without an expert by their side.
More specifically, this project is creating effective similarity and proximity search algorithms that deliver the same results over various choices of representation for the underlying knowledge base. The key idea of the project is to use statistical metrics to quantify the degree of similarity between entities or patterns in a manner that is not sensitive to the specific representation of the data. This novel theoretical framework serves as the foundation of more general data exploration algorithms, whose generality and effectiveness are being validated on large real-world knowledge bases.
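To make the representation problem concrete, the following is a minimal Python sketch. It is a toy illustration only, not the project's statistical metric. It stores the same actor and film facts in two ways: once with direct actor-film edges, and once with each appearance reified as an intermediate role node. A structural measure (Jaccard similarity over immediate neighbors) gives different answers on the two graphs, while a measure defined over the films each actor is associated with gives the same answer on both. All names here (`alice`, `bob`, `film1`, `role0`, `reachable_films`) are hypothetical, and the rule that film nodes are recognized by a `film` prefix is an assumption made purely to keep the example short.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbors(edges, node):
    """Immediate neighbors of `node` in an undirected edge list."""
    return {v if u == node else u for u, v in edges if node in (u, v)}


def reachable_films(edges, start):
    """Films associated with `start`, walking through any intermediate
    non-film nodes but never expanding past a film.
    Assumption: film nodes are exactly those named 'film...'."""
    seen, films, queue = {start}, set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(edges, node):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt.startswith("film"):
                films.add(nxt)
            else:
                queue.append(nxt)
    return films


# Representation A: actors linked directly to films (hypothetical data).
direct_edges = [
    ("alice", "film1"), ("alice", "film2"),
    ("bob", "film1"), ("bob", "film2"), ("bob", "film3"),
]

# Representation B: the same facts, with each appearance reified as a role node.
reified_edges = []
for i, (actor, film) in enumerate(direct_edges):
    role = f"role{i}"
    reified_edges += [(actor, role), (role, film)]

# Structural similarity over immediate neighbors: depends on the representation.
naive_direct = jaccard(neighbors(direct_edges, "alice"), neighbors(direct_edges, "bob"))
naive_reified = jaccard(neighbors(reified_edges, "alice"), neighbors(reified_edges, "bob"))

# Similarity over associated films: identical on both representations.
films_direct = jaccard(reachable_films(direct_edges, "alice"), reachable_films(direct_edges, "bob"))
films_reified = jaccard(reachable_films(reified_edges, "alice"), reachable_films(reified_edges, "bob"))

print(naive_direct, naive_reified)
print(films_direct, films_reified)
```

The second measure agrees across the two graphs only because the sketch is told which nodes are films. A general framework of the kind described above would need to achieve this kind of invariance without such hand-coded knowledge of the representation.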