Virtually every human endeavor in a broad variety of disciplines in science, engineering, and finance, is encountering the need to make new discoveries through the analysis of massive datasets. Yet, the task of creating the necessary scalable analysis tools for modern high-performance parallel hardware is a daunting task, due to the complexity of developing efficient algorithms and subsequently having to parallelize and tune these algorithms for real systems. This research project aims to address this problem by developing a new domain-specific programming model, called the Tree-based High-Order Reduce (THOR) model, that enables rapid automatic implementation of customized parallel data analysis and mining tasks with minimal coding effort.
The key enabling insight behind THOR is the generalized n-body problem (GNP) theory, a mathematical formalism that unifies the expression of seemingly disparate statistical data analysis tasks, including n-point correlation, hierarchical clustering, k-nearest neighbors classification, and kernel density estimation, among numerous others. As its name suggests, a GNP elegantly generalizes the classical n-body problem from physics to a much broader class of problems. Most importantly, the GNP form permits the development of asymptotically fast solutions, e.g., generalized versions of the fast multipole method. The THOR model enables the data analyst to specify a GNP, from which the THOR program generator can automatically produce a highly tuned parallel implementation. In short, this project aims to show how a programming model, which is bound to an appropriately high-level mathematical formalism while having the simplicity of a model like MapReduce, can lead to both scalable data analysis algorithms and their efficient implementation.