Big data is important right now; and extreme scale hardware, programming models and storage solutions are being developed. However the connection between transformational algorithms and the big data sets needs to be addressed. Researchers talk often about the upcoming "big data problem," yet skim over the relevant algorithms that can truly attack the problems in a concrete fashion, often because the traditional means of data analysis are not prevalent in the high-performance computing (HPC) world, and in concert, the HPC expertise is not prevalent in the data world. This project will solidify this connection, making the "big data" problem real by attacking the issues through identifiable concrete algorithmic research and by implementing the methods on the latest hardware platforms. Recently the principal investigator has developed scalable desktop algorithms that bridge that gap, leveraging fast spectral solvers to compute solutions of sparse classification problems for big data. This project will build on these scalable algorithms to implement them on several large-scale platforms and will address important application areas, for which desktop computing is insufficient. Examples of application areas for this project include high dimensional hyperspectral video data for chemical and biological agents, a problem of importance to homeland security, and statistical analysis of spatio-temporal multimodal crime data, and large-scale social network analysis.
This project focuses on a new class of data-clustering algorithms that are designed to solve variants of the minimum cut problem on graphs for big data applications such as hyperspectral video data analysis, statistical analysis of spatio-temporal multimodal crime data, and large-scale social network analysis. Semi-supervised and unsupervised machine learning problems are included in the class of problems considered. The graph mincut problem is equivalent to total variation minimization on a graph and is a popular model for machine learning applications, except for its computational complexity. Building on ideas such as diffuse interfaces and dynamic thresholding, originally developed for physical sciences models and subsequently transferred to low dimensional image processing applications, this project will develop methods to solve the true graph cut problem by leveraging recent advances in scalable spectral graph algorithms. New codes for these methods will be developed for large parallel architectures. The research will advance both theoretical algorithmic issues and application areas.