New insights from machine learning are emerging across many domains, including, for example, medicine, social media, image processing, biology, and computer and network security. Machine learning is able to process large, high-dimensional data sets that are beyond human capabilities. One emerging method of machine learning is based on a branch of mathematics called topology and is sometimes able to discover knowledge that is not available using conventional methods. The field of topology is concerned with the shape of an object, and Persistent Homology is the critical method in topology used to extract the features of a shape. Persistent Homology classifies an object by the size and number of holes and voids in that object. Unfortunately, computing the Persistent Homology of an object requires significant amounts of memory and long run times, both of which increase exponentially with the number of points that form the object. This project will take the object formed by the data and subdivide it into smaller regions so that Persistent Homology can be computed on each region in parallel. The results from the regional analyses will then be assembled, and in a post-analysis step any duplicate results will be identified and removed and any missing results restored. The computation on all of the regions will be completed in substantially less time and with much less total memory than a single computation on the entire data set. The methods developed will be tested using a variety of synthetic and real-world data. The synthetic data will permit controlled studies of performance and scalability. Real-world data will be drawn from a variety of sources, especially data in which small topological features are significant (such as data from brain scans). This project will propel the application of topology-based analysis to discover new insights and meaningful information from massive high-dimensional data.
Student training in data mining through topology-based methods will be expanded with the addition of classes, projects (senior projects, MS theses, PhD dissertations, and so on), seminars, and research co-op training experiences. Students at all levels will be impacted, and special emphasis will be placed on the participation of minority and underrepresented student groups. This project will also participate in the Women in Science and Engineering programs at UC. The project investigators will engage local area K-12 students, international exchange students and researchers at UC's collaborative institutions, UC's Medical School, Cincinnati Children's Medical Center, the Air Force Research Lab, and local industries with information and seminars on this project's investigations and results.
This project proposes to combine the fields of Approximate Computing and Topological Data Analysis to dramatically reduce the computational and memory requirements of applying Topological Data Analysis to very large data sets. In particular, this project will develop approximate methods for computing Persistent Homology that substantially increase the sizes of data sets to which data mining methods based on Topological Data Analysis can be applied. This project expects to increase the size of the input data set that can be analyzed by Topological Data Analysis methods by at least 3 to 5 orders of magnitude. While approximate methods can introduce error, the features they identify will point to regions of the point cloud where upscaling steps and regional computations of Persistent Homology can be used (in parallel) to establish more precise boundaries of those features. The project will develop algorithmic improvements together with formal statements of the correctness, error bounds, and complexity of the algorithms and approximation techniques. These techniques have important implications for the ability to apply Topological Data Analysis to much larger data sets than is currently possible.
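As a rough illustration of the regional computation described above, the following minimal Python sketch computes only the 0-dimensional part of Persistent Homology (connected components) of a small point cloud, first on the whole cloud and then on two regions separately. It is not the project's algorithm: the function name `h0_persistence`, the toy point cloud, and the fixed two-way split are hypothetical, and the sketch assumes a Vietoris-Rips filtration in which an edge appears at the pairwise distance of its endpoints.

```python
from itertools import combinations
import math


def h0_persistence(points):
    """Return the death scales of the finite 0-dimensional classes.

    Every point is born as its own component at scale 0. Processing edges
    in order of increasing length, each edge that joins two different
    components ends (kills) one of them at that edge length. One component
    never dies and is not reported.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (quadratic in n).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge the two components
            deaths.append(length)
    return deaths


# Hypothetical toy cloud: two well-separated clusters on a line.
cloud = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
         (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]

whole = h0_persistence(cloud)       # [1.0, 1.0, 1.0, 1.0, 8.0]
left = h0_persistence(cloud[:3])    # [1.0, 1.0]
right = h0_persistence(cloud[3:])   # [1.0, 1.0]
```

In this toy case the regional results recover the four short-lived components but not the long-lived one that dies at scale 8.0, because that feature spans both regions. This is the kind of missing result that a post-analysis step would have to restore when regional outputs are assembled.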
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.