Despite the tremendous accomplishments of machine learning and deep learning in the past decade, challenges remain for structurally complex and diverse data. For example, a single data point in a database used for drug design might have tens of thousands of internal degrees of freedom, and the database itself may contain tens of thousands of such data points. This structural complexity is a major challenge for deep learning methods. Moreover, diverse data typically originate from sparse sampling of a huge space, a sparsity due in particular to the cost and time constraints of experimental data acquisition. This project will address the challenges of complex and diverse datasets with ideas that integrate mathematical techniques from several subfields, including algebraic topology, spectral graph theory, and multiscale analysis. The methods developed will apply to data representation, advanced machine learning methods, and deep learning algorithms, and will be implemented in software packages available to the community. This project will train graduate and undergraduate students and engage underrepresented groups in data science research.
This project will develop novel topology- and graph-theory-based approaches to revolutionize current practice in data analysis and to address the challenges of structurally complex and diverse data. First, the investigators will develop persistent combinatorial graph theory as a unified paradigm for simultaneous topological and spectral data analysis. In particular, they will develop systematic, scalable, and accurate persistent combinatorial graph representations to extract rich topological and spectral information. Second, the investigators will develop multiscale graph models that create a family of nested submanifolds to handle diverse data originating from sparsely sampled points in a huge space. These methods will be integrated with advanced machine learning and deep learning algorithms for complex and diverse datasets. Third, the proposed methods will be applied to a wide range of case studies in data science. User-friendly software packages and online servers will be developed using parallel and GPU architectures for researchers who are not formally trained in mathematics or machine learning.
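As a rough illustration of what tracking a graph across scales can mean, the following minimal sketch counts the connected components of a distance-threshold graph at several radii. This count equals the multiplicity of the zero eigenvalue of the graph Laplacian, so it touches only the simplest topological quantity; the nonzero spectrum would require an eigenvalue solver and is not shown. The function names, the sample points, and the radii are hypothetical choices made for illustration and are not the investigators' method or software.

```python
from itertools import combinations
from math import dist


def components_at_scale(points, radius):
    """Count connected components of the graph that links every pair of
    points whose Euclidean distance is at most `radius` (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    count = len(points)
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= radius:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j  # merge two components
                count -= 1
    return count


def betti0_curve(points, radii):
    """Return (radius, component count) pairs over a list of scales."""
    return [(r, components_at_scale(points, r)) for r in radii]


# Hypothetical toy data: two well-separated clusters in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
curve = betti0_curve(points, [0.5, 1.0, 1.5, 10.0])
print(curve)
```

In this toy example the component count falls from five isolated points to two clusters and finally to one component as the radius grows; a persistent analysis records how such quantities change across scales rather than at a single scale.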
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.