One of the fundamental tasks in science is to infer the causal relationships between variables from data, and to discover hidden phenomena that may affect those variables. We can attempt to automate this scientific process by searching over probabilistic models of how the observed data might be influenced by unobserved (latent) factors or variables. Machine learning of such models provides insight into the underlying domain and a means of predicting the latent factors. However, it is challenging to search over the exponentially many candidate models, and existing algorithms are unable to scale to large amounts of data.
The goal of this CAREER award is to provide novel algorithms that circumvent this computational intractability. Based on a classical idea in statistics called the method of moments, the new algorithms will be applied in bioinformatics to discover regulatory modules from disease expression profiles, and in health care to predict a patient's clinical state using data from their electronic medical record. A key component of the project is to involve high school students from disadvantaged backgrounds in the research to inspire them to pursue STEM careers.
The project advances machine learning by introducing several new techniques for unsupervised and semi-supervised learning of Bayesian networks. The project overcomes the computational challenges associated with maximum-likelihood estimation by developing new method-of-moments-based algorithms for learning latent variable models, focusing on settings where inference itself may be intractable. This includes Bayesian networks of discrete variables in which a top layer consists of latent factors and a bottom layer consists of the observed data, a form of discrete factor analysis. The proposed algorithms run in polynomial time and are guaranteed to learn a close approximation to the true model.
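As a minimal illustration of the method-of-moments idea, and not one of the project's algorithms, consider a hypothetical toy model assumed here purely for exposition: a single binary latent factor Z that is on with probability pi, and two binary observations X1 and X2 that can each fire with probability a only when Z is on. Under these assumptions E[X1] = pi * a and E[X1 X2] = pi * a^2, so both parameters can be recovered in closed form from empirical moments, without any likelihood optimization. The function names and parameter values below are illustrative only.

```python
import numpy as np


def simulate(n, pi, a, rng):
    """Draw n samples from the assumed toy model (hypothetical, for illustration)."""
    z = rng.random(n) < pi            # latent factor, never shown to the estimator
    x1 = z & (rng.random(n) < a)      # each observation fires only if z is on
    x2 = z & (rng.random(n) < a)
    return x1.astype(float), x2.astype(float)


def fit_moments(x1, x2):
    """Recover (pi, a) by matching first- and second-order moments."""
    m1 = 0.5 * (x1.mean() + x2.mean())   # estimates pi * a
    m12 = (x1 * x2).mean()               # estimates pi * a**2
    if m1 == 0.0 or m12 == 0.0:
        raise ValueError("moments are zero; parameters are not identifiable")
    a_hat = m12 / m1                     # (pi * a**2) / (pi * a)
    pi_hat = m1 ** 2 / m12               # (pi * a)**2 / (pi * a**2)
    return pi_hat, a_hat


rng = np.random.default_rng(0)
x1, x2 = simulate(200_000, pi=0.3, a=0.6, rng=rng)
pi_hat, a_hat = fit_moments(x1, x2)
print(f"pi_hat={pi_hat:.3f}, a_hat={a_hat:.3f}")
```

The estimator touches the data only through a few averages, which is why moment-based approaches can scale to large data sets; models with many latent factors require considerably more machinery than this two-parameter sketch.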
The techniques developed as part of this project have the potential to be transformative in the social and natural sciences by enabling the efficient and accurate discovery of latent variables from discrete data. Furthermore, in collaboration with emergency department clinicians, the new algorithms will be applied to learn models relating diseases to symptoms from the noisy and incomplete data that are routinely collected as part of electronic medical records. This will advance the field of machine learning in health care by providing algorithms that generalize across institutions without the need for large amounts of labeled training data.
The insights about exploratory data analysis developed as part of this project will be integrated into an innovative data science curriculum, as part of both an undergraduate class and new Master's classes. The project will bring students from nearby high schools to NYU throughout the academic year and during the summer to learn about machine learning through participation in the proposed research, in which they will use the unsupervised learning algorithms to discover new medical insights. The PI will also develop and deliver tutorials on machine learning to clinicians and the health care industry.