Proteins are key biomolecules involved in virtually all cellular processes, including metabolism, cell signaling, and immune response, and knowledge of protein function is vital to a basic understanding of cellular activity. Due to recent advances in nucleotide sequencing technology, the number of available genomic sequences is doubling roughly every 12 months, a pace that vastly exceeds Moore's law. Experimental technologies for deciphering protein function have not progressed nearly as fast: although the comprehensive UniProt database contains roughly 10 million protein sequences, only 0.2% of them have experimentally validated function annotations. This sequence-function gap is rapidly expanding, and the development of computational methods is of crucial importance to effectively utilize this deluge of sequence data.
In this work, we develop SIFTER, a large-scale systems biology platform to accurately predict protein function from high-throughput data. Building upon a promising phylogenomics-based prototype, we incorporate interaction networks into our model to improve performance. Interaction data intrinsically couples the thousands to millions of proteins within such networks, and we use variational inference and parallelized implementations to address this challenging computational problem. We also explore techniques for function prediction based on low-rank matrix factorization and, along the way, introduce novel sampling-based approaches to speed up computation. Additionally, we develop algorithms to quantify uncertainty in SIFTER's predictions to help guide future experimental work. These algorithms are large-scale extensions of classical bootstrap sampling and are generally applicable to any problem involving massive data. Finally, we evaluate SIFTER in collaboration with experimental biologists, allowing us to pinpoint relevant use cases and resulting in an effective method with widespread impact within the biomedical community.
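To make the bootstrap-based uncertainty quantification above concrete, the sketch below contrasts the classical bootstrap with a subsampled variant in the spirit of such large-scale extensions: small subsets are resampled up to the full data size and their error estimates averaged, so each subset can be handled by a separate worker. This is an illustrative sketch only, not SIFTER's actual algorithm; the function names and parameters are my own.

```python
import random
import statistics

def bootstrap_stderr(data, estimator, n_boot=200, rng=None):
    """Classical bootstrap: resample the full dataset with replacement
    and report the spread of the estimator across resamples."""
    rng = rng or random.Random(0)
    estimates = [estimator(rng.choices(data, k=len(data))) for _ in range(n_boot)]
    return statistics.stdev(estimates)

def subsampled_bootstrap_stderr(data, estimator, subset_frac=0.1,
                                n_subsets=5, n_boot=50, rng=None):
    """Subsampled variant: draw small random subsets (each sampled without
    replacement), bootstrap each by resampling up to the FULL data size,
    and average the per-subset error estimates. Each subset can be
    processed independently, which is what makes this style scalable."""
    rng = rng or random.Random(0)
    b = max(2, int(subset_frac * len(data)))
    per_subset = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)
        ests = [estimator(rng.choices(subset, k=len(data))) for _ in range(n_boot)]
        per_subset.append(statistics.stdev(ests))
    return statistics.mean(per_subset)

# Demo: both estimates should approximate the true standard error of the
# mean of n = 1000 standard normal draws, i.e., 1/sqrt(1000), about 0.032.
gen = random.Random(42)
data = [gen.gauss(0.0, 1.0) for _ in range(1000)]
se_full = bootstrap_stderr(data, statistics.mean)
se_sub = subsampled_bootstrap_stderr(data, statistics.mean)
```

The key point of the subsampled variant is that each inner loop touches only a small subset of the data, yet resampling to the full size keeps the error estimate on the correct scale.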
The rise of Big Data poses major new challenges for computational science, as scalability and ease-of-use roadblocks hinder the widespread adoption of advanced statistical machine learning techniques. In terms of scalability, data analyses suitable for modest-sized datasets are often entirely infeasible for the massive and complex datasets being collected via the internet and in domains including genomics, astronomy, physics, and finance. In terms of ease of use, successfully deploying data processing pipelines is currently a highly manual process, fraught with pitfalls for analysts lacking strong statistical backgrounds. My research as part of this NSF postdoctoral fellowship has helped to overcome these scalability and ease-of-use roadblocks. I have developed scalable and theoretically principled machine learning algorithms that are currently state-of-the-art methods for matrix factorization, error bar estimation, and genomic variant calling. I also led the initial development of MLlib, a distributed machine learning library that is part of the Apache Spark project. Additionally, I created SMaSH, the first genomic benchmarking platform to comprehensively evaluate both the accuracy and the computational time and cost of variant calling methods, i.e., methods to process next-generation genomic sequencing data.

My work has the potential to impact various fields in science and technology that generate massive amounts of data. MLlib is easy to use and provides state-of-the-art methods for common predictive analytics use cases, and in less than two years it has become one of the most widely used distributed machine learning libraries in practice. Several companies have recently started to build on top of the Apache Spark system, of which MLlib is a key component.
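As a small illustration of the low-rank matrix factorization technique mentioned above, the following pure-Python sketch factors a partially observed matrix by stochastic gradient descent over its observed entries. It is not the MLlib implementation (MLlib solves this at scale, e.g., via alternating least squares); the toy matrix, function names, and hyperparameters are all illustrative assumptions.

```python
import random

def factorize(R, rank=2, steps=5000, lr=0.0002, reg=0.02, seed=0):
    """Factor a partially observed matrix R (None marks a missing entry)
    into W (n x rank) and H (rank x m) by stochastic gradient descent on
    the squared error over observed entries, with L2 regularization."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    W = [[rng.random() for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(rank)]
    observed = [(i, j) for i in range(n) for j in range(m) if R[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            pred = sum(W[i][k] * H[k][j] for k in range(rank))
            err = R[i][j] - pred
            for k in range(rank):
                w, h = W[i][k], H[k][j]
                W[i][k] += lr * (2 * err * h - reg * w)
                H[k][j] += lr * (2 * err * w - reg * h)
    return W, H

# Toy score matrix (think rows = entities, columns = labels); the None
# entries are the ones the low-rank model is asked to fill in.
R = [[5, 3, None, 1],
     [4, None, None, 1],
     [1, 1, None, 5],
     [1, None, 5, 4]]
W, H = factorize(R)

def predict(i, j, rank=2):
    return sum(W[i][k] * H[k][j] for k in range(rank))
```

After training, `predict(i, j)` reconstructs the observed entries closely and supplies scores for the missing ones, which is the basic mechanism behind factorization-based prediction.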
Moreover, the SMaSH benchmark serves as a prototype for the benchmarking activities of the Global Alliance for Genomics and Health (GA4GH), an international coalition dedicated to improving human health by maximizing the potential of genomic medicine through effective and responsible data sharing. SMaSH has the potential to help shape benchmarking protocols for the field of genomic variant calling, which itself is instrumental to personalized medicine.

Finally, my NSF postdoctoral fellowship has provided opportunities for training and professional development, both for others and for myself. During this fellowship, I co-authored a graduate-level textbook called Foundations of Machine Learning. This book has been widely cited since its publication and has been adopted as the main textbook for several graduate-level classes at universities around the world. Researchers and practitioners are likely to learn about techniques for large-scale data analysis and machine learning via educational material developed directly (e.g., the MLlib documentation and hands-on tutorials I created) or indirectly as a result of our research (e.g., a MOOC on scalable machine learning, based on my work, that I am currently developing). Additionally, I supervised multiple students and researchers during my NSF fellowship and provided guidance on their subsequent employment opportunities. This fellowship also served as excellent training for me and helped me obtain my current academic position: in Fall 2014, I started as an assistant professor in the Computer Science Department at UCLA.