Genomic data is big and getting ever bigger, but current analysis methods will not scale to the analysis of thousands or millions of genomes. Consequently, a critical technical challenge is to develop new methods that can analyze these enormous data sets. In this proposal, we describe a new computational framework for drawing inferences from massive genomic data sets. Our approach leverages submodular summarization methods that have been developed for analyzing text corpora. We will apply these methods to five big data problems in genomics: 1) identifying functional elements characteristic o f a given human cell type;2) identifying genomic features associated with a particular subclass of cancer;3-4) identifying genomic variants representative of ancestrally or phenotypically defined human populations;and 5) finding a set of microbial genes that characterize a given site on the human body. This project will advance discovery and understanding on two fronts. First, we will develop novel methods for summarizing genomic, epigenomic and metagenomic data sets. Indeed, to our knowledge, this grant proposes the first application of summarization methods to genomic data of any kind. The proposed research will significantly advance our ability to apply submodularity to these summarization tasks, particularly with respect to identifying and creating a library of distance functions that have bee validated with respect to the five tasks outlined in the proposal. Second, we will apply our novel methods to problems of profound importance. Indeed, significant progress toward any one of our five tasks would represent an important advance in our scientific understanding of human history, biology or disease. The impact of this project will grow as the big data problem grows, even after the project is complete. The results of this project, both the software that we develop and the summaries that we produce, will be useful for answering a wide array of questions in any field that must cope with big data.

Public Health Relevance

Rapid advances in DNA sequencing technology have led to an explosion of genomic data. This data contains valuable knowledge about human biology and human disease, but few existing computational methods are designed to scale to the joint analysis of tens of thousands of human genomes. This proposal adapts and extends recent advances from the field of natural language processing to characterize cancer subtvoesdiscover ofinetic variants associated with disease and characterize human microbial populations.

National Institute of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Project #
Application #
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Li, Jerry
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Washington
Schools of Medicine
United States
Zip Code
Libbrecht, Maxwell W; Bilmes, Jeffrey A; Noble, William Stafford (2018) Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization. Proteins 86:454-466
Wei, Kai; Libbrecht, Maxwell W; Bilmes, Jeffrey A et al. (2016) Choosing panels of genomics assays using submodular optimization. Genome Biol 17:229
Libbrecht, Maxwell W; Noble, William Stafford (2015) Machine learning applications in genetics and genomics. Nat Rev Genet 16:321-32