BIGDATA: DA: Interpreting massive genomic data sets via summarization

Noble, William

Abstract

Genomic data is big and getting ever bigger, but current analysis methods will not scale to the analysis of thousands or millions of genomes. Consequently, a critical technical challenge is to develop new methods that can analyze these enormous data sets. In this proposal, we describe a new computational framework for drawing inferences from massive genomic data sets. Our approach leverages submodular summarization methods that have been developed for analyzing text corpora. We will apply these methods to five big data problems in genomics: 1) identifying functional elements characteristic of a given human cell type; 2) identifying genomic features associated with a particular subclass of cancer; 3-4) identifying genomic variants representative of ancestrally or phenotypically defined human populations; and 5) finding a set of microbial genes that characterize a given site on the human body. This project will advance discovery and understanding on two fronts. First, we will develop novel methods for summarizing genomic, epigenomic and metagenomic data sets. Indeed, to our knowledge, this grant proposes the first application of summarization methods to genomic data of any kind. The proposed research will significantly advance our ability to apply submodularity to these summarization tasks, particularly with respect to identifying and creating a library of distance functions that have been validated with respect to the five tasks outlined in the proposal. Second, we will apply our novel methods to problems of profound importance. Indeed, significant progress toward any one of our five tasks would represent an important advance in our scientific understanding of human history, biology or disease. The impact of this project will grow as the big data problem grows, even after the project is complete. The results of this project, both the software that we develop and the summaries that we produce, will be useful for answering a wide array of questions in any field that must cope with big data.
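
To make the idea of submodular summarization concrete, the following is a minimal sketch of greedy selection under a facility-location objective, a commonly used monotone submodular function. It is an illustration only and is not taken from the proposal: the function names, the toy one-dimensional data, and the conversion from a distance to a similarity via 1 / (1 + d) are all assumptions. The sketch assumes non-negative similarities.

```python
from typing import List, Sequence


def facility_location_value(similarity: Sequence[Sequence[float]],
                            selected: Sequence[int]) -> float:
    """f(S) = sum over items i of the best similarity between i and any j in S."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)


def greedy_summary(similarity: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick up to k representative items (hypothetical helper).

    Assumes similarity[i][j] >= 0. For a monotone submodular objective such
    as facility location, greedy selection carries the classic (1 - 1/e)
    approximation guarantee.
    """
    n = len(similarity)
    selected: List[int] = []
    # best_cover[i] is how well item i is represented by the current summary.
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0)
                       for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j])
                      for i in range(n)]
    return selected


# Toy data (assumed): two clusters of points on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
# Assumed distance-to-similarity conversion: 1 / (1 + |a - b|).
sim = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
summary = greedy_summary(sim, 2)
```

On this toy input the two selected indices fall in different clusters, which is the behavior one would want from a summary: each chosen item stands in for a group of similar items. In a genomic setting the similarity matrix would come from a task-specific distance function, which is the component the proposal describes validating.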

Public Health Relevance

Rapid advances in DNA sequencing technology have led to an explosion of genomic data. This data contains valuable knowledge about human biology and human disease, but few existing computational methods are designed to scale to the joint analysis of tens of thousands of human genomes. This proposal adapts and extends recent advances from the field of natural language processing to characterize cancer subtypes, discover genetic variants associated with disease and characterize human microbial populations.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Research Project (R01)
Project #: 5R01CA180777-02
Application #: 8642168
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Li, Jerry

Project Start: 2013-04-01
Project End: 2016-03-31
Budget Start: 2014-04-01
Budget End: 2015-03-31
Support Year: 2
Fiscal Year: 2014

Institution

Name: University of Washington
Department: Genetics
Type: Schools of Medicine

City: Seattle
State: WA
Country: United States
Zip Code: 98195

Related projects


NIH 2015 R01 CA	BIGDATA: DA: Interpreting massive genomic data sets via summarization Noble, William Stafford / University of Washington
NIH 2014 R01 CA	BIGDATA: DA: Interpreting massive genomic data sets via summarization Noble, William Stafford / University of Washington
NIH 2013 R01 CA	BIGDATA: DA: Interpreting massive genomic data sets via summarization Noble, William Stafford / University of Washington	$214,832

Publications

Libbrecht, Maxwell W; Bilmes, Jeffrey A; Noble, William Stafford (2018) Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization. Proteins 86:454-466

Wei, Kai; Libbrecht, Maxwell W; Bilmes, Jeffrey A et al. (2016) Choosing panels of genomics assays using submodular optimization. Genome Biol 17:229

Libbrecht, Maxwell W; Noble, William Stafford (2015) Machine learning applications in genetics and genomics. Nat Rev Genet 16:321-32
