Analytical Approaches to Massive Data Computation with Applications to Genomics

Upfal, Eliezer; Raphael, Benjamin

Abstract

We propose to design and test mathematically well-founded algorithmic and statistical techniques for analyzing large-scale, heterogeneous, and noisy data. We focus on fully analytical evaluation of algorithms' performance and rigorous statistical guarantees on the analysis results. This project will leverage the PIs' recent work on cancer genomics data analysis and rigorous data mining techniques. That work was driven by specific applications, whereas in the current project we aim to develop general principles and techniques that apply to a broad set of applications. The proposed research is transformative in its emphasis on rigorous analytical evaluation of algorithms' performance and statistical measures of output uncertainty, in contrast to the primarily heuristic approaches currently used in data mining and machine learning. While we cannot expect a full mathematical analysis of all data mining and machine learning techniques, any progress in that direction will contribute significantly to the reliability and scientific impact of this discipline. Although our work is motivated by molecular biology data, we expect the techniques to be useful for other scientific communities with massive multivariate data analysis challenges. Molecular biology provides an excellent source of data for testing advanced data analysis techniques: specifically, DNA/RNA sequence data repositories are growing at a super-exponential rate. The data are typically large and noisy, and they include both genotype and phenotype features that permit experimental validation of the analysis. One such data repository is The Cancer Genome Atlas (TCGA), which we will use for initial testing of the proposed approaches.

Public Health Relevance

This project will advocate a responsible approach to data analysis, based on well-founded mathematical and statistical concepts. Such an approach enhances the effectiveness of evidence-based medicine and other policy and social applications of big data analysis. The proposed work will be tested on human and cancer genome data, contributing to health IT, one of the National Priority Domain Areas.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Research Project (R01)
Project #: 4R01CA180776-04
Application #: 9015770
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Li, Jerry

Project Start: 2013-06-18
Project End: 2017-03-31
Budget Start: 2016-04-01
Budget End: 2017-03-31
Support Year: 4
Fiscal Year: 2016
Total Cost
Indirect Cost

Institution

Name: Brown University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 001785542

City: Providence
State: RI
Country: United States
Zip Code

Related projects


NIH 2016 R01 CA	Analytical Approaches to Massive Data Computation with Applications to Genomics Upfal, Eliezer; Raphael, Benjamin / Brown University
NIH 2015 R01 CA	Analytical Approaches to Massive Data Computation with Applications to Genomics Upfal, Eliezer; Raphael, Benjamin / Brown University
NIH 2014 R01 CA	Analytical Approaches to Massive Data Computation with Applications to Genomics Upfal, Eliezer; Raphael, Benjamin / Brown University	$69,189
NIH 2013 R01 CA	Analytical Approaches to Massive Data Computation with Applications to Genomics Upfal, Eliezer; Raphael, Benjamin / Brown University	$71,329

Publications

El-Kebir, Mohammed; Satas, Gryte; Raphael, Benjamin J (2018) Inferring parsimonious migration histories for metastatic cancers. Nat Genet 50:718-726

Cancer Genome Atlas Research Network (2017) Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma. Cancer Cell 32:185-203.e13

Nakka, Priyanka; Archer, Natalie P; Xu, Heng et al. (2017) Novel Gene and Network Associations Found for Acute Lymphoblastic Leukemia Using Case-Control and Family-Based Studies in Multiethnic Populations. Cancer Epidemiol Biomarkers Prev 26:1531-1539

Leiserson, Mark D M; Reyna, Matthew A; Raphael, Benjamin J (2016) A weighted exact test for mutually exclusive mutations in cancer. Bioinformatics 32:i736-i745

Vandin, Fabio; Raphael, Benjamin J; Upfal, Eli (2016) On the Sample Complexity of Cancer Pathways Identification. J Comput Biol 23:30-41

El-Kebir, Mohammed; Satas, Gryte; Oesper, Layla et al. (2016) Inferring the Mutational History of a Tumor Using Multi-state Perfect Phylogeny Mixtures. Cell Syst 3:43-53

Nakka, Priyanka; Raphael, Benjamin J; Ramachandran, Sohini (2016) Gene and Network Analysis of Common Variants Reveals Novel Associations in Multiple Complex Diseases. Genetics 204:783-798

Leiserson, Mark D M; Vandin, Fabio; Wu, Hsin-Ta et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet 47:106-14

Leiserson, Mark D M; Wu, Hsin-Ta; Vandin, Fabio et al. (2015) CoMEt: a statistical approach to identify combinations of mutually exclusive alterations in cancer. Genome Biol 16:160

Raphael, Benjamin J; Vandin, Fabio (2015) Simultaneous inference of cancer pathways and tumor progression from cross-sectional mutation data. J Comput Biol 22:510-27

Showing the most recent 10 out of 14 publications
