Efficient Spectral Approaches for Finding Underlying Structures in Big Data

Kluger, Yuval

Abstract

We recently developed several spectral approaches for analyzing very large genomics datasets or complete databases that fall into the category of big data (BD). The first approach performs the SVD or PCA using randomization, which can dramatically accelerate the computation of eigenvectors and eigenvalues relative to the standard Lanczos algorithm implemented in all common software packages. Computing PCA and the SVD more efficiently could revolutionize the innumerable biomedical applications based on them, e.g. population stratification in very large GWAS. These algorithms produce higher accuracy than classical (deterministic) methods, enable the processing of data streams that are too large to store, and parallelize easily on multicore microprocessors. Our second novel approach is an unsupervised spectral learning method. It provides new mathematical insights of striking conceptual simplicity for ranking multiple competing algorithms without access to validation data and for optimally combining this ensemble of algorithms to obtain improved predictions in the absence of ground truth. A tool that lets end users optimally rank or combine algorithms for the analysis of genomics data is a practical and efficient way to reduce confusion among end users and bioinformaticians who must decide which tool to use for their study, since the biological results inferred by different tools are often in disagreement. Choosing the best-performing algorithm or pipeline is essential, as it can often substantially improve the quality of the readout from massively parallel sequencing experiments. Moreover, combining these tools typically yields performance superior to that of the best-performing algorithm.
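To make the idea of a randomized SVD concrete, the following is a minimal sketch of a generic randomized range finder with a few power iterations, written with NumPy. It is not the implementation developed under this grant; the function name `randomized_svd` and the parameters `oversample` and `n_iter` are illustrative assumptions.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of A via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = min(k + oversample, m, n)          # sketch size, slightly larger than k
    Omega = rng.standard_normal((n, l))    # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)         # orthonormal basis for the sampled range
    for _ in range(n_iter):                # power iterations sharpen the basis
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                            # small l x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                             # lift left singular vectors back
    return U[:, :k], s[:k], Vt[:k]

# Toy data (hypothetical): an exactly rank-5 matrix of 200 samples by 50 features.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_svd(A, k=5)

# For PCA, the same routine would be applied to the column-centered data.
Xc = A - A.mean(axis=0)
_, s_pca, components = randomized_svd(Xc, k=5)
```

The expensive dense SVD is applied only to the small matrix `B`, and the data matrix is touched only through matrix products, which is the property that makes this family of methods suited to large or streamed data.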
Our goal is to establish a team focused on providing and disseminating full-blown implementations of spectral BD tools and methods that have broad applicability across the spectrum of biomedical sciences, clinical research, and healthcare delivery. Specifically, we will develop scalable PCA and SVD for genomics and biomedical applications, and further advance our spectral method for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data. Moreover, we will develop scalable dimensionality reduction techniques for organizing BD from biomedical applications.
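The abstract does not spell out how ranking without validation data works, so the following is only a sketch of one way a spectral approach can do it, under assumptions not stated in the source: binary classifiers with outputs in {-1, +1} whose errors are conditionally independent given the true label. In that setting the off-diagonal entries of the covariance matrix of the predictions are approximately rank one, and the entries of the leading eigenvector track the classifiers' balanced accuracies. The names `spectral_rank` and `spectral_ensemble`, and the simulated data, are hypothetical.

```python
import numpy as np

def spectral_rank(preds, n_iter=30):
    """Score classifiers from unlabeled predictions.

    preds: (m, n) array of +/-1 predictions from m classifiers on n samples.
    Returns a vector whose larger entries indicate better classifiers.
    """
    R = np.cov(preds)                      # m x m sample covariance
    for _ in range(n_iter):
        w, V = np.linalg.eigh(R)           # eigenvalues in ascending order
        lam, v = w[-1], V[:, -1]           # leading eigenpair
        # The diagonal does not follow the rank-one pattern, so re-estimate it.
        np.fill_diagonal(R, lam * v**2)
    if v.sum() < 0:                        # assume most classifiers beat chance
        v = -v
    return v

def spectral_ensemble(preds, v):
    """Weighted vote using the spectral scores as weights."""
    return np.sign(v @ preds)

# Simulated example: labels are used only to generate and check the toy data.
rng = np.random.default_rng(0)
n = 20000
y = rng.choice([-1, 1], size=n)
acc = np.array([0.85, 0.80, 0.75, 0.70, 0.65])
correct = rng.random((len(acc), n)) < acc[:, None]
preds = np.where(correct, y, -y)           # each classifier errs independently

v = spectral_rank(preds)
ranking = np.argsort(-v)                   # best classifier first
combined = spectral_ensemble(preds, v)
```

The ranking and the ensemble weights are computed from `preds` alone; the labels `y` enter only when simulating the classifiers and when checking the result afterwards.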

Public Health Relevance

This research plan aims to develop scalable spectral computational tools suitable for analyzing very large genomics datasets or repositories of big biomedical data. Specifically, our team, which is led by a computational biologist and three prominent applied mathematicians, will develop scalable dimensionality reduction techniques that are essential for exploring big genomics or biomedical data, as well as methods for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG008383-01A1
Application #: 9029445
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Wellington, Christopher

Project Start: 2016-05-24
Project End: 2019-04-30
Budget Start: 2016-05-24
Budget End: 2017-04-30
Support Year: 1
Fiscal Year: 2016

Institution

Name: Yale University
Department: Microbiology/Immun/Virology
Type: Schools of Medicine
DUNS #: 043207562

City: New Haven
State: CT
Country: United States

Related projects


NIH 2018 R01 HG	Efficient Spectral Approaches for Finding Underlying Structures in Big Data Kluger, Yuval / Yale University
NIH 2017 R01 HG	Efficient Spectral Approaches for Finding Underlying Structures in Big Data Kluger, Yuval / Yale University	$368,999
NIH 2016 R01 HG	Efficient Spectral Approaches for Finding Underlying Structures in Big Data Kluger, Yuval / Yale University

Publications

Mishne, Gal; Talmon, Ronen; Cohen, Israel et al. (2018) Data-Driven Tree Transforms and Metrics. IEEE Trans Signal Inf Process Netw 4:451-466

Katzman, Jared L; Shaham, Uri; Cloninger, Alexander et al. (2018) DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 18:24

Li, Huamin; Linderman, George C; Szlam, Arthur et al. (2017) Algorithm 971: An Implementation of a Randomized Algorithm for Principal Component Analysis. ACM Trans Math Softw 43:

Stanton, Kelly P; Jin, Jiaqi; Lederman, Roy R et al. (2017) Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling. Nucleic Acids Res 45:e173

Shaham, Uri; Stanton, Kelly P; Zhao, Jun et al. (2017) Removal of batch effects using distribution-matching residual networks. Bioinformatics 33:2539-2546

Li, Huamin; Shaham, Uri; Stanton, Kelly P et al. (2017) Gating mass cytometry data by deep learning. Bioinformatics 33:3423-3430

Jiang, Tingting; Raviram, Ramya; Snetkova, Valentina et al. (2016) Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions. Nucleic Acids Res 44:8714-8725
