We recently developed several spectral approaches for analyzing very large genomics datasets or complete databases that fall into the category of big data (BD). The first approach uses randomization to perform singular value decomposition (SVD) or principal component analysis (PCA), which can dramatically accelerate the computation of eigenvectors and eigenvalues relative to the standard Lanczos algorithm implemented in common software packages. Computing PCA and the SVD more efficiently could revolutionize the innumerable biomedical applications that rely on them, e.g. population stratification in very large GWAS. These algorithms produce higher accuracy than classical (deterministic) methods, enable the processing of data streams that are too large to store, and parallelize easily on multicore microprocessors.
Our second novel approach is an unsupervised spectral learning method. It provides new mathematical insights of striking conceptual simplicity for ranking multiple competing algorithms without access to validation data, and for optimally combining this ensemble of algorithms to obtain improved predictions in the absence of ground truth. End users and bioinformaticians must often decide which tool to use for a study, even though many of the biological results inferred by different tools disagree. A tool that lets them optimally rank or combine algorithms for the analysis of genomics data is a practical and efficient way to remove this confusion. Choosing the best-performing algorithm or pipeline is essential, as it can often substantially improve the quality of the readout from massively parallel sequencing experiments. Moreover, combining these tools typically yields performance superior to that of the best-performing algorithm.
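The idea of ranking and combining algorithms without validation data can be illustrated with a minimal sketch. It assumes binary classifiers that output ±1 and whose errors are conditionally independent given the true label; under that assumption the off-diagonal entries of the covariance matrix of their predictions are approximately rank one, and the entries of its leading eigenvector are proportional to 2π_i − 1, where π_i is the balanced accuracy of classifier i. The function name `spectral_rank`, the diagonal-refitting loop, and the simulated data are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def spectral_rank(preds, n_iter=50):
    """Rank and combine +/-1 classifiers without labels (illustrative sketch).

    preds: array of shape (n_classifiers, n_samples) with entries in {-1, +1}.
    Assumes conditionally independent errors and that most classifiers
    are better than random guessing.
    """
    R = np.cov(preds)  # rows are classifiers, columns are samples
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
        lam, v = eigvals[-1], eigvecs[:, -1]
        # Only the off-diagonal part is rank one, so overwrite the diagonal
        # with the current rank-one fit and repeat.
        np.fill_diagonal(R, lam * v**2)
    if v.sum() < 0:  # resolve the sign ambiguity of the eigenvector
        v = -v
    ranking = np.argsort(-v)       # classifier indices, best first
    combined = np.sign(v @ preds)  # eigenvector-weighted vote
    return v, ranking, combined

# Simulated example: five independent classifiers of decreasing accuracy.
rng = np.random.default_rng(0)
n_samples = 20000
accuracies = [0.95, 0.85, 0.75, 0.65, 0.55]
y = rng.choice([-1, 1], size=n_samples)  # hidden labels, used only for scoring
preds = np.array([np.where(rng.random(n_samples) < a, y, -y) for a in accuracies])

weights, ranking, combined = spectral_rank(preds)
print("estimated ranking (best first):", ranking)
print("accuracy of combined prediction:", (combined == y).mean())
```

The labels `y` are used only to generate the simulated predictions and to score the result; `spectral_rank` itself never sees them.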
Our goal is to establish a team focused on providing and disseminating full-blown implementations of spectral BD tools and methods that have broad applicability across the spectrum of biomedical sciences, clinical research, and healthcare delivery. Specifically, we will develop scalable PCA and SVD for genomics and biomedical applications, and further advance our spectral method for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data. Moreover, we will develop scalable dimensionality reduction techniques for organizing BD from biomedical applications.
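For readers unfamiliar with randomized SVD, the following is a minimal sketch of the general technique (random projection, a few power iterations, then an exact SVD of a small matrix). It is not the project's software; the function name `randomized_svd` and the parameters `oversample` and `n_iter` are illustrative assumptions. For PCA, the columns of the data matrix would be centered before calling it.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of A via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = min(k + oversample, n)  # sketch width, slightly larger than k
    # Project A onto a random subspace to capture its dominant column space.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, l)))
    # Power iterations sharpen the estimate when singular values decay slowly;
    # re-orthonormalizing at each step keeps the computation stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small (l x n) matrix B = Q^T A.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Example on a synthetic matrix of exact rank 5.
rng = np.random.default_rng(1)
A_demo = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
U, s, Vt = randomized_svd(A_demo, k=5)
s_exact = np.linalg.svd(A_demo, compute_uv=False)[:5]
print("max singular value error:", np.abs(s - s_exact).max())
```

The only full-size operations are matrix products with `A`, which is what makes this family of methods attractive for data that is too large for a dense deterministic factorization.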

Public Health Relevance

This research plan aims to develop scalable spectral computational tools suitable for analyzing very large genomics datasets or repositories of big biomedical data. Specifically, our team, which is led by a computational biologist and three prominent applied mathematicians, will develop scalable dimensionality reduction techniques, which are essential for exploring big genomics or biomedical data, as well as methods for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG008383-02
Application #: 9278252
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Wellington, Christopher
Project Start: 2016-05-24
Project End: 2019-04-30
Budget Start: 2017-05-01
Budget End: 2018-04-30
Support Year: 2
Fiscal Year: 2017
Total Cost: $368,999
Indirect Cost: $129,119

Name: Yale University
Department: Microbiology/Immun/Virology
Type: Schools of Medicine
DUNS #: 043207562
City: New Haven
State: CT
Country: United States
Zip Code: 06520

Mishne, Gal; Talmon, Ronen; Cohen, Israel et al. (2018) Data-Driven Tree Transforms and Metrics. IEEE Trans Signal Inf Process Netw 4:451-466
Katzman, Jared L; Shaham, Uri; Cloninger, Alexander et al. (2018) DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 18:24
Stanton, Kelly P; Jin, Jiaqi; Lederman, Roy R et al. (2017) Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling. Nucleic Acids Res 45:e173
Shaham, Uri; Stanton, Kelly P; Zhao, Jun et al. (2017) Removal of batch effects using distribution-matching residual networks. Bioinformatics 33:2539-2546
Li, Huamin; Shaham, Uri; Stanton, Kelly P et al. (2017) Gating mass cytometry data by deep learning. Bioinformatics 33:3423-3430
Li, Huamin; Linderman, George C; Szlam, Arthur et al. (2017) Algorithm 971: An Implementation of a Randomized Algorithm for Principal Component Analysis. ACM Trans Math Softw 43:
Jiang, Tingting; Raviram, Ramya; Snetkova, Valentina et al. (2016) Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions. Nucleic Acids Res 44:8714-8725