We recently developed several spectral approaches for analyzing very large genomics datasets or complete databases that fall into the category of big data (BD). The first approach performs the SVD or PCA using randomization, which can dramatically accelerate the computation of eigenvectors and eigenvalues relative to the standard Lanczos algorithm implemented in all common software packages. Computing PCA and the SVD more efficiently could revolutionize the innumerable biomedical applications that rely on them, e.g., population stratification in very large GWAS. These randomized algorithms produce higher accuracy than classical (deterministic) methods, enable the processing of data streams that are too large to store, and parallelize easily on multicore processors. Our second novel approach is an unsupervised spectral learning method. It provides new mathematical insights of striking conceptual simplicity for ranking multiple competing algorithms without access to validation data, and for optimally combining this ensemble of algorithms to obtain improved predictions in the absence of ground truth. A tool that lets end users optimally rank or combine algorithms for the analysis of genomics data is a practical and efficient way to resolve the confusion faced by end users and bioinformaticians who must decide which tool to use for their study, since the biological results inferred by different tools often disagree. Choosing the best-performing algorithm or pipeline is essential, as it can often substantially improve the quality of the readout from massively parallel sequencing experiments. Moreover, combining these tools typically yields performance superior to that of the best-performing individual algorithm.
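To make the first approach more concrete, the following is a minimal sketch of a generic randomized SVD (random range finder followed by a few power iterations), written with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `randomized_svd` and the parameters `oversample` and `n_iter` are hypothetical choices made for this example.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k singular triplets of A (hypothetical helper).

    Sketch of the standard randomized range-finder scheme:
    1. multiply A by a random Gaussian test matrix,
    2. orthonormalize to get a basis Q for the approximate range of A,
    3. take an exact SVD of the small matrix Q^T A.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sample slightly more than k directions to improve accuracy.
    l = min(k + oversample, n)
    Omega = rng.standard_normal((n, l))

    # Orthonormal basis for the sampled range of A.
    Q, _ = np.linalg.qr(A @ Omega)

    # Power iterations sharpen the basis when singular values decay slowly;
    # re-orthonormalizing at each step keeps the computation stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Project A onto the basis and decompose the small (l x n) matrix.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Usage: PCA of a data matrix X (samples in rows) via the centered matrix.
X = np.random.default_rng(1).standard_normal((500, 60))
Xc = X - X.mean(axis=0)
U, s, Vt = randomized_svd(Xc, k=5)
components = Vt                      # principal directions (rows)
explained_variance = s**2 / (X.shape[0] - 1)
```

The only passes over the data are the products with `A` and `A.T`, which is why schemes of this kind are attractive when the matrix is too large to hold in memory and must be read in blocks.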
Our goal is to establish a team focused on providing and disseminating full-blown implementations of spectral BD tools and methods that have broad applicability across the biomedical sciences, clinical research, and healthcare delivery. Specifically, we will develop scalable PCA and SVD for genomics and biomedical applications, and further advance our spectral method for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data. Moreover, we will develop scalable dimensionality reduction techniques for organizing BD from biomedical applications.
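The ranking-and-combining idea can be illustrated with a small sketch. It assumes a simplified setting that the text above does not spell out: binary classifiers with outputs in {-1, +1} whose errors are independent given the true class. Under that assumption, the off-diagonal entries of the covariance matrix of the predictions are approximately rank one, and the entries of the leading eigenvector order the classifiers by accuracy. The function names `spectral_scores` and `spectral_combine` are hypothetical, and the diagonal-imputation loop is one simple way to estimate the rank-one part, not necessarily the authors' procedure.

```python
import numpy as np

def spectral_scores(preds, n_iter=50):
    """Score classifiers without labels (hypothetical helper).

    preds: array of shape (n_classifiers, n_samples) with entries -1 or +1.
    Returns one score per classifier; a larger score suggests higher accuracy.
    """
    Q = np.cov(preds)          # rows are classifiers, columns are samples
    R = Q.copy()
    for _ in range(n_iter):
        # Leading eigenpair of the current estimate (eigh sorts ascending).
        w, V = np.linalg.eigh(R)
        v = V[:, -1] * np.sqrt(max(w[-1], 0.0))
        # Only off-diagonal entries follow the rank-one model, so replace
        # the diagonal with the values the rank-one fit implies.
        np.fill_diagonal(R, v**2)
    # Assumption: most classifiers are better than random, which fixes
    # the arbitrary sign of the eigenvector.
    if v.sum() < 0:
        v = -v
    return v

def spectral_combine(preds, v):
    """Weighted vote using the spectral scores as weights."""
    return np.where(v @ preds >= 0, 1, -1)

# Usage on simulated data; the labels y are used only to simulate the
# classifiers and to check the result, never by the method itself.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=20000)
accuracies = [0.9, 0.8, 0.7, 0.6]
preds = np.array([np.where(rng.random(y.size) < a, y, -y) for a in accuracies])

v = spectral_scores(preds)
ranking = np.argsort(-v)             # best classifier first
combined = spectral_combine(preds, v)
```

The independence assumption is what makes the rank-one structure hold. With strongly correlated pipelines the scores may be misleading, so a sketch like this is a starting point rather than a validated tool.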
This research plan aims to develop scalable spectral computational tools suitable for analyzing very large genomics datasets or repositories of big biomedical data. Specifically, our team, which is led by a computational biologist and three prominent applied mathematicians, will develop scalable dimensionality reduction techniques that are essential for exploring big genomics or biomedical data, as well as tools for ranking the performance of competing pipelines and combining them to achieve better predictions without access to validation data.
Mishne, Gal; Talmon, Ronen; Cohen, Israel et al. (2018) Data-Driven Tree Transforms and Metrics. IEEE Trans Signal Inf Process Netw 4:451-466
Katzman, Jared L; Shaham, Uri; Cloninger, Alexander et al. (2018) DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 18:24
Stanton, Kelly P; Jin, Jiaqi; Lederman, Roy R et al. (2017) Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling. Nucleic Acids Res 45:e173
Shaham, Uri; Stanton, Kelly P; Zhao, Jun et al. (2017) Removal of batch effects using distribution-matching residual networks. Bioinformatics 33:2539-2546
Li, Huamin; Shaham, Uri; Stanton, Kelly P et al. (2017) Gating mass cytometry data by deep learning. Bioinformatics 33:3423-3430
Li, Huamin; Linderman, George C; Szlam, Arthur et al. (2017) Algorithm 971: An Implementation of a Randomized Algorithm for Principal Component Analysis. ACM Trans Math Softw 43:
Jiang, Tingting; Raviram, Ramya; Snetkova, Valentina et al. (2016) Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions. Nucleic Acids Res 44:8714-8725