Proteogenomics studies require the integration of mass spectrometry (MS) data for proteomics with next-generation sequencing (NGS) data for genomics. This integration drastically increases the size of the data sets that must be analyzed to draw biological conclusions, yet existing tools yield low accuracy and exhibit poor scalability for big proteogenomics data. This CAREER grant is expected to lay a foundation for fast algorithmic and high-performance computing solutions suitable for analyzing big proteogenomics data sets. Accurate computational algorithms suitable for peta-scale data sets will be designed, and the software implementations will run on massively parallel supercomputers and graphics processing units. The proposal is directed toward designing and building infrastructure useful to the broader biological and ecological community. A comprehensive interdisciplinary education plan will be carried out for K-12, undergraduate, and graduate students to help ensure that the US retains its global leadership position in STEM fields. This project thus serves the national interest as stated in NSF's mission: to promote the progress of science and to advance the national health, prosperity, and welfare.
The goal of the proposed CAREER grant is to design and develop algorithmic and high-performance computing (HPC) foundations for practical sublinear and parallel algorithms for big proteogenomics data, especially for non-model organisms with previously unsequenced or partially sequenced genomes. The integration of MS and NGS data sets required for proteogenomics studies exhibits enormous data volume and velocity: NGS technologies such as ChIP-Seq can generate terabytes of DNA/RNA data, and mass spectrometers can generate millions of spectra with thousands of peaks per spectrum. Current systems for analyzing MS data are driven mainly by heuristic practices and do not scale well. This CAREER proposal will explore a new class of reductive algorithms for MS data analysis that allow peptide deduction in sublinear time, compression algorithms that operate in sublinear space, and de novo algorithms that operate on lossy, reduced forms of the MS data. Novel low-complexity sampling and reductive algorithms that exploit the sparsity of MS data, such as non-uniform FFT-based convolution kernels, can lead to superior similarity metrics that are less prone to spurious correlations. The bottleneck in large systems-biology studies is the low scalability of coarse-grained parallel algorithms that do not exploit MS-specific data characteristics and that lead to unbalanced loads because of the non-uniform compute time required for peptide deduction. This project therefore aims to design and implement scalable algorithms for both NGS and MS data on multicore and GPU platforms using domain-decomposition techniques based on spectral clustering, MS-specific hybrid load balancing based on workload estimates, HPC dimensionality-reduction strategies, and novel out-of-core sketching and streaming fine-grained parallel algorithms. These HPC solutions can enable previously impractical proteogenomics projects and allow biologists to perform computational experiments without expensive hardware. All implemented algorithms will be released as open-source code interfaced with the Galaxy framework to ensure maximum impact in systems-biology labs. The designed techniques will then be integrated so that spectra can be matched to RNA-Seq data without a reconstructed transcriptome. The proposed tools aim to reveal new biological insights, such as novel genes, proteins, and post-translational modifications (PTMs), and are a crucial step toward understanding the genomic, proteomic, and evolutionary aspects of species across the tree of life.
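To illustrate the kind of similarity computation the reductive algorithms target, the sketch below scores two sparse spectra by cross-correlation computed in the Fourier domain, where the convolution theorem reduces the quadratic correlation to O(n log n). This is a minimal, uniform-grid simplification of the FFT-based convolution idea (the proposed non-uniform FFT kernels would avoid the binning step entirely); the bin width, mass range, and background-lag window are illustrative assumptions rather than parameters of the proposed method.

import numpy as np

def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Place sparse (m/z, intensity) peaks onto a uniform m/z grid."""
    grid = np.zeros(int(max_mz / bin_width))
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if idx < grid.size:
            grid[idx] += intensity
    return grid

def xcorr_similarity(peaks_a, peaks_b, bin_width=1.0, max_mz=2000.0):
    """Cross-correlation similarity between two spectra, computed via FFT
    so that the O(n^2) correlation costs O(n log n)."""
    a = bin_spectrum(peaks_a, bin_width, max_mz)
    b = bin_spectrum(peaks_b, bin_width, max_mz)
    a /= np.linalg.norm(a) + 1e-12   # guard against empty spectra
    b /= np.linalg.norm(b) + 1e-12
    corr = np.fft.irfft(np.fft.rfft(a) * np.conj(np.fft.rfft(b)), n=a.size)
    # Zero-lag score minus the mean over nearby lags suppresses the kind of
    # spurious background correlation mentioned above.
    background = np.concatenate([corr[1:76], corr[-75:]])
    return float(corr[0] - background.mean())

# Example: two spectra sharing two of their three peaks.
spec1 = [(175.1, 50.0), (262.2, 80.0), (375.3, 40.0)]
spec2 = [(175.1, 45.0), (262.2, 90.0), (488.3, 30.0)]
print(xcorr_similarity(spec1, spec2))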
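Similarly, the following sketch shows one way a workload-estimate-driven load balancer could assign spectra to workers, using a greedy longest-processing-time heuristic so that per-worker estimated compute time is roughly equal. The peak-count cost model is a hypothetical stand-in for the MS-specific workload estimators the project will develop, not the proposed method itself.

import heapq

def estimated_cost(spectrum):
    """Hypothetical workload model: assume deduction time grows faster than
    linearly with the number of peaks; a real estimator could instead use
    the number of candidate peptides for the precursor mass."""
    return len(spectrum["peaks"]) ** 1.5

def balance_spectra(spectra, num_workers):
    """Greedy longest-processing-time (LPT) assignment: give each spectrum,
    heaviest first, to the worker with the smallest accumulated estimate."""
    loads = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(num_workers)]
    for spec in sorted(spectra, key=estimated_cost, reverse=True):
        load, worker = heapq.heappop(loads)
        assignment[worker].append(spec)
        heapq.heappush(loads, (load + estimated_cost(spec), worker))
    return assignment

# Example with toy spectra of varying peak counts.
toy = [{"id": i, "peaks": list(range(n))} for i, n in enumerate([5, 300, 40, 1200, 80])]
parts = balance_spectra(toy, num_workers=2)
print([[s["id"] for s in p] for p in parts])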