Parallel Algorithms for Big Data from Mass Spectrometry based Proteomics

Saeed, Fahad

Abstract

The goal of the proposed project is to develop core algorithms, techniques and software libraries to enable scalable, efficient and parallel computing solutions for mass spectrometry (MS) based high-throughput proteomics data sets. To empower the larger proteomics community and experimental biologists, the project seeks to 1) identify a set of core methods that are frequently used by proteomics practitioners; 2) develop efficient and scalable parallel algorithms and implementations for these methods; 3) map these parallel computing techniques to a wide variety of architectures such as multicores, manycores, distributed clusters, GPUs and FPGAs; 4) design and implement big data analytic techniques that can be used in our HPC implementations as well as by other researchers in sequential and/or parallel algorithms; and 5) design interfaces using the Galaxy framework for these parallel programs so that they can be used by non-experts who are not familiar with parallel processing. The research will be conducted in collaboration with domain experts in systems biology and proteomics. The specific problems that will be targeted are parallel algorithms for clustering MS data sets, parallel algorithms for database-based peptide identification from these MS data sets on multicore and GPU architectures, and high performance algorithms that can interpret these MS data sets in a de novo fashion, without the need for a database. The parallel algorithms will be tested using simulated as well as real experimental data sets and will be available for free academic use.

Public Health Relevance

Analysis of high-throughput proteomics data is an essential task in experimental and computational biology. High-throughput mass spectrometers generate thousands of spectra in a single experimental run, and data sets can scale up to a billion spectra and petabytes in size. The data produced by these high-throughput techniques are so large that conventional techniques, however well designed, cannot keep up with the rate at which these data sets are generated. Existing proteomics data analysis solutions are limited in their capability and yield poor performance for large data sets. The goal of the proposed project is to develop core algorithms, techniques and software libraries to enable scalable, efficient and parallel computing solutions for mass spectrometry based high-throughput proteomics. Since the proposed algorithms will exploit ubiquitous, low-cost multicore, manycore and GPU architectures, the proposed research will have a significant impact in experimental biology research labs. Interfacing of the proposed algorithms with frameworks such as Galaxy that are familiar to experimental biologists will be pursued. This will allow the widest dissemination of the high performance algorithms to experimental biology labs. Further, this will have a significant impact for domain proteomics scientists, since with HPC algorithms these scientists can perform much more complex and accurate analyses than was previously possible. The efficiency and portability of our proposed techniques will have seminal impact in precision and personalized medicine.