The goal of the proposed project is to develop core algorithms, techniques and software libraries that enable scalable, efficient and parallel computing solutions for mass spectrometry (MS) based high-throughput proteomics data sets. To empower the larger proteomics community and experimental biologists, the project seeks to (1) identify a set of core methods that are frequently used by proteomics practitioners; (2) develop efficient and scalable parallel algorithms and implementations for these methods; (3) pursue mapping of these parallel computing techniques to a wide variety of architectures such as multicores, manycores, distributed clusters, GPUs and FPGAs; (4) design and implement big data analytic techniques that can be used in our HPC implementations as well as by other researchers in sequential and/or parallel algorithms; and (5) design interfaces using the Galaxy framework for these parallel programs so that they can be used by non-experts and people who are not familiar with parallel processing. The research will be conducted in collaboration with domain experts in systems biology and proteomics. The specific problems that will be targeted are parallel algorithms for clustering MS data sets, parallel algorithms for identifying peptides from these MS data sets by database search on multicore processors and GPUs, and high-performance algorithms that can make sense of these MS data sets in a de novo fashion, without the need for a database. The parallel algorithms will be tested using simulated as well as real experimental data sets and will be available for free academic use.

Public Health Relevance

Analysis of high-throughput proteomics data is an essential task in experimental and computational biology. High-throughput mass spectrometers generate thousands of spectra from a single experimental run, and data sets can scale up to a billion spectra and petabyte-level volumes. The data produced by these high-throughput techniques are so large that conventional techniques, no matter how good, will never be able to keep up with the rate at which these data sets are generated. Existing proteomics data analysis solutions are limited in their capability and yield poor performance for large data sets. The goal of the proposed project is to develop core algorithms, techniques and software libraries to enable scalable, efficient and parallel computing solutions for mass spectrometry based high-throughput proteomics. Since the proposed algorithms will exploit ubiquitous, low-cost multicore, manycore and GPU architectures, the proposed research will have a significant impact in experimental biology research labs. Interfacing of the proposed algorithms with frameworks like Galaxy that are familiar to experimental biologists will be pursued. This will allow the widest dissemination of the high-performance algorithms to experimental biology labs. Further, this will have a significant impact for domain proteomics scientists, since with HPC algorithms these scientists can perform much more complex and accurate analyses than was previously possible. The efficiency and portability of our proposed techniques will have seminal impact in precision and personalized medicine.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Academic Research Enhancement Awards (AREA) (R15)
Project #
1R15GM120820-01A1
Application #
9301702
Study Section
Special Emphasis Panel (ZRG1-BST-W (80)A)
Program Officer
Ravichandran, Veerasamy
Project Start
2017-04-01
Project End
2020-03-31
Budget Start
2017-04-01
Budget End
2020-03-31
Support Year
1
Fiscal Year
2017
Total Cost
$418,533
Indirect Cost
$119,178
Name
Western Michigan University
Department
Biostatistics & Other Math Sci
Type
Schools of Engineering
DUNS #
622364479
City
Kalamazoo
State
MI
Country
United States
Zip Code
49008