Computational analysis of mass spectrometry (MS) data presents a significant challenge, especially when MS-based proteomic approaches are applied to profile complex protein mixtures such as human serum or tissues. We propose to develop a set of statistical models and algorithms that will enable robust, accurate, and transparent analysis of large-scale quantitative tandem mass spectrometry (MS/MS)-based proteomic datasets from human clinical cancer specimens. To achieve this, we will: 1) develop novel data analysis methods and algorithms for statistical validation of peptide assignments to MS/MS spectra generated using any type of MS instrumentation, experimental sample preparation protocol, and MS/MS database search software; 2) develop an integrated, probability-based informatics approach for assembling peptides into proteins and for inferring the identities of proteins and the changes in their abundance between compared samples, thus increasing the power of the shotgun proteomic approach to identify low-molecular-weight and low-abundance proteins, discriminate between protein isoforms, and detect post-translational processing events; 3) introduce multivariate metrics for assessing the quality of MS/MS data and design iterative computational strategies for reanalysis of unassigned high-quality spectra; and 4) develop statistical models for quantifying error rates in composite databases of peptide and protein identifications collected from different studies, thus enabling accurate cross-laboratory comparison, data mining, and selection of candidates for targeted proteomic profiling of clinical samples. We will integrate these methods and tools into the existing open-source data analysis platform, the Trans-Proteomic Pipeline, and will disseminate the new tools, statistical methodologies, and educational materials to the proteomic community.
The ultimate goal of the proposed computational research is to enable fast and automated generation of high-quality proteomic datasets with accurately determined error rates, thus removing one of the main technical barriers currently plaguing the field of proteomics.