Although peptide identification has been studied intensively, assigning accurate statistical significance to identifications remains challenging. Many peptide identification methods search a database and assign an E-value to each peptide hit, but the E-values reported by different methods do not agree with each other, and none of them agrees with the textbook definition of the E-value. Over the past year, one of our major efforts has been to develop a statistical approach that properly accounts, during data analysis, for proteotypic peptides, that is, peptides that are consistently observed in mass spectrometry-based proteomics experiments. We have shown that proteotypic information improves retrieval performance, provided that it is incorporated into the database with sufficient quality control. We have submitted these results to the Journal of Proteomics for consideration for publication.

A second direction we have embarked on is to use the score statistics of all possible peptides in various applications. In 2008, we showed that it is possible to score trillions of trillions of peptides to form the score histogram of all possible peptides for a given MS spectrum and an additive scoring function. In the past year, we have put this somewhat theoretical result to practical use by re-expressing several well-known scoring functions from computational proteomics in additive form, thereby obtaining unified score statistics for those scoring functions. The main difficulty was learning how each scoring function preprocesses the query spectrum, a critical step that largely determines the final score of each candidate peptide. To do so, we had to examine other analysis programs and extract their heuristic filtering rules.
After accomplishing this daunting task, we built an application that allows (1) combining search results using all-possible-peptide score statistics and (2) reassigning E-values. Our new application, RAId_aPS, is now available on our group website, and the results have been written up and submitted to PLoS One for consideration for publication.
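
The all-possible-peptide idea can be illustrated with a small dynamic program. The sketch below is not the RAId_aPS implementation; it only shows why an additive score lets one build the score histogram of every sequence of a given mass without enumerating the sequences. The four-residue alphabet with nominal integer masses, the `peak_score` lookup, and the function names are all assumptions made for illustration. The `e_value` helper applies the textbook definition: the expected number of random candidates scoring at least as well as the observed hit.

```python
from collections import Counter, defaultdict

# Hypothetical toy alphabet: nominal integer residue masses (Da) for four amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "V": 99}


def score_histogram(target_mass, peak_score):
    """Histogram of additive scores over every residue sequence of a given total mass.

    peak_score (an assumption of this sketch) maps an integer prefix mass to an
    integer score; a sequence's score is the sum over its proper prefix masses.
    """
    # hist[m][s] = number of sequences with total mass m and score s
    hist = defaultdict(Counter)
    hist[0][0] = 1  # the empty sequence
    for m in range(1, target_mass + 1):
        # The full-length mass is not a fragment, so it earns no score.
        gain = peak_score.get(m, 0) if m < target_mass else 0
        for residue_mass in RESIDUE_MASS.values():
            prev = m - residue_mass
            if prev not in hist:
                continue  # no sequence reaches this prefix mass
            for s, n in hist[prev].items():
                hist[m][s + gain] += n
    return hist[target_mass]


def e_value(hist, score, n_candidates):
    """Expected number of random candidates scoring at least `score`."""
    total = sum(hist.values())
    tail = sum(n for s, n in hist.items() if s >= score)
    return n_candidates * tail / total


# Usage: one scored peak at prefix mass 57, ten hypothetical database candidates.
hist = score_histogram(128, {57: 3})
print(sorted(hist.items()))
print(e_value(hist, 3, n_candidates=10))
```

Because each step only extends the histograms of smaller masses, the cost grows with the mass range and the number of distinct scores rather than with the number of sequences, which is what makes very large sequence counts tractable under these assumptions.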

Support Year: 7
Fiscal Year: 2010
Total Cost: $372,153
Name: National Library of Medicine
Joyce, Brendan; Lee, Danny; Rubio, Alex et al. (2018) A graphical user interface for RAId, a knowledge integrated proteomics analysis suite with accurate statistics. BMC Res Notes 11:182
Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y et al. (2018) Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry. J Am Soc Mass Spectrom 29:1721-1737
Alves, Gelio; Yu, Yi-Kuo (2016) Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics 32:2642-9
Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y et al. (2016) Identification of Microorganisms by High Resolution Tandem Mass Spectrometry with Accurate Statistical Significance. J Am Soc Mass Spectrom 27:194-210
Hamaneh, Mehdi B; Yu, Yi-Kuo (2015) DeCoaD: determining correlations among diseases using protein interaction networks. BMC Res Notes 8:226
Hamaneh, Mehdi Bagheri; Haber, Jonah; Yu, Yi-Kuo (2015) Analytical solution and scaling of fluctuations in complex networks traversed by damped, interacting random walkers. Phys Rev E Stat Nonlin Soft Matter Phys 92:052803
Alves, Gelio; Yu, Yi-Kuo (2015) Mass spectrometry-based protein identification with accurate statistical significance assignment. Bioinformatics 31:699-706
Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo (2014) Molecular Isotopic Distribution Analysis (MIDAs) with adjustable mass accuracy. J Am Soc Mass Spectrom 25:57-70
Hamaneh, Mehdi Bagheri; Yu, Yi-Kuo (2014) Relating diseases by integrating gene associations and information flow through protein interaction network. PLoS One 9:e110936
Alves, Gelio; Yu, Yi-Kuo (2014) Accuracy evaluation of the unified P-value from combining correlated P-values. PLoS One 9:e91225
