Although peptide identification has been studied intensively, assessing its statistical accuracy remains challenging. Many peptide identification methods search a database and assign an E-value to each peptide hit. However, the E-values reported by different methods do not agree with each other, and none of them agrees with the textbook definition of the E-value. This hinders combining search results from different methods, in particular when one wishes to combine methods with user-assigned weights.

Over the past year, one of our major efforts has been to develop a statistical approach that properly accounts, during data analysis, for proteotypic peptides, that is, peptides that are consistently observed in mass spectrometry based proteomics experiments. We have shown that proteotypic information improves retrieval performance, provided that it is incorporated into the database with sufficient quality control. These results were published in the Journal of Proteomics (doi:10.1016/j.jprot.2010.10.005).

Another direction we have pursued is to use the score statistics of all possible peptides in various applications. In 2008, we showed that, for a given MS spectrum and an additive scoring function, it is possible to score trillions of trillions of peptides and form the score histogram of all possible peptides. In the past year, we have put this somewhat theoretical result to practical use by re-expressing several well-known scoring functions from computational proteomics in additive form, thereby obtaining unified score statistics for those scoring functions. A main difficulty was learning how each scoring function preprocesses the query spectrum, a critical step that largely determines the final score of each candidate peptide. To overcome it, we had to examine other analysis programs and extract their heuristic filtering rules.
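To illustrate why an additive scoring function makes the all-possible-peptide score histogram tractable, the following is a minimal sketch, not the RAId_aPS implementation. It assumes integer residue masses and a hypothetical `site_score` table that awards an integer score whenever a peptide's prefix mass lands on a given value; the toy alphabet, the masses, and all names are illustrative assumptions. Because the score is a sum over fragmentation sites, dynamic programming over prefix mass counts every sequence without enumerating any of them.

```python
from collections import Counter

# Toy residue masses in integer units (illustrative assumption only).
RESIDUE_MASSES = {"G": 57, "A": 71, "S": 87, "V": 99}


def score_histogram(total_mass, site_score):
    """Count all residue sequences with the given total mass, binned by score.

    site_score maps a prefix mass to the integer score earned when a
    fragmentation site falls at that mass; missing masses score 0.
    Returns a dict mapping score -> number of sequences.
    """
    # hist[m] maps score -> number of prefixes with mass m and that score.
    hist = [Counter() for _ in range(total_mass + 1)]
    hist[0][0] = 1  # the empty prefix
    for m in range(1, total_mass + 1):
        # The full peptide mass is not a fragmentation site, so it scores 0.
        gain = site_score.get(m, 0) if m < total_mass else 0
        for residue_mass in RESIDUE_MASSES.values():
            prev = m - residue_mass
            if prev < 0:
                continue
            # Extend every prefix of mass prev by one residue.
            for score, count in hist[prev].items():
                hist[m][score + gain] += count
    return dict(hist[total_mass])


if __name__ == "__main__":
    # Mass 128 is reachable only as G+A or A+G in this toy alphabet.
    print(score_histogram(128, {57: 1}))
```

The running time grows with the precursor mass and the number of distinct scores rather than with the number of sequences, which is why the count of peptides covered can be astronomically large.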
After extracting the preprocessing rules of these programs, we built an application tool that allows (1) combining search results using all-possible-peptide score statistics and (2) reassigning E-values. Our new application, RAId_aPS, is now available on our group website, and the results have been published in PLoS One (doi:10.1371/journal.pone.0015438).

When prior knowledge is available, it is often desirable to weight search methods differently before combining their search results. In one of our 2008 publications, we provided a way to combine search results democratically, that is, with equal weights. When the weights differ, a numerical instability arises if some of the weights are nearly degenerate (nearly equal). In the past year, we have devised a mathematical framework that completely eliminates this instability. This work was recently published in PLoS One (doi:10.1371/journal.pone.0022647).
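The instability with nearly degenerate weights can be seen in the classical closed form for combining independent P-values with distinct weights. The sketch below is an illustration under stated assumptions, not the framework of the cited paper: it assumes independent P-values that are uniform under the null, uses the statistic T = sum of w_i * (-ln p_i), and the function names are hypothetical. For equal weights the result reduces to Fisher's method; for distinct weights each term carries a factor w_i / (w_i - w_j) that diverges as two weights approach each other.

```python
import math


def fisher_combined(p_values):
    """Equal-weight ("democratic") combination of independent P-values."""
    t = -sum(math.log(p) for p in p_values)
    n = len(p_values)
    # Survival function of a sum of n unit-rate exponentials, evaluated at t.
    return math.exp(-t) * sum(t**k / math.factorial(k) for k in range(n))


def weighted_combined(p_values, weights):
    """Weighted combination, valid only for pairwise distinct positive weights.

    Raises ZeroDivisionError if two weights are exactly equal.
    """
    t = sum(w * -math.log(p) for p, w in zip(p_values, weights))
    total = 0.0
    for i, wi in enumerate(weights):
        coeff = 1.0
        for j, wj in enumerate(weights):
            if j != i:
                coeff *= wi / (wi - wj)  # diverges as wi approaches wj
        total += coeff * math.exp(-t / wi)
    return total


if __name__ == "__main__":
    p = [0.1, 0.2]
    print(fisher_combined(p))
    print(weighted_combined(p, [1.0, 1.001]))
```

As two weights approach each other, the corresponding coefficients become large with opposite signs, so the sum relies on cancellation between nearly equal terms and loses floating-point precision, even though the underlying quantity tends smoothly to the equal-weight value.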

National Library of Medicine