Although it has been studied intensively, the statistical accuracy of peptide identification remains a challenge. Many peptide identification methods search a database and assign an E-value to each peptide hit, but the E-values reported by different methods do not agree with one another, and none agrees with the textbook definition of the E-value. In 2007, we developed a new database search method for peptide identification, RAId_DbS, that provides more accurate E-values (statistical significance) than other existing methods. We also showed that, in terms of information retrieval efficiency, RAId_DbS is comparable to or better than the best existing methods. In addition, we developed a new protocol to calibrate the statistics of any database search method. This protocol lets the user transform the score or E-value reported by a given search method into a standardized E-value derived from the fundamental definition of the E-value. As a consequence, it enables comparison between results obtained from different search methods or analyzed by different laboratories. Last year, we proposed a protocol for properly combining search methods and showed its effectiveness in improving retrieval accuracy. We also investigated peptide co-elution, a chromatography-induced limiting factor in identifying peptides from MS/MS experiments. We showed that, because of the limits of chromatographic separation, peptide co-elution is inevitable in a large fraction of spectra. We also studied how well current search methods deal with spectra containing multiple co-eluted peptides. This year, we focus on two important aspects of peptide identification. First, we constructed a new database incorporating documented information on the amino acid modifications and polymorphisms associated with each protein. 
When these polymorphisms/modifications are mapped to diseases, the disease information is also included. When running RAId_DbS with this database, a user may choose to enable the search of those documented polymorphisms/modifications. When a significant hit contains polymorphisms/modifications that are linked to diseases, RAId_DbS reports these diseases along with links to the original literature. We believe this knowledge integration should significantly benefit the clinical use of mass spectrometry-based screening. The implementation and results were published in BMC Genomics. The other major effort of this year's research is the construction of a universal protocol for peptide identification significance. We proposed using the score histogram of "all possible" peptides, including those that are not in the databases, as the global background from which to obtain the P-value, and from there to infer a proper E-value. To reach this goal, one needs to be able to score thousands of trillions of peptides in a short time. We have devised a dynamic programming tool to achieve this task. Without it, following this idea would take a long time for even a single MS/MS spectrum: even with a very fast computer that scores one billion peptides per second, scoring one thousand trillion peptides would still take more than 11 days. Our method enables us to construct a score histogram resulting from trillions of trillions of peptides in less than 10 seconds. This method is published in Physica A.
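
The idea of histogramming the scores of all possible peptides without enumerating them can be illustrated with a minimal sketch. It is not the published RAId implementation: the integer (nominal) residue masses, the tiny alphabet, the additive scoring rule (a peptide's score is the sum of hypothetical integer peak scores at its prefix masses), and the function names are all assumptions made for illustration. The sketch also reproduces the brute-force estimate quoted above.

```python
from collections import defaultdict

# Assumed nominal residue masses for a small illustrative alphabet.
RESIDUE_MASSES = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def score_histogram(total_mass, peak_score, residue_masses=RESIDUE_MASSES):
    """Count peptides by score over ALL sequences whose residue masses sum
    to total_mass, using dynamic programming over prefix mass.

    peak_score maps an integer prefix mass to an integer score gain
    (a hypothetical stand-in for matched fragment peaks).
    """
    # hist[m] maps score -> number of prefixes with mass m and that score.
    hist = [defaultdict(int) for _ in range(total_mass + 1)]
    hist[0][0] = 1  # the empty prefix
    for m in range(1, total_mass + 1):
        # Only proper prefixes (m < total_mass) are fragment positions.
        gain = peak_score.get(m, 0) if m < total_mass else 0
        for r_mass in residue_masses.values():
            prev = m - r_mass
            if prev < 0:
                continue
            for s, n in hist[prev].items():
                hist[m][s + gain] += n
    return dict(hist[total_mass])


def p_value(hist, score):
    """Fraction of all possible peptides scoring at least `score`."""
    total = sum(hist.values())
    tail = sum(n for s, n in hist.items() if s >= score)
    return tail / total


def brute_force_days(n_peptides, peptides_per_second):
    """Days needed to score n_peptides one at a time."""
    return n_peptides / peptides_per_second / 86400


if __name__ == "__main__":
    toy = score_histogram(128, {57: 2}, {"G": 57, "A": 71})
    print(toy, p_value(toy, 2))
    print(brute_force_days(10**15, 10**9))
```

The work grows with the parent mass, the alphabet size, and the number of distinct scores rather than with the number of peptides, which is why a histogram over an astronomically large peptide set can be built quickly. Turning the resulting P-value into an E-value requires a further assumption about the number of candidate peptides considered, which this sketch leaves out.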

Support Year: 6
Fiscal Year: 2009
Total Cost: $350,139
Name: National Library of Medicine
Joyce, Brendan; Lee, Danny; Rubio, Alex et al. (2018) A graphical user interface for RAId, a knowledge integrated proteomics analysis suite with accurate statistics. BMC Res Notes 11:182
Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y et al. (2018) Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry. J Am Soc Mass Spectrom 29:1721-1737
Alves, Gelio; Yu, Yi-Kuo (2016) Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics 32:2642-9
Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y et al. (2016) Identification of Microorganisms by High Resolution Tandem Mass Spectrometry with Accurate Statistical Significance. J Am Soc Mass Spectrom 27:194-210
Alves, Gelio; Yu, Yi-Kuo (2015) Mass spectrometry-based protein identification with accurate statistical significance assignment. Bioinformatics 31:699-706
Hamaneh, Mehdi B; Yu, Yi-Kuo (2015) DeCoaD: determining correlations among diseases using protein interaction networks. BMC Res Notes 8:226
Hamaneh, Mehdi Bagheri; Haber, Jonah; Yu, Yi-Kuo (2015) Analytical solution and scaling of fluctuations in complex networks traversed by damped, interacting random walkers. Phys Rev E Stat Nonlin Soft Matter Phys 92:052803
Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo (2014) Molecular Isotopic Distribution Analysis (MIDAs) with adjustable mass accuracy. J Am Soc Mass Spectrom 25:57-70
Hamaneh, Mehdi Bagheri; Yu, Yi-Kuo (2014) Relating diseases by integrating gene associations and information flow through protein interaction network. PLoS One 9:e110936
Alves, Gelio; Yu, Yi-Kuo (2014) Accuracy evaluation of the unified P-value from combining correlated P-values. PLoS One 9:e91225

Showing the most recent 10 out of 26 publications