Although it has been studied intensively, the statistical accuracy of peptide identification remains a challenge. Many peptide identification methods search a database and assign an E-value to each peptide hit, but the E-values reported by different methods do not agree with each other, and none of them agrees with the textbook definition of the E-value. In 2007, we developed a new database search method for peptide identification, RAId_DbS, that provides more accurate E-values (that is, statistical significance) than other existing methods. We also showed that, in terms of information retrieval efficiency, RAId_DbS is comparable to or better than the best existing methods. In addition, we developed a new protocol to calibrate the statistics of any database search method. This protocol lets the user transform the score or E-value reported by a given search method into a standardized E-value derived from the fundamental definition of the E-value. As a consequence, this protocol enables comparison of results obtained from different search methods or analyzed by different laboratories. Last year, we proposed a protocol for properly combining search methods and showed its effectiveness in improving retrieval accuracy. We also investigated a chromatography-induced limiting factor in identifying peptides from MS/MS experiments: peptide co-elution. We showed that, because of the limits of chromatographic separation, peptide co-elution is inevitable in a large fraction of spectra. We have also studied how well current search methods deal with spectra containing multiple co-eluted peptides.

This year, we focus on two important aspects of peptide identification. First, we construct a new database incorporating documented information on the amino acid modifications and polymorphisms associated with each protein.
When these polymorphisms or modifications are mapped to diseases, the disease information is also included. When running RAId_DbS with this database, a user may choose to enable the search of those documented polymorphisms and modifications. When a significant hit contains polymorphisms or modifications that are linked to diseases, RAId_DbS reports these diseases along with links to the original literature. We believe this knowledge integration should significantly benefit the clinical use of mass spectrometry based screening. The implementation and results were published in BMC Genomics.

Another major effort of this year's research is directed towards the construction of a universal protocol for peptide identification significance. We proposed to use the score histogram of "all possible" peptides, including those that are not in the databases, as the global background from which to obtain the P-value, and from there to infer a proper E-value. To reach this goal, one needs to be able to score thousands of trillions of peptides in a short time. We have devised a dynamic programming tool to achieve this task. Without this advance, analyzing even a single MS/MS spectrum in this way would take a very long time: even a very fast computer that scores one billion peptides per second would still need 10^6 seconds, more than 11 days, to score one thousand trillion (10^15) peptides. Our method enables us to construct a score histogram resulting from trillions of trillions of peptides in less than 10 seconds. This method is published in Physica A.
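
To illustrate why dynamic programming makes the "all possible peptides" background tractable, the following is a minimal sketch and not the published method. It rests on several assumptions made only for illustration: a five-residue alphabet with nominal integer masses, a toy additive score (one point for each proper prefix mass that coincides with a peak), and the hypothetical function names score_histogram, p_value and e_value. The scoring function actually used by RAId_DbS is not described here. The idea is that the table is indexed by mass and score rather than by sequence, so peptides are counted without ever being enumerated.

```python
from collections import defaultdict

# Nominal integer residue masses (Da) for a small illustrative alphabet.
# Assumption: a real search would use all residues and finer mass resolution.
RESIDUE_MASSES = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def score_histogram(target_mass, peak_masses, residue_masses=RESIDUE_MASSES):
    """Count, per score, every residue sequence whose masses sum to target_mass.

    Toy score (an assumption): a sequence earns one point for each proper
    prefix whose mass appears in peak_masses.
    """
    peaks = set(peak_masses)
    # table[m] maps score -> number of sequences with total mass m.
    table = [defaultdict(int) for _ in range(target_mass + 1)]
    table[0][0] = 1  # the empty sequence
    for m in range(target_mass):
        if not table[m]:
            continue  # no sequence reaches this mass
        for mass in residue_masses.values():
            nxt = m + mass
            if nxt > target_mass:
                continue
            # Only proper prefixes (nxt < target_mass) can earn a point.
            gain = 1 if (nxt < target_mass and nxt in peaks) else 0
            for score, count in table[m].items():
                table[nxt][score + gain] += count
    return dict(table[target_mass])


def p_value(histogram, observed_score):
    """Fraction of all counted sequences scoring at least observed_score."""
    total = sum(histogram.values())
    tail = sum(c for s, c in histogram.items() if s >= observed_score)
    return tail / total


def e_value(histogram, observed_score, n_candidates):
    """Expected number of chance hits among n_candidates database peptides.

    Assumption: candidates behave as independent draws from the background.
    """
    return n_candidates * p_value(histogram, observed_score)
```

For example, score_histogram(128, [57]) counts the two sequences GA and AG, of which only GA has a prefix mass matching the peak at 57. The work grows with the target mass, the alphabet size and the number of distinct scores, not with the number of sequences, which grows exponentially with peptide length.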