Although intensively studied, accurate statistical significance assignment in peptide/protein identification remains challenging. Many peptide identification methods search a database and assign an E-value to each peptide hit; however, the E-values reported by different methods do not agree with each other, and few of them, if any, agree with the textbook definition of the E-value. This hinders the combination of search results from different methods, in particular when one wishes to combine methods with user-assigned weights. When prior knowledge is available, it is often desirable to weight search methods differently before combining their search results. We provided a way to combine search results democratically in one of our earlier publications. When different weights are present, an instability issue occurs if some of the weights are nearly degenerate. In 2011, we devised a mathematical framework that completely eliminates this possible instability. In 2015, we designed a protein identification method that combines the weighted P-values of evidence peptides. This method solves the long-standing problem of precise type-I error control in protein identification. In addition, it correctly reports the proportion of false discoveries, an indication of accurate type-II error control. In the past year, we worked on designing a new peptide significance assignment method based on extreme value statistics. The motivation of this work is to provide accurate peptide identification confidence for methods whose scoring functions cannot be expressed as a sum of independent contributions. This new method provides a generally applicable confidence assignment for any scoring function whose score distribution falls in the basin of attraction of the extreme value distributions. The results we have obtained are very encouraging and were published in Bioinformatics this year.
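The instability caused by nearly degenerate weights can be illustrated with the classical closed form for the product of independent, weighted P-values. The following is a minimal sketch, not the framework we published: it assumes independent P-values that are uniform under the null hypothesis and pairwise-distinct positive weights, and the function name weighted_product_pvalue is hypothetical. For tau = prod(p_i ** w_i), the combined P-value is sum_i Lambda_i * tau ** (1 / w_i) with Lambda_i = prod_{j != i} w_i / (w_i - w_j).

```python
import math


def weighted_product_pvalue(pvalues, weights):
    """Combined P-value of tau = prod(p_i ** w_i) under the null hypothesis.

    Assumes independent P-values in (0, 1], uniform under the null, and
    pairwise-distinct positive weights (classical closed form only).
    """
    if not pvalues or len(pvalues) != len(weights):
        raise ValueError("need one positive weight per P-value")

    # Work with log(tau) to avoid underflow when many small P-values are combined.
    log_tau = sum(w * math.log(p) for p, w in zip(pvalues, weights))

    total = 0.0
    for i, w_i in enumerate(weights):
        coeff = 1.0
        for j, w_j in enumerate(weights):
            if j != i:
                # This factor diverges as w_i approaches w_j; terms of opposite
                # sign then nearly cancel, which is the source of the instability.
                # Exactly equal weights raise ZeroDivisionError.
                coeff *= w_i / (w_i - w_j)
        total += coeff * math.exp(log_tau / w_i)
    return total


if __name__ == "__main__":
    # Two search methods, the second trusted twice as much as the first.
    print(weighted_product_pvalue([0.5, 0.5], [1.0, 2.0]))
```

With well-separated weights the sum is well behaved. As two weights approach each other, the coefficients grow without bound and the result is dominated by floating-point cancellation, which is why a more careful treatment is needed in that regime.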
We have made the protein identification function, as well as the extreme-value-based peptide statistics, available through the RAId web service on our group website (www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_dbs/index.html).
Last year we finished the first phase of a large collaborative project, involving scientists at NHLBI and the Clinical Center, on pathogen identification using mass spectrometry. The fundamental idea is to use each pathogen's peptidome to represent that pathogen. If the statistical significance assignment is accurate, mass spectrometry analysis allows one to correctly rank species or genera according to the similarity between their peptidomes and the peptides identified. Again, we have to weight the evidence peptides associated with a given species or genus, as one peptide often maps to multiple species or genera. Our results were published this year in the Journal of the American Society for Mass Spectrometry. This year, we expanded the pathogen project to the more challenging phase II: simultaneous identification of multiple pathogens. Our preliminary results are very encouraging, and we are compiling them for our next publication in this direction.
Since different diseases might share similar causes, while similar diseases may actually come from different origins, we believe it is important to characterize disease relationships from a different perspective: that of the protein-protein interaction network. With this in mind, we downloaded all diseases, along with their associated genes, stored in the Comparative Toxicogenomics Database (CTD) and analyzed disease-disease relations based on the similarity between their protein weight vectors. Each vector is obtained by using the disease genes as the sources and sinks in an interaction network containing proteins with documented interactions.
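The comparison of two diseases through their protein weight vectors can be sketched as follows. This is only an illustration under stated assumptions: the weight vectors are taken as given (their computation on the interaction network is not shown), cosine similarity is assumed as the similarity measure, and the protein names, weights, and the function name cosine_similarity are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse protein weight vectors.

    Each vector is a dict mapping a protein identifier to a non-negative
    weight; proteins absent from a dict are treated as having weight 0.
    """
    dot = sum(weight * v.get(protein, 0.0) for protein, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical weight vectors for two diseases (illustrative values only).
    disease_a = {"P1": 1.0, "P2": 1.0}
    disease_b = {"P2": 1.0, "P3": 1.0}
    print(cosine_similarity(disease_a, disease_b))
```

Two diseases that place weight on the same proteins score close to 1, while diseases with no proteins in common score 0, regardless of whether their annotated disease genes overlap directly.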
We recently surveyed all such mechanism-based disease similarity work and have written a mini-review article in this direction. Our own results were published earlier in PLoS One. The web service DeCoaD, which lets users look for diseases similar to an input disease based on interaction networks, is available at www.ncbi.nlm.nih.gov/CBBresearch/Yu/mn/DeCoaD/index.html and was recently published, along with its implementation, in BMC Research Notes.