Although the problem has been studied intensively, achieving statistical accuracy in peptide/protein identification remains challenging. Many peptide identification methods search a database and assign an E-value to each peptide hit; however, the E-values reported by different methods do not agree with each other, and few of them, if any, agree with the textbook definition of the E-value. This hinders combining search results from different methods, particularly if one wishes to combine methods with user-assigned weights. When prior knowledge is available, it is often desirable to weight search methods differently before combining their search results. In one of our earlier publications, we provided a way to combine search results democratically, that is, with equal weights. When different weights are present, an instability issue occurs if some of the weights are nearly degenerate. In 2011, we devised a mathematical framework that completely eliminates this instability. In 2015, we designed a protein identification method that combines weighted P-values of evidence peptides. This method solves the long-standing problem of precise type-I error control in protein identification. In addition, it correctly reports the proportion of false discoveries, an indication of accurate type-II error control. In 2016, we designed a new peptide significance assignment method based on extreme value statistics. The motivation for this work is to provide accurate peptide identification confidence for methods whose scoring functions cannot be expressed as a sum of independent contributions. The method provides a generally applicable confidence assignment for any scoring function whose score distribution falls in the basin of attraction of the extreme value distributions. The results are very encouraging and were published in Bioinformatics.
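The instability with nearly degenerate weights can be seen in a minimal sketch. It is not the framework referred to above; it uses only the textbook closed forms for independent, uniformly distributed P-values: Fisher's method for equal weights, and the hypoexponential survival function for pairwise distinct weights. The function names and the example inputs are illustrative assumptions.

```python
import math

def fisher_combined_p(pvalues):
    """Equal-weight combination (Fisher's method) of independent P-values.

    With t = -sum(ln p_i) and n P-values, P = exp(-t) * sum_{k<n} t^k / k!.
    """
    t = -sum(math.log(p) for p in pvalues)
    term = total = 1.0
    for k in range(1, len(pvalues)):
        term *= t / k
        total += term
    return math.exp(-t) * total

def partial_fraction_coefficients(weights):
    """c_i = prod_{j != i} w_i / (w_i - w_j); requires pairwise distinct weights."""
    coeffs = []
    for i, wi in enumerate(weights):
        c = 1.0
        for j, wj in enumerate(weights):
            if j != i:
                c *= wi / (wi - wj)
        coeffs.append(c)
    return coeffs

def weighted_combined_p(pvalues, weights):
    """Textbook weighted combination for pairwise distinct positive weights.

    Under the null, t = -sum(w_i ln p_i) is a sum of independent exponential
    variables with means w_i, so P = sum_i c_i * exp(-t / w_i).
    """
    t = -sum(w * math.log(p) for p, w in zip(pvalues, weights))
    coeffs = partial_fraction_coefficients(weights)
    return sum(c * math.exp(-t / w) for c, w in zip(coeffs, weights))

# Illustrative inputs (assumed values, not data from this project).
p_equal = fisher_combined_p([0.1, 0.2])
p_weighted = weighted_combined_p([0.1, 0.2], [1.0, 2.0])
# Nearly degenerate weights: the coefficients grow like 1 / |w_i - w_j|,
# so the final answer comes from cancellation between very large terms.
blow_up = max(abs(c) for c in partial_fraction_coefficients([1.0, 1.000001]))
```

With two weights that differ by one part in a million, the coefficients are already of order a million with opposite signs. With more P-values and closer weights, the cancellation exhausts floating-point precision, and with exactly equal weights the formula divides by zero. This is the kind of instability a dedicated treatment has to remove.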
In the last two years we also finished the first phase of a large collaborative project, involving scientists at the NHLBI and the Clinical Center, on pathogen identification using mass spectrometry. The fundamental idea is to use each pathogen's peptidome to represent that pathogen. If the statistical significance assignment is accurate, mass spectrometry analysis allows one to correctly rank species/genera according to the similarity between their peptidomes and the peptides identified. Again, we have to weight the evidence peptides associated with a given species/genus, since one peptide often maps to multiple species/genera. Our results were published in the Journal of the American Society for Mass Spectrometry. This year, we expanded the pathogen project to the more challenging phase II: simultaneous identification of multiple pathogens and construction of an analysis pipeline that requires minimal human intervention. Our results are very encouraging, and we are compiling them for our next publication along this direction.
This year we also spent a great deal of effort making our tools accessible to researchers in the community. We have made the protein identification function and the extreme-value-based peptide statistics available in our RAId web service on our group website (www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_dbs/index.html). We have also implemented standalone graphical user interfaces for RAId and pathogen ID for users who intend to download our source code and perform analyses locally.
Because different diseases might share similar causes, while similar diseases may actually have different origins, we believe it is important to characterize disease relationships from a different perspective: that of the protein-protein interaction network.
With this in mind, we downloaded all diseases, along with their associated genes, stored in the Comparative Toxicogenomics Database (CTD) and analyzed disease-disease relations based on the similarity between their protein weight vectors. Each vector is obtained by using the disease genes as the sources and sinks in an interaction network containing proteins with documented interactions. We recently surveyed all such mechanism-based disease similarity work and wrote a mini-review article on the subject, recently published in the Journal of Rare Diseases Research and Treatment.
Induction of pluripotency in somatic cells has been a huge step forward for regenerative medicine. Many studies have shown that somatic cells can be reprogrammed into induced pluripotent stem cells (iPSCs). However, the underlying mechanism is not yet fully understood. A better understanding of the molecular mechanism of reprogramming will help generate high-quality iPSCs and hopefully increase the efficiency of induction. We have devised a model that uses a gene regulatory network in two steps. The network is first perturbed by forced overexpression of a few reprogramming factors and is driven from its initial steady state (the somatic cell) to an intermediate steady state. The perturbation is then switched off, and the system relaxes to its final iPSC state. Using the commonly used nonlinear ODEs, we derived a linear relation between the initial and final steady states. The results are very encouraging, and we are writing a manuscript for publication.
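The two-step protocol described above can be illustrated with a toy model. The sketch below is not the model developed in this project and does not reproduce the linear relation between steady states. It assumes a hypothetical two-gene mutual-repression switch with Hill-type kinetics, illustrative parameters (`a`, `n`, `drive`), and plain forward-Euler integration. It only shows how a temporary forced overexpression can move a network from one steady state to another.

```python
def hill_repression(z, a=4.0, n=2):
    """Production rate of a gene repressed by a factor at level z (assumed form)."""
    return a / (1.0 + z ** n)

def simulate(state, drive, t_total=40.0, dt=0.01):
    """Integrate the toy switch with forward Euler and return the end state.

    x: level of a hypothetical somatic-program gene
    y: level of a hypothetical reprogramming factor; `drive` is its forced
       overexpression rate (0.0 means the perturbation is switched off).
    """
    x, y = state
    for _ in range(int(round(t_total / dt))):
        dx = hill_repression(y) - x
        dy = hill_repression(x) - y + drive
        x, y = x + dt * dx, y + dt * dy
    return x, y

# Initial steady state: x high, y low (stands in for the somatic cell).
somatic = simulate((4.0, 0.0), drive=0.0)
# Step 1: forced overexpression drives the system to an intermediate steady state.
intermediate = simulate(somatic, drive=5.0)
# Step 2: the perturbation is switched off and the system relaxes.
final = simulate(intermediate, drive=0.0)

print("somatic      x=%.3f y=%.3f" % somatic)
print("intermediate x=%.3f y=%.3f" % intermediate)
print("final        x=%.3f y=%.3f" % final)
```

In this toy switch the final state mirrors the initial one: after the drive is removed, the system does not return to where it started but settles in the other stable state. That is the qualitative behavior the two-step description relies on.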

Support Year: 14
Fiscal Year: 2017
Name: National Library of Medicine
Joyce, Brendan; Lee, Danny; Rubio, Alex et al. (2018) A graphical user interface for RAId, a knowledge integrated proteomics analysis suite with accurate statistics. BMC Res Notes 11:182
Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y et al. (2018) Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry. J Am Soc Mass Spectrom 29:1721-1737
Alves, Gelio; Yu, Yi-Kuo (2016) Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics 32:2642-9
Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y et al. (2016) Identification of Microorganisms by High Resolution Tandem Mass Spectrometry with Accurate Statistical Significance. J Am Soc Mass Spectrom 27:194-210
Hamaneh, Mehdi B; Yu, Yi-Kuo (2015) DeCoaD: determining correlations among diseases using protein interaction networks. BMC Res Notes 8:226
Hamaneh, Mehdi Bagheri; Haber, Jonah; Yu, Yi-Kuo (2015) Analytical solution and scaling of fluctuations in complex networks traversed by damped, interacting random walkers. Phys Rev E Stat Nonlin Soft Matter Phys 92:052803
Alves, Gelio; Yu, Yi-Kuo (2015) Mass spectrometry-based protein identification with accurate statistical significance assignment. Bioinformatics 31:699-706
Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo (2014) Molecular Isotopic Distribution Analysis (MIDAs) with adjustable mass accuracy. J Am Soc Mass Spectrom 25:57-70
Hamaneh, Mehdi Bagheri; Yu, Yi-Kuo (2014) Relating diseases by integrating gene associations and information flow through protein interaction network. PLoS One 9:e110936
Alves, Gelio; Yu, Yi-Kuo (2014) Accuracy evaluation of the unified P-value from combining correlated P-values. PLoS One 9:e91225
