Assigning statistical significance accurately has become increasingly important as metadata of many types, often organized in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy in metadata at any level may propagate to downstream analyses and undermine the validity of the scientific conclusions drawn from them. For mass spectrometry-based proteomics, this implies that accurate statistics must first be obtained in peptide identification; building on that, one can hope to develop protein identification methods with accurate statistical significance assignment. However, although it has been studied intensively, the statistical accuracy of peptide and protein identification remains challenging. Many peptide identification methods use database searches and assign E-values to peptide hits, but the E-values reported by different methods do not agree with each other, and few of them, if any, agree with the textbook definition of the E-value (the expected number of chance hits scoring at least as well as the hit in question). This hinders combining search results from different methods, particularly if one wishes to combine methods with user-assigned weights. When prior knowledge is available, it is often desirable to weight search methods differently before combining their search results. In our earlier publications, we developed peptide identification methods with accurate statistical significance assignment, founded on an extension of the central limit theorem and on all-possible-peptide statistics, and we provided a way to combine search results democratically. When different weights are present, an instability arises if some of the weights are nearly degenerate; we have devised a mathematical framework that completely eliminates this instability. We have recently designed a protein identification method that combines weighted P-values of evidence peptides.
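To illustrate what combining weighted P-values involves, and why nearly degenerate weights cause trouble, the following minimal sketch uses a classical closed form for the weighted product of independent P-values. It is not the framework or the protein identification method described here. The function name `weighted_product_pvalue` is hypothetical, and the sketch assumes independent P-values that are uniform under the null hypothesis, with strictly positive and pairwise distinct weights.

```python
import math


def weighted_product_pvalue(pvalues, weights):
    """Combined P-value for Q = prod(p_i ** w_i) under the null hypothesis.

    Assumes independent, uniformly distributed null P-values and strictly
    positive, pairwise distinct weights (illustrative assumptions only).
    Uses the closed form  P(Q <= q) = sum_i c_i * q ** (1 / w_i)  with
    c_i = w_i ** (n - 1) / prod_{j != i} (w_i - w_j).
    """
    if not pvalues or len(pvalues) != len(weights):
        raise ValueError("need equally many P-values and weights")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    if any(w <= 0.0 for w in weights):
        raise ValueError("weights must be positive")
    if len(set(weights)) != len(weights):
        raise ValueError("this closed form requires distinct weights")

    n = len(weights)
    # log of the weighted product statistic Q
    log_q = sum(w * math.log(p) for p, w in zip(pvalues, weights))

    total = 0.0
    for i, w_i in enumerate(weights):
        # Denominator shrinks toward zero as any two weights approach each other.
        denom = math.prod(w_i - w_j for j, w_j in enumerate(weights) if j != i)
        total += w_i ** (n - 1) / denom * math.exp(log_q / w_i)
    return total


combined = weighted_product_pvalue([0.1, 0.04], [1.0, 0.5])
```

For weights 1.0 and 0.5 the formula reduces to 2q - q^2 with q = p1 * p2^0.5. As two weights approach each other, the coefficients grow without bound and nearly cancel, which is the kind of instability noted above for nearly degenerate weights; this sketch simply rejects equal weights rather than addressing that case.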
This protein identification method solves the long-standing problem of precise type-I error control in protein identification. It also correctly reports the proportion of false discoveries, an indication of accurate type-II error control. In 2016, we worked on designing a new peptide significance assignment method based on extreme value statistics. The motivation is to provide accurate peptide identification confidence for methods whose scoring functions cannot be expressed as a sum of independent contributions. The method provides a generally applicable confidence assignment for any scoring function whose score distribution falls in the basin of attraction of the extreme value distributions. The results are very encouraging and were published in Bioinformatics. In past years we have also worked on a large collaborative project, involving scientists at NHLBI and the Clinical Center, on pathogen identification using mass spectrometry. The fundamental idea is to use each pathogen's peptidome to represent that pathogen. If the statistical significance assignment is accurate, mass spectrometry analysis allows one to correctly rank species or genera according to the similarity between their peptidomes and the peptides identified. Again, the evidence peptides associated with a given species or genus must be weighted, because one peptide often maps to multiple species or genera. The first-phase results were recently published in the Journal of the American Society for Mass Spectrometry. Over the past two years, we expanded the pathogen project to the more challenging phase II: simultaneous identification of multiple pathogens and construction of an analysis pipeline that requires minimal human intervention. Our results are encouraging and were published this year in the Journal of the American Society for Mass Spectrometry. This year we also devoted a great deal of effort to making our tools accessible to researchers in the community.
We have made the protein identification function and the extreme-value-based peptide statistics available through the RAId web service on our group website, www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid/raid.html. We are preparing an application note to publicize these tools. We have also implemented standalone graphical user interfaces for RAId and pathogen ID for users who wish to download our source code and perform analyses locally. The note describing our graphical user interface was recently published in BMC Research Notes.
Induction of pluripotency in somatic cells has been a major step forward for regenerative medicine. Many studies have shown that somatic cells can be reprogrammed into induced pluripotent stem cells (IPSCs). However, the underlying mechanism is not yet fully understood. A better understanding of the molecular mechanism of reprogramming will help generate high-quality IPSCs and, we hope, increase the efficiency of induction. We have devised a model that drives a gene regulatory network in two steps. The network is first perturbed by forced overexpression of a few reprogramming factors, which drives it from the initial steady state (the somatic cell) to an intermediate steady state. The perturbation is then switched off, and the system relaxes to its final IPSC state. Using commonly used nonlinear ODEs, we derived a linear relation between the initial and final steady states. The results are very encouraging, and we have just submitted the manuscript for publication.
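The two-step protocol can be illustrated with a toy example. The sketch below is not the model in the manuscript and does not reproduce the linear relation between steady states. It assumes a hypothetical two-gene network with sigmoidal regulation, dx/dt = -x + sigmoid(W x + b + u), where the matrix `W`, the bias `b`, the overexpression input `u`, and the helper `relax` are all invented for illustration, and it integrates the equations with a simple forward Euler scheme.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relax(x0, W, b, u, dt=0.01, steps=5000):
    """Integrate dx/dt = -x + sigmoid(W x + b + u) by forward Euler.

    Returns the state after steps * dt time units, long enough for this
    toy network to settle into a steady state.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x += dt * (-x + sigmoid(W @ x + b + u))
    return x


# Hypothetical mutually activating two-gene network (bistable: low and high).
W = np.array([[5.0, 5.0],
              [5.0, 5.0]])
b = np.array([-5.0, -5.0])
no_input = np.zeros(2)
overexpress = np.array([4.0, 0.0])  # forced overexpression of gene 0 only

# Initial steady state: both genes low (stand-in for the somatic cell).
somatic = relax(np.zeros(2), W, b, no_input)

# Step 1: perturbation on, system moves to an intermediate steady state.
intermediate = relax(somatic, W, b, overexpress)

# Step 2: perturbation off, system relaxes to its final steady state.
final = relax(intermediate, W, b, no_input)

print("initial     :", somatic)
print("intermediate:", intermediate)
print("final       :", final)
```

In this toy network the unperturbed system has a low and a high stable state. The transient input carries the system out of the low state, and after the input is removed it settles in the high state instead of returning to where it started, mirroring the somatic-to-intermediate-to-final sequence described above.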