The main goal of this proposal is to develop novel and improved statistical methods for analyzing high-dimensional proteomic data generated by mass spectrometers. These data usually consist of spectra, each with thousands of features. These features, however, contain both true protein/peptide signals and noise. The proposal focuses on three interconnected, sequential goals: 1) separating the true peaks from chemical noise in a mass spectrum using statistical modeling and hypothesis testing; 2) comprehensively evaluating and aggregate-ranking a number of classification techniques for classifying case and control samples from proteomic profiles, and constructing an adaptive classifier that is expected to perform better than the individual classifiers under an ensemble of performance measures; and 3) constructing, by reverse engineering, a protein-protein association network from the truly classifying peaks in a case-control study. The overall and ultimate goal of the proposed research is to study the performance of these three pieces applied in sequence, in order to understand the inner workings of proteins in a case-control study based on mass spectrometry data.
High-throughput proteomic profiling using mass spectrometry has enormous potential in scientific and biomedical research. Identifying proteomic biomarkers for complex diseases and conditions such as cancer, acute renal disorder, and fetal alcohol syndrome from easily available bodily fluids such as blood, plasma, urine, amniotic fluid, and serum could be very beneficial. These biomarkers are expected to be much more sensitive and specific than existing ones and hence better suited to early detection and prevention of such diseases and conditions. Proteomic signature profiling can also be used to quickly identify different biological agents (for example, anthrax), an application that demonstrates its relevance to homeland security. Similarly, proteomic profiling of the bodily fluids of subjects exposed to different environmental toxins can be useful. However, the complexity of these data poses new statistical challenges for their analysis, so proper analytic tools are much needed to make full use of them. The proposed research is expected to make a significant contribution to this relatively new area of research. Last but not least, the analytical and computational tools developed for this project can be used to analyze other types of high-dimensional data.
This NSF-funded research led to novel statistical methodology for analyzing Matrix-Assisted Laser Desorption/Ionization Time-of-Flight (MALDI-TOF) and Surface-Enhanced Laser Desorption/Ionization Time-of-Flight (SELDI-TOF) mass spectrometry proteomics data. These new methods are expected to overcome some of the criticisms (such as poor reproducibility) of results obtained with these technologies. Careful use of these technologies, together with the statistical tools that emerged from this research, enables consistent identification of new biomolecules in real time. Applications of this type of proteomics-based sample classification include bioterrorism response (e.g., confirmation and identification of bacterial types from a suspected specimen) and human medicine (disease biomarkers). The work will thus potentially have a profound impact on homeland security, human medicine, and public health. Moreover, these methodologies can be adapted to other high-dimensional data, including data from engineering. Association-based determination of the protein-protein interaction network, and of how it differs between cases and controls, can help identify the mechanisms of complex diseases.
The PI has an established track record of directing graduate students. The student support (graduate research assistantship) from NSF helped her current and future students focus on dissertation research related to some of these specific aims. Three students directly involved in the scientific research of this project obtained employment soon after completing the PhD, without any difficulty. A fourth student is well underway in her research in a closely related area of computational proteomics. In addition, some of these computational methods were incorporated into the course "High-throughput Data Analysis" taught by the PI in the doctoral program at the University of Louisville. Open-source R software developed by the PI and her students is distributed to the scientific community.
It is anticipated that users will be able to analyze proteomics data without any commercial software packages and to produce reproducible results useful for biomedical, public health, and engineering research. Last but not least, the research resulted in at least thirteen peer-reviewed articles, two book chapters, and one conference proceedings paper. The PI and her students also delivered many seminars, colloquium presentations, and scientific talks at professional meetings to disseminate the results obtained from this project.