In 2013, over 1.6 million new cases of cancer are expected to be diagnosed and over 580,000 people are expected to die of the disease. Thus, continued research into the identification of new diagnostic and prognostic biomarkers of cancer is necessary. Although cancer is widely recognized as a genomic disease, the directives of the DNA-based drivers are executed at the level of proteins and their biological functions, and the application of potential protein-level biomarkers remains a compelling vision. Accordingly, NCI and other research centers have made a large investment in high-throughput global proteomics experiments to mine for novel biomarkers of cancer. However, few of these markers have come to fruition. We believe that one of the major challenges to the discovery of robust protein- or pathway-biomarker candidates from these large and complex proteomics datasets is the use of naive data analysis approaches that do not take into account the underlying complexity of the proteome (e.g., splice variants, post-translational modifications). State-of-the-art statistical algorithms that account for the design of the experiment have been developed to improve quality assessment, peptide and protein quantification, and pathway modeling; however, access to these methodologies by the larger community is hindered because they are in the prototype stage and typically require knowledge of statistical programming. Furthermore, the likelihood of these tools moving to robust software is low because they are developed within the context of existing grants that do not support the transition from prototype to software. For the field of clinical proteomics to successfully identify new mechanistic etiologies of cancer, it requires not only high-quality data from the instrument, but also high-quality statistical analysis of those data.
This project proposes new informatics technology in the form of a robust, interactive, and cross-platform software environment that will enable biomedical and biological scientists to perform in-depth analyses of global proteomics data, from quality assessment and normalization of raw inferred abundances (e.g., peak area) to the identification of protein biomarkers and enriched pathways. The software will be developed in a single programming language (Java) to ensure easy installation across platforms, with wizard-based data entry and advanced data reporting. Java will also support the development of advanced graphical user interfaces for data presentation and interactive graphics with a modern look and feel. This approach will ensure that scientists outside of the development institution can develop modules to include in the software, or extensions for data integration, without the challenge of recompiling the application. The software modules to be developed under this project are Aim 1) peptide- and protein-level quality assessment and quantification, Aim 2) protein biomarker discovery via exploratory data analysis and machine learning, and Aim 3) pathway biomarker discovery through integration with the NCI Protein Interaction Database.
For the past decade, cancer researchers have been using global proteomics analyses to extensively categorize proteins and other molecular species in hopes of identifying distinctive features of cancer cells that not only explain the biology, but also enable better patient care. Despite these investments, relatively few protein biomarkers have achieved clinical validation, largely due to naive data analysis strategies used in the protein quantification and statistical validation of candidate biomarkers. This project will develop a robust, user-friendly software environment that builds upon state-of-the-art statistical algorithms focused on addressing the underlying proteome complexity associated with cancer.