Dramatic improvements in data collection and acquisition technologies over the past decades have enabled scientists to collect vast amounts of health-related data from biomedical studies. If analyzed properly, these data will expand our knowledge and allow new hypotheses about disease management to be tested, from diagnosis to prevention to personalized treatment. However, biomedical data can be rather complex, and analyzing them poses many challenges for existing methods. This proposal attempts to address three fundamental challenges: (i) Missing data are ubiquitous in biomedical research; how can complex biomedical data be used fully in the presence of missing values? (ii) Growing data size typically brings growing complexity in the patterns in the data and in the models needed to account for those patterns; what is a general recipe for estimating the parameters of complex models? (iii) Biomarker identification from high-throughput omics data has been one of the major focuses of cancer research, yet despite intense effort, the number of biomarkers approved by the FDA each year for clinical use is still in the single digits. An important factor contributing to this failure is the lack of appropriate statistical methods for analyzing such heterogeneous and high-dimensional data. Toward fuller use of complex biomedical data, this project proposes an imputation-consistency algorithm as a general algorithm for high-dimensional missing data problems. The algorithm is then extended to address the other two challenges under the principles of conditioning and consistency; in particular, this project proposes highly efficient and effective statistical algorithms that address the heterogeneity and high-dimensionality issues encountered in biomarker identification and eQTL analysis.
The proposed algorithms are applied to (i) select anticancer drug-sensitive genes using the CCLE and SANGER data, (ii) identify prognostic mRNA biomarkers for multiple types of cancer using the TCGA data, (iii) conduct eQTL analysis for multiple types of cancer using the TCGA data, and (iv) identify informative circulating biomarkers for type 1 diabetes. The proposed methods are highly efficient and general, and can be applied to other types of disease as well. Statistically, this project aims to develop general, effective, and highly efficient algorithms for complex data analysis; biomedically, it aims to significantly improve the accuracy of biomarker identification from omics data, which will advance the understanding of molecular mechanisms and the development of precision medicine.

Public Health Relevance

Successful completion of this project will generate hands-on tools for the analysis of complex biomedical data and identify biomarkers with potential clinical use for type 1 diabetes and multiple cancers. This will improve our understanding of the mechanisms of complex diseases and our ability to predict disease risk and prognosis, and will accelerate the integration of biomarkers into clinical trials and the development of personalized medicine, which ultimately will enhance our public health system and improve patient care.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Project #
Application #
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Brazhnik, Paul
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Purdue University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
West Lafayette
United States
Zip Code