The use of genomic measures for precision medicine will depend critically on our ability to identify genes whose expression impacts the initiation, progression, and severity of common diseases such as sporadic cancer. A multitude of powerful computational and statistical methods have been developed over the last 20 years to assist with this endeavor. However, the vast majority of these approaches assess model quality using prediction error or related measures such as sensitivity and specificity. These measures are important but do not capture other aspects of model quality that may be meaningful to biomedical researchers and physicians. We propose here to develop a comprehensive approach to modeling genomics data that takes into consideration multiple objective and subjective measures of model quality simultaneously. It is our working hypothesis that multiobjective methods will yield results that are more consistent, more reproducible, and of greater clinical impact. Specifically, we will develop a novel Hierarchical Pareto Optimization (HiParOp) algorithm that is capable of integrating multiple quality criteria for a given computational model of gene expression and clinical outcomes (AIM 1). This approach will first be validated with simulated gene expression data that reflect the hierarchical complexity of cancer. We will then evaluate the HiParOp algorithm by applying it to several well-studied and well-characterized breast cancer data sets that have led to diagnostic tests and new drug targets (AIM 2). Here, we will include a broad set of model quality measures, including traditional objective measures such as the cohesiveness or distinctiveness of tumor clusters as well as subjective measures such as clinical relevance and druggability. Experience gained from applying HiParOp to a well-studied cancer in which significant progress has been made will be used to further refine the algorithm.
We will then apply the HiParOp approach to the genomic analysis of non-small cell lung cancer (NSCLC), where there is substantial opportunity for improved diagnosis and treatment. We will analyze several carefully conducted gene expression studies of NSCLC tissue (AIM 3). Finally, we will develop and release an R package that will allow others to easily implement the HiParOp method (AIM 4).
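To illustrate what scoring a model on multiple criteria simultaneously can look like, the following is a minimal sketch of plain Pareto dominance filtering. It is not the HiParOp algorithm, whose hierarchical design is not specified here. The model names, the three criteria, and all scores are hypothetical, and every criterion is assumed to be scaled so that higher is better.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every criterion
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(models: Dict[str, Scores]) -> List[str]:
    """Return the names of models that no other model dominates."""
    return [
        name
        for name, scores in models.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in models.items()
            if other != name
        )
    ]


# Hypothetical candidates scored as
# (accuracy, cluster distinctiveness, clinical relevance rating).
candidates = {
    "model_a": (0.90, 0.40, 0.30),
    "model_b": (0.85, 0.70, 0.60),
    "model_c": (0.80, 0.60, 0.50),  # worse than model_b on every criterion
    "model_d": (0.70, 0.90, 0.20),
}

print(pareto_front(candidates))
```

In this sketch, model_c is dropped because model_b is better on all three criteria, while the remaining models each represent a different trade-off that an error-only ranking would not reveal.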
The use of genomic measures for precision medicine will depend critically on our ability to identify genes whose expression impacts the initiation, progression, and severity of common diseases such as sporadic cancer. Current approaches for computational analysis focus on prediction error as a measure of model quality. We propose here to develop a comprehensive approach to modeling genomics data that takes into consideration multiple objective and subjective measures of model quality simultaneously.
Moore, Jason H; Shestov, Maksim; Schmitt, Peter et al. (2018) A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods. Pac Symp Biocomput 23:259-267
Piette, Elizabeth R; Moore, Jason H (2018) Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV). BioData Min 11:6
Causey, Jason L; Ashby, Cody; Walker, Karl et al. (2018) DNAp: A Pipeline for DNA-seq Data Analysis. Sci Rep 8:6793
Causey, Jason L; Zhang, Junyu; Ma, Shiqian et al. (2018) Highly accurate model for prediction of lung nodule malignancy with CT scans. Sci Rep 8:9286
Olson, Randal S; La Cava, William; Mustahsan, Zairah et al. (2018) Data-driven advice for applying machine learning to bioinformatics problems. Pac Symp Biocomput 23:192-203